Sync dotnet CsvFileReader with Python multiline-field fix #2

Merged
sharpninja merged 2 commits into main from copilot/sync-dotnet-with-python-changes
Feb 28, 2026

Copilot AI commented Feb 28, 2026

Description

Syncs the .NET codebase with Python changes merged since the last dotnet commit (Feb 23). The primary functional gap was CsvFileReader silently corrupting multiline quoted CSV fields by parsing line-by-line — matching the Python fix in graphrag-input microsoft#2248.

Related Issues

Syncs with upstream Python commits:

Proposed Changes

dotnet/src/GraphRag.Input/CsvFileReader.cs

  • Replaced ReadLineAsync loop + single-line ParseCsvLine with a full-content ParseCsvContent character-by-character parser
  • Now correctly handles: embedded \n/\r\n in quoted fields, "" escape sequences, commas inside quoted values
  • Added guards: all-empty header row → return empty; all-empty trailing row → skip
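
The parsing approach in the bullets above can be sketched as follows. This is a minimal illustration written from the description, not the code in CsvFileReader.cs: the class name CsvSketch, the method name Parse, and the exact blank-line and trailing-row rules are assumptions made for the example.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical sketch of a full-content, character-by-character CSV parser.
public static class CsvSketch
{
    public static List<string[]> Parse(string content)
    {
        var rows = new List<string[]>();
        var fields = new List<string>();
        var field = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < content.Length; i++)
        {
            char c = content[i];
            if (inQuotes)
            {
                if (c != '"')
                {
                    field.Append(c); // commas and newlines inside quotes are data
                }
                else if (i + 1 < content.Length && content[i + 1] == '"')
                {
                    field.Append('"'); // "" inside quotes is one literal quote
                    i++;
                }
                else
                {
                    inQuotes = false; // closing quote
                }
            }
            else if (c == '"')
            {
                inQuotes = true;
            }
            else if (c == ',')
            {
                fields.Add(field.ToString());
                field.Clear();
            }
            else if (c == '\r' || c == '\n')
            {
                // Consume \r\n as a single line ending.
                if (c == '\r' && i + 1 < content.Length && content[i + 1] == '\n')
                {
                    i++;
                }
                fields.Add(field.ToString());
                field.Clear();
                // Assumption: a blank line (one empty field) is not a row.
                bool blankLine = fields.Count == 1 && fields[0].Length == 0;
                if (!blankLine)
                {
                    rows.Add(fields.ToArray());
                }
                fields.Clear();
            }
            else
            {
                field.Append(c);
            }
        }

        // Flush the last row when the content does not end with a newline,
        // skipping it if every field is empty.
        if (field.Length > 0 || fields.Count > 0)
        {
            fields.Add(field.ToString());
            if (!fields.TrueForAll(string.IsNullOrEmpty))
            {
                rows.Add(fields.ToArray());
            }
        }
        return rows;
    }
}
```

Because the quote state is tracked across the whole content rather than per line, a newline only ends a row when the parser is outside a quoted field. Header handling and mapping rows to documents are left out of the sketch.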

Before, this CSV would silently produce wrong output:

title,text
"Post 1","Line one.
Line two.
Line three."
"Post 2","Single line."

After, docs[0].Text == "Line one.\nLine two.\nLine three." — matching Python csv.DictReader behaviour.
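
To see why a line-by-line reader breaks on the sample above, it helps to look at what a plain newline split produces. The snippet below is only an illustration; LineSplitDemo and its members are hypothetical names and do not appear in the pull request.

```csharp
using System;
using System.Linq;

// Hypothetical demo: what a reader sees when it splits CSV content on '\n'.
public static class LineSplitDemo
{
    public static string[] SplitLines(string csv) =>
        csv.Split('\n', StringSplitOptions.RemoveEmptyEntries);

    // An odd number of '"' characters means the line starts or ends
    // in the middle of a quoted field.
    public static bool HasUnbalancedQuotes(string line) =>
        line.Count(ch => ch == '"') % 2 == 1;
}
```

For the five-line sample, SplitLines returns five strings for what is really one header and two records, and HasUnbalancedQuotes is true for the pieces `"Post 1","Line one.` and `Line three."`. A parser that handles each line in isolation has no way to rejoin them.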

dotnet/tests/GraphRag.Tests.Unit/Input/CsvFileReaderTests.cs (new)

  • 6 tests: basic CSV, multiline quoted field (mirrors Python test_csv_loader_preserves_multiline_fields), comma-inside-quote, escaped double-quote, empty content, no-match

Checklist

  • I have tested these changes locally.
  • I have reviewed the code changes.
  • I have updated the documentation (if necessary).
  • I have added appropriate unit tests (if applicable).

Additional Notes

The vector store batching changes from microsoft#2251 (LanceDB, Azure AI Search, CosmosDB) required no dotnet changes — the existing implementations already use equivalent batch semantics. LanceDB remains a NotImplementedException stub pending an official .NET SDK.


…2248)

Co-authored-by: sharpninja <16146732+sharpninja@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Sync dotnet code with python changes since last commit" to "Sync dotnet CsvFileReader with Python multiline-field fix" Feb 28, 2026
@sharpninja sharpninja marked this pull request as ready for review February 28, 2026 20:46
Copilot AI review requested due to automatic review settings February 28, 2026 20:46
@sharpninja sharpninja merged commit f4e1001 into main Feb 28, 2026
17 of 24 checks passed
Copilot AI left a comment

Pull request overview

Syncs the .NET CsvFileReader behavior with the Python multiline-CSV fix by switching from line-by-line parsing to a full-content parser so quoted fields can safely contain embedded newlines.

Changes:

  • Replaced the line-based CSV parsing loop with a character-by-character ParseCsvContent implementation to support quoted multiline fields and escaped quotes.
  • Added guards for empty headers / trailing content and skipped simple blank lines.
  • Added a new unit test suite for CSV parsing scenarios (multiline fields, commas-in-quotes, escaped quotes, empty inputs, no matches).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| dotnet/src/GraphRag.Input/CsvFileReader.cs | Implements full-content CSV parsing to preserve multiline quoted fields and improve handling of quotes/commas. |
| dotnet/tests/GraphRag.Tests.Unit/Input/CsvFileReaderTests.cs | Adds unit coverage for CSV parsing, including multiline quoted-field preservation. |
Comments suppressed due to low confidence (1)

dotnet/src/GraphRag.Input/CsvFileReader.cs:137

  • ParseCsvContent adds a row on every \r/\n boundary without checking whether the row is entirely empty. This means a trailing row like ",," (or any all-empty row terminated by a newline) will be kept and later turned into an empty TextDocument, even though the method’s final flush explicitly skips all-empty trailing rows. Consider skipping row creation when all collected fields are empty/whitespace (e.g., check fields.TrueForAll(string.IsNullOrWhiteSpace) before rows.Add), and add a unit test covering a trailing all-empty row with multiple columns.
            else if (c == '\r' || c == '\n')
            {
                // End of row — consume \r\n as a single line ending.
                fields.Add(field.ToString());
                field.Clear();
                rows.Add([.. fields]);
                fields.Clear();
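
The guard suggested in the comment above could look roughly like the sketch below. It is not part of the merged change; RowGuardSketch and TryAddRow are hypothetical names, and the parameters stand in for the rows and fields collections from the excerpt.

```csharp
using System.Collections.Generic;

// Hypothetical helper illustrating the suggested all-empty-row check.
public static class RowGuardSketch
{
    // Appends the collected fields as a row only when at least one field
    // has non-whitespace content. Returns true if a row was added.
    public static bool TryAddRow(List<string[]> rows, List<string> fields)
    {
        if (fields.TrueForAll(string.IsNullOrWhiteSpace))
        {
            return false; // e.g. a ",," line yields three empty fields
        }
        rows.Add(fields.ToArray());
        return true;
    }
}
```

One trade-off to weigh: this check also drops all-empty or whitespace-only rows in the middle of a file, not just at the end, which may or may not match what Python's csv.DictReader does with the same input.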


3 participants