fix: make ResponseToFileExtractor respect configured delimiter instead of hardcoding comma#928
Conversation
…d of hardcoding comma ResponseToFileExtractor was using pd.read_csv(dialect='unix') which hardcodes a comma delimiter, ignoring the configured download_decoder's CsvDecoder delimiter setting. This caused tab-separated (TSV) files to be parsed as a single column. Changes: - Add 'delimiter' parameter to ResponseToFileExtractor (default: ',') - Replace dialect='unix' with explicit CSV options + configurable delimiter - Add delimiter property to declarative component schema - Pass delimiter from model to component in factory - Add tests for tab-delimited CSV parsing Related: #909 (gzip decompression fix for ResponseToFileExtractor) Co-Authored-By: Ilja Herdt <ilja.herdt@airbyte.io>
🤖 Devin AI EngineerI'll be helping with this pull request! Here's what you should know: ✅ I will automatically:
Note: I can only respond to comments from users who have write access to this repository. ⚙️ Control Options:
|
👋 Greetings, Airbyte Team Member!Here are some helpful tips and reminders for your convenience. 💡 Show Tips and TricksTesting This CDK VersionYou can test this version of the CDK using the following: # Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1772576576-fix-response-to-file-extractor-delimiter#egg=airbyte-python-cdk[dev]' --help
# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1772576576-fix-response-to-file-extractor-delimiterPR Slash CommandsAirbyte Maintainers can execute the following slash commands on your PR:
|
Co-Authored-By: Ilja Herdt <ilja.herdt@airbyte.io>
📝 WalkthroughWalkthroughThis change introduces configurable CSV delimiter support to ResponseToFileExtractor, allowing users to specify custom delimiters (defaulting to comma) for parsing CSV data. The delimiter parameter is added across the schema definition, model, implementation, factory, and test suite with consistent application throughout. Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)
2386-2389: Could we switch to an explicitNonecheck for delimiter fallback, wdyt?On Line 2388,
model.delimiter or ","will also coerce an explicitly provided empty string to comma. Would you prefer preserving non-Nonevalues so invalid inputs fail fast instead of silently changing behavior?Suggested tweak
return ResponseToFileExtractor( parameters=model.parameters or {}, - delimiter=model.delimiter or ",", + delimiter=model.delimiter if model.delimiter is not None else ",", )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py` around lines 2386 - 2389, The return that constructs ResponseToFileExtractor currently uses the truthy fallback "model.delimiter or ','", which will replace an explicit empty-string delimiter; change it to an explicit None check so only None falls back to comma (e.g., use model.delimiter if model.delimiter is not None else ","), leaving other values (including "") unchanged; update the call site where ResponseToFileExtractor is instantiated (the code that references model.parameters and model.delimiter) to use this None-check pattern and keep the parameters fallback as-is.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py`:
- Around line 2386-2389: The return that constructs ResponseToFileExtractor
currently uses the truthy fallback "model.delimiter or ','", which will replace
an explicit empty-string delimiter; change it to an explicit None check so only
None falls back to comma (e.g., use model.delimiter if model.delimiter is not
None else ","), leaving other values (including "") unchanged; update the call
site where ResponseToFileExtractor is instantiated (the code that references
model.parameters and model.delimiter) to use this None-check pattern and keep
the parameters fallback as-is.
ℹ️ Review info
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
airbyte_cdk/sources/declarative/declarative_component_schema.yamlairbyte_cdk/sources/declarative/extractors/response_to_file_extractor.pyairbyte_cdk/sources/declarative/models/declarative_component_schema.pyairbyte_cdk/sources/declarative/parsers/model_to_component_factory.pyunit_tests/sources/declarative/extractors/test_response_to_file_extractor.py
PyTest Results (Full)3 876 tests 3 864 ✅ 11m 18s ⏱️ Results for commit f846554. |
Summary
ResponseToFileExtractorhardcodeddialect="unix"in itspd.read_csv()call, which forces a comma delimiter regardless of the user's configuration. When used as thedownload_extractorin async streams, the configureddownload_decoder's CsvDecoder delimiter (e.g.\tfor TSV files) was completely ignored, causing all columns to be merged into a single field.This PR adds a
delimiterparameter toResponseToFileExtractorand replacesdialect="unix"with explicit CSV options that respect the configured delimiter. The factory is updated to pass the delimiter from the declarative model to the component.Changes:
response_to_file_extractor.py— Adddelimiterfield (default","), decode escaped delimiters (e.g.\\t→\t), replacedialect="unix"with explicit equivalent options + configurable delimiterdeclarative_component_schema.yaml— Adddelimiterproperty toResponseToFileExtractormodel_to_component_factory.py— Passdelimiterfrom model to componentdeclarative_component_schema.py— Regenerated viapoetry run poe assembletest_response_to_file_extractor.py— Add tests for tab, escaped-tab, default comma, and comma-in-values scenariosRelated: #909 (gzip decompression fix for ResponseToFileExtractor)
Review & Testing Checklist for Human
dialect="unix"replacement is behaviorally equivalent for existing comma-delimited connectors. The unix dialect setsquotechar='"',doublequote=True,lineterminator='\n',quoting=QUOTE_ALL. The new code sets all of these exceptquotechar(which defaults to'"'in pandas). Consider whetherquotechar='"'should be set explicitly for safety.__post_init__:self.delimiter.startswith("\\")triggersunicode_escapedecoding. Confirm this handles all YAML-serialized delimiter values correctly and doesn't misfire on unexpected inputs.ResponseToFileExtractorwith a tab-delimited download (e.g. the App Store Connect analytics connector from the original bug report) to confirm the delimiter flows through from the manifest YAML all the way to parsed records.declarative_component_schema.py— most changes are formatting-only from the code generator re-running. Confirm no unexpected semantic changes slipped in.Notes
Summary by CodeRabbit
Release Notes
New Features
Tests