Skip to content

Make record extractors self-contained; drop BaseRecordExtractor.convert dispatcher#18436

Merged
Jackie-Jiang merged 1 commit intoapache:masterfrom
Jackie-Jiang:base_record_extractor_minimize
May 7, 2026
Merged

Make record extractors self-contained; drop BaseRecordExtractor.convert dispatcher#18436
Jackie-Jiang merged 1 commit intoapache:masterfrom
Jackie-Jiang:base_record_extractor_minimize

Conversation

@Jackie-Jiang
Copy link
Copy Markdown
Contributor

Summary

Remove the shared BaseRecordExtractor.convert dispatcher and its helpers (convertSingleValue, convertMultiValue, convertCollection, convertArray, convertPrimitiveArray, convertMap, convertRecord, isMultiValue, isMap, isRecord). Each format extractor now owns its conversion logic.

BaseRecordExtractor keeps only:

  • init / _fields / _extractAll / initConfig — include-list resolution
  • stringifyMapKey — shared map-key contract serialization

What changed per extractor

Extractor Before After
JSONRecordExtractor called super.convert(...) inline convert / convertList / convertMap (BigInteger → BigDecimal; ListObject[]; MapMap<String, Object>; scalar pass-through)
CLPLogRecordExtractor called super.convert(...) same shape as JSON
ThriftRecordExtractor called super.convert(...); overrode isRecord / convertRecord inline convert / convertSingleValue / convertCollection / convertMap / convertRecord
Avro / Parquet (avro + native) / Arrow / ORC / ProtoBuf / CSV already self-contained unchanged

Other adjustments

  • Format-specific convert methods drop the inherited @Nullable where the new impl never returns null (JSON / CLPLog / Thrift). CSVRecordExtractor.convert gains @Nullable to match its existing null-on-empty behavior.
  • BaseRecordExtractorTest slimmed to cover the surviving surface (init, stringifyMapKey); the dispatcher tests are exercised through the per-format extractor tests.

🤖 Generated with Claude Code

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 7, 2026

Codecov Report

❌ Patch coverage is 84.84848% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.58%. Comparing base (d7981af) to head (113244b).

Files with missing lines Patch % Lines
...ugin/inputformat/thrift/ThriftRecordExtractor.java 79.54% 2 Missing and 7 partials ⚠️
...t/plugin/inputformat/json/JSONRecordExtractor.java 95.23% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18436      +/-   ##
============================================
+ Coverage     63.57%   63.58%   +0.01%     
- Complexity     1717     1729      +12     
============================================
  Files          3252     3252              
  Lines        199132   199111      -21     
  Branches      30875    30874       -1     
============================================
+ Hits         126595   126603       +8     
+ Misses        62461    62427      -34     
- Partials      10076    10081       +5     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-21 63.58% <84.84%> (+0.01%) ⬆️
temurin 63.58% <84.84%> (+0.01%) ⬆️
unittests 63.58% <84.84%> (+0.01%) ⬆️
unittests1 55.66% <4.54%> (-0.01%) ⬇️
unittests2 34.91% <84.84%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@Jackie-Jiang Jackie-Jiang added ingestion Related to data ingestion pipeline refactor Code restructuring without changing behavior labels May 7, 2026
@Jackie-Jiang Jackie-Jiang requested a review from Copilot May 7, 2026 10:17
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors Pinot’s record-extractor layer to remove the shared BaseRecordExtractor.convert(...) dispatcher and related helpers, making each input-format extractor responsible for its own type/shape conversion while keeping BaseRecordExtractor focused on include-list resolution (init) and shared map-key stringification (stringifyMapKey).

Changes:

  • Removed the conversion dispatcher and helper methods from BaseRecordExtractor, retaining only init/include-list handling and stringifyMapKey.
  • Inlined conversion logic into JSONRecordExtractor, CLPLogRecordExtractor, and ThriftRecordExtractor.
  • Updated tests to reflect the new ownership boundaries (base tests slimmed; format tests updated) and aligned CSVRecordExtractor.convert nullability with its behavior.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
pinot-spi/src/test/java/org/apache/pinot/spi/data/readers/BaseRecordExtractorTest.java Slimmed test coverage to init include-list behavior and stringifyMapKey.
pinot-spi/src/main/java/org/apache/pinot/spi/data/readers/BaseRecordExtractor.java Removed shared conversion dispatcher; keeps include-list resolution and stringifyMapKey.
pinot-plugins/pinot-input-format/pinot-thrift/src/test/java/org/apache/pinot/plugin/inputformat/thrift/ThriftRecordExtractorTest.java Updated commentary to match Thrift extractor’s new self-contained conversion path.
pinot-plugins/pinot-input-format/pinot-thrift/src/main/java/org/apache/pinot/plugin/inputformat/thrift/ThriftRecordExtractor.java Added Thrift-local conversion for collections/maps/nested structs and scalar normalization.
pinot-plugins/pinot-input-format/pinot-json/src/test/java/org/apache/pinot/plugin/inputformat/json/JSONRecordExtractorTest.java Updated commentary to reflect extractor-owned (not base-owned) BigInteger widening.
pinot-plugins/pinot-input-format/pinot-json/src/main/java/org/apache/pinot/plugin/inputformat/json/JSONRecordExtractor.java Added JSON-local conversion for BigInteger/List/Map recursion.
pinot-plugins/pinot-input-format/pinot-csv/src/main/java/org/apache/pinot/plugin/inputformat/csv/CSVRecordExtractor.java Marked convert as @Nullable and simplified empty/null check via StringUtils.isEmpty.
pinot-plugins/pinot-input-format/pinot-clp-log/src/main/java/org/apache/pinot/plugin/inputformat/clplog/CLPLogRecordExtractor.java Added JSON-like local conversion for non-CLP-encoded fields and refreshed class documentation.

Copy link
Copy Markdown
Contributor

@xiangfu0 xiangfu0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found one high-signal SPI compatibility issue; see inline comment.

@Jackie-Jiang Jackie-Jiang added the backward-incompat Introduces a backward-incompatible API or behavior change label May 7, 2026
@Jackie-Jiang Jackie-Jiang force-pushed the base_record_extractor_minimize branch from ccd1a3a to 8cd9073 Compare May 7, 2026 16:49
@Jackie-Jiang Jackie-Jiang requested a review from Copilot May 7, 2026 16:51
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 2 comments.

…rt dispatcher

Remove the shared `BaseRecordExtractor.convert` dispatcher and its helpers
(`convertSingleValue` / `convertMultiValue` / `convertCollection` /
`convertArray` / `convertPrimitiveArray` / `convertMap` / `convertRecord` /
`isMultiValue` / `isMap` / `isRecord`). Each format extractor now owns its
own conversion logic.

Inline format-specific `convert` (single-value, list, map, record dispatch
order) in `JSONRecordExtractor`, `CLPLogRecordExtractor`, `ThriftRecordExtractor`.
The other extractors (Avro / Parquet native + avro / Arrow / ORC / ProtoBuf /
CSV) were already self-contained.

`BaseRecordExtractor` now provides only:
- `init` / `_fields` / `_extractAll` / `initConfig` — include-list resolution
- `stringifyMapKey` — shared map-key contract serialization

`@Nullable` annotations on the per-format `convert` methods now reflect each
implementation's actual null-return behavior (none of the inlined ones return
null; CSV's gains `@Nullable` since empty input → null).
@Jackie-Jiang Jackie-Jiang force-pushed the base_record_extractor_minimize branch from 8cd9073 to 113244b Compare May 7, 2026 17:10
@Jackie-Jiang Jackie-Jiang added the cleanup Code cleanup or removal of dead code label May 7, 2026
@Jackie-Jiang Jackie-Jiang merged commit 7b292c1 into apache:master May 7, 2026
14 of 16 checks passed
@Jackie-Jiang Jackie-Jiang deleted the base_record_extractor_minimize branch May 7, 2026 18:30
@xiangfu0
Copy link
Copy Markdown
Contributor

xiangfu0 commented May 8, 2026

Opened a follow-up docs PR for this change: pinot-contrib/pinot-docs#803

xiangfu0 added a commit to pinot-contrib/pinot-docs that referenced this pull request May 8, 2026
## Summary
- update the custom record-reader plugin guide so it no longer says
BaseRecordExtractor provides the shared value-conversion dispatcher
- describe BaseRecordExtractor as shared scaffolding for include-list
handling and map-key stringification
- restate the current RecordExtractor contract for custom extractor
authors, cross-checked against apache/pinot#18436

## Validation
- git diff --check
- relative-link-check on edited pages
- cross-checked against apache/pinot source for #18436
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backward-incompat Introduces a backward-incompatible API or behavior change cleanup Code cleanup or removal of dead code ingestion Related to data ingestion pipeline refactor Code restructuring without changing behavior

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants