Merged
Pull request overview
This PR enables Avro as a fully-supported file format for both data files and manifests in the Paimon C++ SDK, aligning behavior with the Java implementation (including timestamp handling and the absence of per-column stats for Avro data files).
Changes:
- Implement a new Avro writer path based on a direct encoder (`AvroDirectEncoder`) instead of `GenericDatum`, and wire it into `AvroFormatWriter`, `AvroFileFormat`, and the Avro CMake target.
- Add an `AvroStatsExtractor` that produces schema-based (all-null) column stats for Avro files, plus comprehensive unit tests and extended integration tests to cover Avro manifests, tables, blob tables, global index behavior, and postponing bucket writers.
- Adjust timestamp precision, JSON test data, and statistics expectations across integration tests to match Avro’s supported timestamp resolutions and semantics, and enable previously disabled Avro read tests.
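The "schema-based (all-null) column stats" idea can be illustrated with a small standalone sketch. This is not the PR's `AvroStatsExtractor`; `ColumnStatsSketch`, `ExtractSchemaOnlyStats`, and `CanSkipFile` are hypothetical names, and the sketch only assumes that a stats entry is produced per schema field with min, max, and null count left unset.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a per-column statistics record.
struct ColumnStatsSketch {
    std::string column;
    std::optional<std::string> min;     // empty means "unknown"
    std::optional<std::string> max;     // empty means "unknown"
    std::optional<int64_t> null_count;  // empty means "unknown"
};

// Build one stats entry per schema field without reading any row data.
inline std::vector<ColumnStatsSketch> ExtractSchemaOnlyStats(
    const std::vector<std::string>& field_names) {
    std::vector<ColumnStatsSketch> stats;
    stats.reserve(field_names.size());
    for (const auto& name : field_names) {
        stats.push_back(ColumnStatsSketch{name, std::nullopt, std::nullopt, std::nullopt});
    }
    return stats;
}

// A consumer has to treat missing bounds as "cannot prune this file".
inline bool CanSkipFile(const ColumnStatsSketch& s, const std::string& wanted) {
    if (!s.min || !s.max) {
        return false;
    }
    return wanted < *s.min || wanted > *s.max;
}
```

With stats like these, predicate-based file skipping never fires, which is consistent with the integration tests skipping stats-dependent scenarios for Avro.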
Reviewed changes
Copilot reviewed 26 out of 26 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| test/test_data/avro/append_multiple.db/append_multiple/README | Updates expected snapshot contents for the Avro append-multiple test table to match the new Avro writer/read semantics and ordering. |
| test/inte/write_inte_test.cpp | Extends write integration tests to cover Avro (GetTestValuesForWriteInteTest), adjusts timestamp precision from nanos to millis and the associated JSON data and SimpleStats setup so that stats and expectations are compatible with Avro’s precision and behavior. |
| test/inte/write_and_read_inte_test.cpp | Switches manifest format in several tests from ORC to Avro and adds "avro" to the parameterized file formats, exercising end-to-end write/read with Avro manifests. |
| test/inte/scan_and_read_inte_test.cpp | Enables previously disabled Avro scan tests, strengthens assertions by concatenating chunks before comparison, and updates expected result JSON for Avro append-table reads. |
| test/inte/global_index_test.cpp | Adds Avro into the parameterization for global index tests while explicitly skipping index-scan cases for Avro, mirroring the lance behavior where index-based features aren’t supported. |
| test/inte/data_evolution_table_test.cpp | Includes Avro in parameterized data-evolution tests and consistently skips scenarios that rely on per-column stats or index functionality, noting that “lance and avro do not have stats”. |
| test/inte/blob_table_inte_test.cpp | Adds Avro to blob table tests and skips predicate/dense-stats/data-evolution tests for Avro in the same way as lance, since stats are not available. |
| src/paimon/format/avro/avro_stats_extractor_test.cpp | New tests that write Avro files with various primitive and nested types, then validate that AvroStatsExtractor returns all-null stats and that converting those stats to SimpleStats yields the expected hash codes. |
| src/paimon/format/avro/avro_stats_extractor.h | Introduces AvroStatsExtractor, a FormatStatsExtractor implementation for Avro that currently provides schema-driven, all-null ColumnStats and declares (but does not yet support) ExtractWithFileInfo. |
| src/paimon/format/avro/avro_stats_extractor.cpp | Implements AvroStatsExtractor::Extract by opening the Avro file via AvroFileFormat, reading the file schema, and creating appropriate ColumnStats objects per Arrow type (all with null min/max/null-count, including nested types). |
| src/paimon/format/avro/avro_output_stream_impl.cpp | Ensures each buffer flush also calls the underlying OutputStream::Flush(), so Avro blocks are reliably persisted before the internal buffer is reset. |
| src/paimon/format/avro/avro_format_writer.h | Refactors AvroFormatWriter to use ::avro::DataFileWriterBase and a new AvroDirectEncoder::EncodeContext (for scratch buffer reuse), increases the default sync interval, and removes the now-obsolete AvroAdaptor member. |
| src/paimon/format/avro/avro_format_writer.cpp | Updates writer construction to build a DataFileWriterBase, removes the GenericDatum conversion path, uses AvroDirectEncoder::EncodeArrowToAvro row-by-row, and implements ReachTargetSize using getCurrentBlockStart(). |
| src/paimon/format/avro/avro_file_format_test.cpp | Extends Avro file-format tests with complex/nested types, nested maps, and row-wise reads, comparing Arrow arrays round-tripped through Avro; however, several field(...) calls are missing the arrow:: qualifier and the tests call a non-existent arrow::decimal(2, 2), which will not compile. |
| src/paimon/format/avro/avro_file_format.cpp | Wires Avro stats extraction into the file format by returning an AvroStatsExtractor from CreateStatsExtractor instead of a NotImplemented status. |
| src/paimon/format/avro/avro_direct_encoder.h | Adds AvroDirectEncoder, a helper for encoding Arrow arrays directly to Avro using ::avro::Encoder, with an EncodeContext that reuses byte buffers for efficiency, adapted from Iceberg C++. |
| src/paimon/format/avro/avro_direct_encoder.cpp | Implements AvroDirectEncoder::EncodeArrowToAvro, handling primitives, timestamps (with Avro logical types), decimals, structs, lists, maps, and Avro unions directly from Arrow arrays; note that this file uses fmt::format but currently lacks an explicit #include "fmt/format.h", which will cause a compilation error. |
| src/paimon/format/avro/avro_adaptor_test.cpp | Removes the old Avro adapter test that validated conversion from Arrow arrays to GenericDatum, in favor of the new direct encoder tests. |
| src/paimon/format/avro/avro_adaptor.h | Removes the legacy AvroAdaptor class used for Arrow→GenericDatum conversion, now superseded by the direct encoder path. |
| src/paimon/format/avro/avro_adaptor.cpp | Deletes the previous implementation of AvroAdaptor::ConvertArrayToGenericDatums, fully dropping the GenericDatum-based write pipeline. |
| src/paimon/format/avro/CMakeLists.txt | Updates the Avro CMake target to drop avro_adaptor.*, add avro_direct_encoder.cpp and avro_stats_extractor.cpp, and include avro_stats_extractor_test.cpp in the test target. |
| src/paimon/core/postpone/postpone_bucket_writer_test.cpp | Extends the postpone-bucket writer tests to include Avro in their parameterization and to set the reader schema explicitly via ExportType when validating written data (especially important for Avro). |
| src/paimon/core/operation/abstract_split_read.cpp | Keeps Avro excluded from the prefetch path (similar to blob files), indicating that Avro readers are currently not used with the prefetching wrapper. |
| src/paimon/core/core_options.cpp | Removes the previous guard that rejected file.format=avro, allowing Avro to be used as a data file format in CoreOptions. |
| src/paimon/common/utils/date_time_utils.h | Adds GetArrowTimeUnitStr helper for turning arrow::TimeUnit enums into human-readable strings, used in improved Avro timestamp error messages. |
| LICENSE | Updates the NOTICE section to acknowledge the new Avro direct encoder files as code adapted from Apache Iceberg C++. |
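The timestamp-related rows above (a time-unit-to-string helper used in error messages, and tests moved from nanos to millis) can be sketched in isolation. Everything here is hypothetical: `TimeUnitSketch` is a local enum that only mirrors the four Arrow time units, and the mapping assumes that only the millisecond and microsecond Avro logical types are accepted by the writer, which the summary does not state explicitly.

```cpp
#include <optional>
#include <string>

// Hypothetical local enum mirroring the four Arrow time units.
enum class TimeUnitSketch { kSecond, kMilli, kMicro, kNano };

// Human-readable name for a unit, for use in error messages.
inline std::string TimeUnitToString(TimeUnitSketch unit) {
    switch (unit) {
        case TimeUnitSketch::kSecond:
            return "SECOND";
        case TimeUnitSketch::kMilli:
            return "MILLI";
        case TimeUnitSketch::kMicro:
            return "MICRO";
        case TimeUnitSketch::kNano:
            return "NANO";
    }
    return "UNKNOWN";
}

// Assumed mapping: only millis and micros have an accepted Avro logical type.
inline std::optional<std::string> AvroTimestampLogicalType(TimeUnitSketch unit) {
    switch (unit) {
        case TimeUnitSketch::kMilli:
            return std::string("timestamp-millis");
        case TimeUnitSketch::kMicro:
            return std::string("timestamp-micros");
        default:
            return std::nullopt;
    }
}

// Produce either the logical type or a readable error naming the unit.
inline std::string DescribeTimestampSupport(TimeUnitSketch unit) {
    const auto logical = AvroTimestampLogicalType(unit);
    if (logical) {
        return "ok: " + *logical;
    }
    return "unsupported timestamp unit: " + TimeUnitToString(unit);
}
```

Naming the unit in the message is the point of such a helper: a failure reads as "unsupported timestamp unit: NANO" rather than as a bare enum value.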
lxy-9602
reviewed
Feb 2, 2026
Collaborator
Please enrich the PR description.
Collaborator
+1
lucasfang
approved these changes
Feb 5, 2026
Purpose
Add full support for writing Avro files in paimon-cpp.
Implement high-performance Avro writing, following the Apache Iceberg C++ implementation: Arrow array data is encoded directly to Avro without going through GenericDatum.
Linked issue: #31
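To make "encode directly, without GenericDatum" concrete, here is a minimal standalone sketch of writing column values straight into Avro's binary form. It does not use the Avro C++ library or the PR's `AvroDirectEncoder`; `ScratchContext`, `EncodeLong`, `EncodeString`, and `EncodeRow` are hypothetical names. The byte layout follows the Avro specification: longs are zig-zag mapped and written as variable-length integers, and strings are a long length followed by the bytes.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a reusable encode context: one scratch buffer
// that is cleared, not reallocated, between rows.
struct ScratchContext {
    std::vector<uint8_t> bytes;
};

// Avro long: zig-zag map the sign, then emit 7 bits per byte, low-order group
// first, with the high bit set on every byte except the last.
inline void EncodeLong(int64_t value, std::vector<uint8_t>* out) {
    const uint64_t sign_mask = value < 0 ? ~uint64_t{0} : uint64_t{0};
    uint64_t zz = (static_cast<uint64_t>(value) << 1) ^ sign_mask;
    while (zz >= 0x80) {
        out->push_back(static_cast<uint8_t>((zz & 0x7F) | 0x80));
        zz >>= 7;
    }
    out->push_back(static_cast<uint8_t>(zz));
}

// Avro string: byte length as a long, followed by the raw bytes.
inline void EncodeString(const std::string& s, std::vector<uint8_t>* out) {
    EncodeLong(static_cast<int64_t>(s.size()), out);
    out->insert(out->end(), s.begin(), s.end());
}

// Encode one (id, name) row straight from column vectors into the scratch
// buffer, with no intermediate per-row object.
inline const std::vector<uint8_t>& EncodeRow(const std::vector<int64_t>& ids,
                                             const std::vector<std::string>& names,
                                             std::size_t row, ScratchContext* ctx) {
    ctx->bytes.clear();  // keeps capacity, so later rows reuse the allocation
    EncodeLong(ids[row], &ctx->bytes);
    EncodeString(names[row], &ctx->bytes);
    return ctx->bytes;
}
```

The saving in this style comes from skipping the per-row datum object and from reusing one buffer across rows; the real encoder additionally has to handle unions, nested types, and logical types.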
Tests
AvroFileBatchReaderTest.TestGetNumberOfRows
AvroFileFormatTest.TestComplexTypes
AvroFileFormatTest.TestNestedMap
AvroFormatWriterTest.*
AvroStatsExtractorTest.*
ScanAndReadInteTest.TestWithPKWithDvBatchScanSnapshot6 // for avro
ScanAndReadInteTest.TestAvroWithAppendTable
ScanAndReadInteTest.TestAvroWithPkTable
API and Format
GetNumberOfRows() -> Result<uint64_t> GetNumberOfRows()
orc -> avro
Documentation