
feat(format): support avro write #99

Merged
zjw1111 merged 15 commits into alibaba:main from zjw1111:support_avro_write
Feb 5, 2026

@zjw1111
Collaborator

@zjw1111 zjw1111 commented Jan 30, 2026

Purpose

Add full support for writing Avro format files in paimon-cpp.

Implement high-performance Avro writing, modeled on the Apache Iceberg C++ implementation: Arrow array data is encoded directly to Avro without going through GenericDatum.

Linked issue: #31
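
To illustrate what "directly encode" means in practice, here is a minimal sketch that writes a nullable 64-bit integer column straight into an Avro-style byte buffer, with no intermediate per-row datum objects. It is not the PR's AvroDirectEncoder: `EncodeLong` and `EncodeNullableLongColumn` are hypothetical names, a `std::vector<std::optional<int64_t>>` stands in for an Arrow array with a validity bitmap, and the union branch order `["null", "long"]` is an assumption. The zigzag variable-length integer layout and the "branch index, then value" union layout follow the Avro binary encoding.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Avro encodes int/long as a zigzag-mapped, variable-length integer.
inline void EncodeLong(int64_t value, std::vector<uint8_t>& out) {
    // Zigzag maps signed values to unsigned ones so small magnitudes stay short.
    uint64_t z = (static_cast<uint64_t>(value) << 1) ^ static_cast<uint64_t>(value >> 63);
    while (z >= 0x80) {
        out.push_back(static_cast<uint8_t>(z | 0x80));  // low 7 bits plus continuation bit
        z >>= 7;
    }
    out.push_back(static_cast<uint8_t>(z));
}

// Hypothetical helper: encodes a nullable column as the union ["null", "long"]
// (assumed branch order), one row after another, into a single output buffer.
inline std::vector<uint8_t> EncodeNullableLongColumn(
    const std::vector<std::optional<int64_t>>& column) {
    std::vector<uint8_t> out;
    for (const auto& cell : column) {
        if (!cell.has_value()) {
            EncodeLong(0, out);  // branch 0: null, which carries no payload
            continue;
        }
        EncodeLong(1, out);      // branch 1: long
        EncodeLong(*cell, out);  // the value itself
    }
    return out;
}
```

Because every row is appended to the same buffer, the only per-row work is the encoding itself, which is the general idea behind skipping a generic datum layer.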

Tests

AvroFileBatchReaderTest.TestGetNumberOfRows
AvroFileFormatTest.TestComplexTypes
AvroFileFormatTest.TestNestedMap
AvroFormatWriterTest.*
AvroStatsExtractorTest.*
ScanAndReadInteTest.TestWithPKWithDvBatchScanSnapshot6 // for avro
ScanAndReadInteTest.TestAvroWithAppendTable
ScanAndReadInteTest.TestAvroWithPkTable

API and Format

  • FileBatchReader interface: GetNumberOfRows() now returns Result<uint64_t>
  • default manifest format: orc -> avro
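
For callers, the practical effect of the `GetNumberOfRows()` change is that the row count now has to be unwrapped from a result object. The sketch below shows that calling pattern only. The `Result` template, its `ok()`, `value()`, `error()` and `Error()` members, and `FakeReader` are all hypothetical stand-ins written for this example; the actual paimon-cpp `Result` API is not shown in this PR page and may differ.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <variant>

// Hypothetical stand-in for a Result<T> type: holds either a value or an error message.
template <typename T>
class Result {
    struct Failure {
        std::string message;
    };

public:
    Result(T value) : state_(std::move(value)) {}  // implicit, so `return count;` works
    static Result Error(std::string message) { return Result(Failure{std::move(message)}); }

    bool ok() const { return std::holds_alternative<T>(state_); }
    const T& value() const { return std::get<T>(state_); }
    const std::string& error() const { return std::get<Failure>(state_).message; }

private:
    explicit Result(Failure failure) : state_(std::move(failure)) {}
    std::variant<T, Failure> state_;
};

// Hypothetical reader: counting rows can fail, so the count is wrapped in a Result.
struct FakeReader {
    bool opened = false;
    uint64_t rows = 0;

    Result<uint64_t> GetNumberOfRows() const {
        if (!opened) {
            return Result<uint64_t>::Error("reader is not open");
        }
        return rows;
    }
};
```

A caller would check `ok()` before reading `value()`, and propagate or log `error()` otherwise.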

Documentation

Copilot AI left a comment

Pull request overview

This PR enables Avro as a fully-supported file format for both data files and manifests in the Paimon C++ SDK, aligning behavior with the Java implementation (including timestamp handling and the absence of per-column stats for Avro data files).

Changes:

  • Implement a new Avro writer path based on a direct encoder (AvroDirectEncoder) instead of GenericDatum, and wire it into AvroFormatWriter, AvroFileFormat, and the Avro CMake target.
  • Add an AvroStatsExtractor that produces schema-based (all-null) column stats for Avro files, plus comprehensive unit tests and extended integration tests to cover Avro manifests, tables, blob tables, global index behavior, and postponing bucket writers.
  • Adjust timestamp precision, JSON test data, and statistics expectations across integration tests to match Avro’s supported timestamp resolutions and semantics, and enable previously disabled Avro read tests.
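
As a rough picture of what "schema-based (all-null) column stats" could look like, the sketch below builds one stats entry per schema field and leaves min, max, and null count unset. `ColumnStatsSketch`, `MakeSchemaOnlyStats`, and `CanPruneByStats` are hypothetical names invented for this example and do not reflect the real ColumnStats class or AvroStatsExtractor; the pruning check is likewise an assumption about how a consumer might treat missing stats.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stats record: every statistic is optional, and "unset" means unknown.
struct ColumnStatsSketch {
    std::string field_name;
    std::optional<int64_t> min;
    std::optional<int64_t> max;
    std::optional<int64_t> null_count;
};

// Produces one entry per schema field without reading any row data,
// so all statistics stay unset.
inline std::vector<ColumnStatsSketch> MakeSchemaOnlyStats(
    const std::vector<std::string>& field_names) {
    std::vector<ColumnStatsSketch> stats;
    stats.reserve(field_names.size());
    for (const auto& name : field_names) {
        stats.push_back(ColumnStatsSketch{name, std::nullopt, std::nullopt, std::nullopt});
    }
    return stats;
}

// Assumed consumer-side rule: range-based pruning needs both bounds to be known.
inline bool CanPruneByStats(const ColumnStatsSketch& stats) {
    return stats.min.has_value() && stats.max.has_value();
}
```

Under that assumed rule, a file with schema-only stats is never pruned, which would be consistent with the tests skipping stats-dependent scenarios for Avro.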

Reviewed changes

Copilot reviewed 26 out of 26 changed files in this pull request and generated 9 comments.

File Description
test/test_data/avro/append_multiple.db/append_multiple/README Updates expected snapshot contents for the Avro append-multiple test table to match the new Avro writer/read semantics and ordering.
test/inte/write_inte_test.cpp Extends write integration tests to cover Avro (GetTestValuesForWriteInteTest), adjusts timestamp precision from nanos to millis and the associated JSON data and SimpleStats setup so that stats and expectations are compatible with Avro’s precision and behavior.
test/inte/write_and_read_inte_test.cpp Switches manifest format in several tests from ORC to Avro and adds "avro" to the parameterized file formats, exercising end-to-end write/read with Avro manifests.
test/inte/scan_and_read_inte_test.cpp Enables previously disabled Avro scan tests, strengthens assertions by concatenating chunks before comparison, and updates expected result JSON for Avro append-table reads.
test/inte/global_index_test.cpp Adds Avro into the parameterization for global index tests while explicitly skipping index-scan cases for Avro, mirroring the lance behavior where index-based features aren’t supported.
test/inte/data_evolution_table_test.cpp Includes Avro in parameterized data-evolution tests and consistently skips scenarios that rely on per-column stats or index functionality, noting that “lance and avro do not have stats”.
test/inte/blob_table_inte_test.cpp Adds Avro to blob table tests and skips predicate/dense-stats/data-evolution tests for Avro in the same way as lance, since stats are not available.
src/paimon/format/avro/avro_stats_extractor_test.cpp New tests that write Avro files with various primitive and nested types, then validate that AvroStatsExtractor returns all-null stats and that converting those stats to SimpleStats yields the expected hash codes.
src/paimon/format/avro/avro_stats_extractor.h Introduces AvroStatsExtractor, a FormatStatsExtractor implementation for Avro that currently provides schema-driven, all-null ColumnStats and declares (but does not yet support) ExtractWithFileInfo.
src/paimon/format/avro/avro_stats_extractor.cpp Implements AvroStatsExtractor::Extract by opening the Avro file via AvroFileFormat, reading the file schema, and creating appropriate ColumnStats objects per Arrow type (all with null min/max/null-count, including nested types).
src/paimon/format/avro/avro_output_stream_impl.cpp Ensures each buffer flush also calls the underlying OutputStream::Flush(), so Avro blocks are reliably persisted before the internal buffer is reset.
src/paimon/format/avro/avro_format_writer.h Refactors AvroFormatWriter to use ::avro::DataFileWriterBase and a new AvroDirectEncoder::EncodeContext (for scratch buffer reuse), increases the default sync interval, and removes the now-obsolete AvroAdaptor member.
src/paimon/format/avro/avro_format_writer.cpp Updates writer construction to build a DataFileWriterBase, removes the GenericDatum conversion path, uses AvroDirectEncoder::EncodeArrowToAvro row-by-row, and implements ReachTargetSize using getCurrentBlockStart().
src/paimon/format/avro/avro_file_format_test.cpp Extends Avro file-format tests with complex/nested types, nested maps, and row-wise reads, comparing Arrow arrays round-tripped through Avro; however, several field(...) calls are missing the arrow:: qualifier and the tests call a non-existent arrow::decimal(2, 2), which will not compile.
src/paimon/format/avro/avro_file_format.cpp Wires Avro stats extraction into the file format by returning an AvroStatsExtractor from CreateStatsExtractor instead of a NotImplemented status.
src/paimon/format/avro/avro_direct_encoder.h Adds AvroDirectEncoder, a helper for encoding Arrow arrays directly to Avro using ::avro::Encoder, with an EncodeContext that reuses byte buffers for efficiency, adapted from Iceberg C++.
src/paimon/format/avro/avro_direct_encoder.cpp Implements AvroDirectEncoder::EncodeArrowToAvro, handling primitives, timestamps (with Avro logical types), decimals, structs, lists, maps, and Avro unions directly from Arrow arrays; note that this file uses fmt::format but currently lacks an explicit #include "fmt/format.h", which will cause a compilation error.
src/paimon/format/avro/avro_adaptor_test.cpp Removes the old Avro adapter test that validated conversion from Arrow arrays to GenericDatum, in favor of the new direct encoder tests.
src/paimon/format/avro/avro_adaptor.h Removes the legacy AvroAdaptor class used for Arrow→GenericDatum conversion, now superseded by the direct encoder path.
src/paimon/format/avro/avro_adaptor.cpp Deletes the previous implementation of AvroAdaptor::ConvertArrayToGenericDatums, fully dropping the GenericDatum-based write pipeline.
src/paimon/format/avro/CMakeLists.txt Updates the Avro CMake target to drop avro_adaptor.*, add avro_direct_encoder.cpp and avro_stats_extractor.cpp, and include avro_stats_extractor_test.cpp in the test target.
src/paimon/core/postpone/postpone_bucket_writer_test.cpp Extends the postpone-bucket writer tests to include Avro in their parameterization and to set the reader schema explicitly via ExportType when validating written data (especially important for Avro).
src/paimon/core/operation/abstract_split_read.cpp Keeps Avro excluded from the prefetch path (similar to blob files), indicating that Avro readers are currently not used with the prefetching wrapper.
src/paimon/core/core_options.cpp Removes the previous guard that rejected file.format=avro, allowing Avro to be used as a data file format in CoreOptions.
src/paimon/common/utils/date_time_utils.h Adds GetArrowTimeUnitStr helper for turning arrow::TimeUnit enums into human-readable strings, used in improved Avro timestamp error messages.
LICENSE Updates the NOTICE section to acknowledge the new Avro direct encoder files as code adapted from Apache Iceberg C++.
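
The avro_output_stream_impl.cpp entry above describes flushing the underlying stream every time the internal buffer is drained. A minimal sketch of that pattern follows, assuming a simple byte sink; `RecordingSink` and `BufferedWriterSketch` are hypothetical classes written for this example and are not the PR's output stream implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical underlying stream that records what it receives and how often it is flushed.
struct RecordingSink {
    std::vector<uint8_t> data;
    int flush_calls = 0;

    void Write(const uint8_t* bytes, size_t n) { data.insert(data.end(), bytes, bytes + n); }
    void Flush() { ++flush_calls; }
};

// Hypothetical buffered writer: whenever the buffer is drained, the bytes are
// handed to the sink AND the sink is flushed before the buffer is reset.
class BufferedWriterSketch {
public:
    BufferedWriterSketch(RecordingSink* sink, size_t capacity)
        : sink_(sink), capacity_(capacity) {
        buffer_.reserve(capacity);
    }

    void Append(const uint8_t* bytes, size_t n) {
        for (size_t i = 0; i < n; ++i) {
            if (buffer_.size() == capacity_) {
                FlushBuffer();  // buffer is full: drain it before accepting more
            }
            buffer_.push_back(bytes[i]);
        }
    }

    void FlushBuffer() {
        if (!buffer_.empty()) {
            sink_->Write(buffer_.data(), buffer_.size());
        }
        sink_->Flush();   // persist before the buffer contents are discarded
        buffer_.clear();  // reset only after the sink has been flushed
    }

private:
    RecordingSink* sink_;
    size_t capacity_;
    std::vector<uint8_t> buffer_;
};
```

The ordering is the point of the sketch: the buffer is cleared only after the sink has both received and flushed the bytes.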

@lxy-9602
Collaborator

lxy-9602 commented Feb 2, 2026

Please enrich the PR description.

@lucasfang
Collaborator

+1

@zjw1111 zjw1111 merged commit cbacf3a into alibaba:main Feb 5, 2026
8 checks passed

3 participants