feat: implement DataWriter for Iceberg data files #552
Conversation
Force-pushed 8944a75 to a201953
src/iceberg/data/data_writer.cc (outdated)

```cpp
ICEBERG_ASSIGN_OR_RAISE(writer_,
                        WriterFactoryRegistry::Open(options_.format, writer_options));
return {};
```
It is odd that an empty structure is always returned. Also, since this is initialization, why not do it in the constructor?
Refactored the initialization logic.
```cpp
if (closed_) {
  return InvalidArgument("Writer already closed");
}
```
I could see a case for making Close() idempotent; is there any strong reason why we want to return this error instead of a no-op, for example?
```cpp
  return InvalidArgument("Writer already closed");
}
ICEBERG_RETURN_UNEXPECTED(writer_->Close());
closed_ = true;
```
Should this class address thread safety?
Good question! I've added explicit documentation that this class is not thread-safe.
I don't think a single writer (or reader) should support thread safety, so it is fine not to add a comment like this.
@wgtmac out of curiosity, for my own knowledge: what guarantees that a single writer/reader will be using the class?
These file writers are supposed to be used by a single write task, which can, for example, be a unit of a table sink operator in a SQL job plan. Usually the writer is responsible for partitioned (and sometimes sorted) data chunks.
Agreed. Removed the thread-safety comment from the header.
src/iceberg/test/data_writer_test.cc (outdated)
```cpp
TEST_F(DataWriterTest, CreateWithParquetFormat) {
  DataWriterOptions options{
      .path = "test_data.parquet",
      .schema = schema_,
      .spec = partition_spec_,
      .partition = PartitionValues{},
      .format = FileFormatType::kParquet,
      .io = file_io_,
      .properties = {{"write.parquet.compression-codec", "uncompressed"}},
  };

  auto writer_result = DataWriter::Make(options);
  ASSERT_THAT(writer_result, IsOk());
  auto writer = std::move(writer_result.value());
  ASSERT_NE(writer, nullptr);
}

TEST_F(DataWriterTest, CreateWithAvroFormat) {
  DataWriterOptions options{
      .path = "test_data.avro",
      .schema = schema_,
      .spec = partition_spec_,
      .partition = PartitionValues{},
      .format = FileFormatType::kAvro,
      .io = file_io_,
  };

  auto writer_result = DataWriter::Make(options);
  ASSERT_THAT(writer_result, IsOk());
  auto writer = std::move(writer_result.value());
  ASSERT_NE(writer, nullptr);
}
```
nit: The two tests are quite similar; it should be possible to use a helper function to reduce the duplication.
Consolidated the two tests using parameterized testing.
```cpp
// Check length before close
auto length_result = writer->Length();
ASSERT_THAT(length_result, IsOk());
EXPECT_GT(length_result.value(), 0);
```
nit: check the size of the data passed to the write function?
src/iceberg/data/data_writer.cc (outdated)
```cpp
}

Result<FileWriter::WriteResult> Metadata() {
  if (!closed_) {
```
nit: use ICEBERG_CHECK here
src/iceberg/test/data_writer_test.cc (outdated)
```cpp
  EXPECT_GT(length.value(), 0);
}

}  // namespace
```
nit: move this closing namespace curly brace before the first TEST_F?
Force-pushed 90d324e to 153d763
Implements DataWriter class for writing Iceberg data files as part of issue apache#441 (task 2).

Implementation:
- Static factory method DataWriter::Make() for creating writer instances
- Support for Parquet and Avro file formats via WriterFactoryRegistry
- Complete DataFile metadata generation including partition info, column statistics, serialized bounds, and sort order ID
- Proper lifecycle management with Write/Close/Metadata methods
- Idempotent Close(): multiple calls succeed (no-op after first)
- PIMPL idiom for ABI stability
- Not thread-safe (documented)

Tests:
- 13 comprehensive unit tests including parameterized format tests
- Coverage: creation, write/close lifecycle, metadata generation, error handling, feature validation, and data size verification
- All tests passing (13/13)

Related to apache#441
Force-pushed 153d763 to 147f25b
src/iceberg/data/data_writer.cc (outdated)
```cpp
}

Result<FileWriter::WriteResult> Metadata() {
  ICEBERG_PRECHECK(closed_, "Cannot get metadata before closing the writer");
```
Suggested change:

```diff
- ICEBERG_PRECHECK(closed_, "Cannot get metadata before closing the writer");
+ ICEBERG_CHECK(closed_, "Cannot get metadata before closing the writer");
```

We should return invalid state instead of invalid argument in this case.
- Use aggregate initialization for WriterOptions and DataFile
- Change ICEBERG_PRECHECK(writer_) to ICEBERG_DCHECK (can never fail)
- Use ICEBERG_CHECK for closed state check (returns ValidationFailed)
- Use value_or(-1) for missing row count to match the Java implementation
- Use range constructors for metrics map conversion
- Remove unnecessary thread-safety comment
- Use int32()/string() factory functions in tests
- Consolidate test cases and add helpers to reduce boilerplate
wgtmac left a comment:
LGTM. Thanks @shangxinli for working on this and @evindj for the review!