Initial attempt at a standalone Arrow variant implementation#275
Initial attempt at a standalone Arrow variant implementation#275CurtHagenlocher wants to merge 7 commits intoapache:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
Adds a new standalone Apache.Arrow.Variant assembly implementing Arrow/Parquet “Variant” encoding with zero-copy readers, a materialized value model, JSON integration, conformance tests, and benchmarks.
Changes:
- Introduces
Apache.Arrow.Variantlibrary (encoding helpers, metadata builder/reader, array/object readers, materializedVariantValuemodel). - Adds extensive xUnit test suite plus optional conformance tests driven by the
apache/parquet-testingsubmodule vectors. - Adds BenchmarkDotNet benchmarks and wires the new package into release verification.
Reviewed changes
Copilot reviewed 44 out of 44 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| test/parquet-testing | Adds parquet-testing submodule pointer for conformance vectors. |
| test/Apache.Arrow.Variant.Tests/VariantValueTests.cs | Adds unit tests for VariantValue factories/accessors/equality/materialization. |
| test/Apache.Arrow.Variant.Tests/VariantSqlDecimalTests.cs | Adds tests covering Decimal16/SqlDecimal behaviors across encode/decode/JSON. |
| test/Apache.Arrow.Variant.Tests/VariantRoundTripTests.cs | Adds encode→decode round-trip tests for primitives and containers. |
| test/Apache.Arrow.Variant.Tests/VariantReaderPrimitiveTests.cs | Adds primitive-type coverage for VariantReader. |
| test/Apache.Arrow.Variant.Tests/VariantReaderObjectTests.cs | Adds object reader tests for field lookup and error cases. |
| test/Apache.Arrow.Variant.Tests/VariantReaderArrayTests.cs | Adds array reader tests and bounds/error cases. |
| test/Apache.Arrow.Variant.Tests/VariantMetadataTests.cs | Adds metadata parsing/lookup tests for sorted/unsorted dictionaries. |
| test/Apache.Arrow.Variant.Tests/VariantJsonTests.cs | Adds JSON converter/reader/writer tests and round-trips. |
| test/Apache.Arrow.Variant.Tests/VariantEncodingHelperTests.cs | Adds tests for header packing and LE int helpers. |
| test/Apache.Arrow.Variant.Tests/VariantBuilderTests.cs | Adds tests for builder encoding of primitives/containers. |
| test/Apache.Arrow.Variant.Tests/TestVectors.cs | Adds hand-crafted binary vectors for reader/builder tests. |
| test/Apache.Arrow.Variant.Tests/ParquetTestingVectorTests.cs | Adds conformance tests comparing decoded variants to parquet-testing JSON expectations. |
| test/Apache.Arrow.Variant.Tests/Apache.Arrow.Variant.Tests.csproj | Adds new test project (multi-target on Windows). |
| test/Apache.Arrow.Variant.Benchmarks/StructVariantValue.cs | Adds a struct-based variant model for benchmark comparisons. |
| test/Apache.Arrow.Variant.Benchmarks/Program.cs | Adds BenchmarkDotNet entry point. |
| test/Apache.Arrow.Variant.Benchmarks/MixedWorkloadBenchmarks.cs | Adds a mixed build/traverse workload benchmark. |
| test/Apache.Arrow.Variant.Benchmarks/EqualityBenchmarks.cs | Adds equality benchmarks (class vs struct). |
| test/Apache.Arrow.Variant.Benchmarks/EncodingBenchmarks.cs | Adds encode/decode/JSON benchmarks. |
| test/Apache.Arrow.Variant.Benchmarks/CreationBenchmarks.cs | Adds creation/constructor benchmarks. |
| test/Apache.Arrow.Variant.Benchmarks/ArrayBenchmarks.cs | Adds array fill/sum benchmarks. |
| test/Apache.Arrow.Variant.Benchmarks/AccessBenchmarks.cs | Adds typed-accessor benchmarks. |
| test/Apache.Arrow.Variant.Benchmarks/Apache.Arrow.Variant.Benchmarks.csproj | Adds new benchmarks project. |
| src/Apache.Arrow.Variant/VariantValue.cs | Introduces materialized VariantValue discriminated-union struct. |
| src/Apache.Arrow.Variant/VariantReader.cs | Introduces zero-copy VariantReader and primitive materialization. |
| src/Apache.Arrow.Variant/VariantPrimitiveType.cs | Defines primitive type IDs. |
| src/Apache.Arrow.Variant/VariantObjectReader.cs | Introduces zero-copy object reader with binary-search field lookup. |
| src/Apache.Arrow.Variant/VariantMetadataBuilder.cs | Introduces metadata dictionary builder with UTF-8 sort + remap. |
| src/Apache.Arrow.Variant/VariantMetadata.cs | Introduces zero-copy metadata reader with find-by-bytes. |
| src/Apache.Arrow.Variant/VariantEncodingHelper.cs | Adds encoding header helpers and variable-width LE ints. |
| src/Apache.Arrow.Variant/VariantBasicType.cs | Defines base type IDs. |
| src/Apache.Arrow.Variant/VariantArrayReader.cs | Introduces zero-copy array reader. |
| src/Apache.Arrow.Variant/Properties/AssemblyInfo.cs | Exposes internals to the new test assembly. |
| src/Apache.Arrow.Variant/Json/VariantJsonWriter.cs | Adds binary→JSON writer plus VariantValue→JSON helper. |
| src/Apache.Arrow.Variant/Json/VariantJsonReader.cs | Adds JSON→binary entry points via VariantBuilder. |
| src/Apache.Arrow.Variant/Json/VariantJsonConverter.cs | Adds System.Text.Json converter for VariantValue. |
| src/Apache.Arrow.Variant/Apache.Arrow.Variant.csproj | Adds new library project and package references. |
| dev/release/verify_rc.sh | References the new package in release-candidate verification. |
| dev/release/rat_exclude_files.txt | Updates RAT exclusions for parquet-testing vectors. |
| Directory.Packages.props | Adds central package version for System.Data.Common. |
| Apache.Arrow.sln | Adds Variant projects to the solution. |
| Apache.Arrow.Tests.slnf | Adds Variant tests project to test solution filter. |
| .gitmodules | Adds submodule configuration for parquet-testing. |
Comments suppressed due to low confidence (2)
test/Apache.Arrow.Variant.Tests/ParquetTestingVectorTests.cs:1
actual.AsDecimal()can throwOverflowExceptionfor Decimal16 values stored asSqlDecimal(which is explicitly supported elsewhere in this PR). This will make conformance tests fail on large Decimal16 values. Useactual.AsSqlDecimal()for Decimal16 and compare via a non-throwing representation (e.g., compare string forms when the JSON is a plain integer/decimal literal, or branch onactual.IsSqlDecimalStorage).
test/Apache.Arrow.Variant.Tests/VariantReaderPrimitiveTests.cs:1- This pattern (try/catch +
Assert.Fail) is repeated in multiple tests and is harder to read/maintain than xUnit’s built-inAssert.Throws<T>()(which you already use elsewhere in this PR). PreferAssert.Throws<InvalidOperationException>(() => reader.GetInt32());to reduce boilerplate and make intent clearer.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| /// <summary>Gets the object fields as a read-only dictionary.</summary> | ||
| public IReadOnlyDictionary<string, VariantValue> AsObject() | ||
| { | ||
| if (_primitiveType != ObjectSentinel) | ||
| { | ||
| throw new InvalidOperationException($"Cannot read object from variant type {_primitiveType}."); | ||
| } | ||
| return (Dictionary<string, VariantValue>)_objectValue; | ||
| } | ||
|
|
||
| /// <summary>Gets the array elements as a read-only list.</summary> | ||
| public IReadOnlyList<VariantValue> AsArray() | ||
| { | ||
| if (_primitiveType != ArraySentinel) | ||
| { | ||
| throw new InvalidOperationException($"Cannot read array from variant type {_primitiveType}."); | ||
| } | ||
| return (List<VariantValue>)_objectValue; | ||
| } |
There was a problem hiding this comment.
Although the return types are IReadOnlyDictionary/IReadOnlyList, the implementation returns the underlying mutable Dictionary/List instances directly, which can be downcast and mutated. This undermines the “readonly struct / discriminated union” immutability expectations and can invalidate equality/hash semantics. Consider storing truly read-only wrappers (e.g., ReadOnlyDictionary / ReadOnlyCollection) or returning wrapper views to prevent mutation.
There was a problem hiding this comment.
I'm thinking about it.
adamreeve
left a comment
There was a problem hiding this comment.
I haven't closely checked every line here, and I'm also not very familiar with the variant type, but FWIW this looks good to me 😄
I see there's a canonical arrow.parquet.variant extension type, is the plan to implement that as a follow-up, in the same assembly?
I guess something that we might want to add later is features for working with shredded data, like allowing VariantJsonReader and VariantBuilder to exclude values for shredded fields. But I think it should be straightforward to add that without breaking any existing APIs.
| VariantReader reader = new VariantReader(metadata, value); | ||
| WriteReader(writer, reader); | ||
| } | ||
| return Encoding.UTF8.GetString(ms.GetBuffer(), 0, (int)ms.Position); |
There was a problem hiding this comment.
Should we use checked here and in ToJson to catch overflows?
| /// <summary> | ||
| /// Gets the primitive type when <see cref="BasicType"/> is <see cref="VariantBasicType.Primitive"/>. | ||
| /// </summary> | ||
| public VariantPrimitiveType PrimitiveType => VariantEncodingHelper.GetPrimitiveType(Header); |
There was a problem hiding this comment.
Should throw InvalidOperationException if BasicType isn't Primitive?
|
Yes, both shredding and an extension type are on the menu. I'm a bit torn about the dependency layering for the extension type, though. I feel like the Parquet variant could be a useful type by itself without an Arrow dependency, so perhaps the type would live in the Arrow assembly instead of this one. What's your feeling? |
|
For instance, maybe Parquet.NET could make use of this as a standalone dependency (CC @aloneguid) |
|
Yes I guess the extension type will be quite lightweight and won't need any dependency on the new assembly, so it seems reasonable to add it to Apache.Arrow. |
What's Changed
This adds a new standalone assembly for dealing with Arrow/Parquet variants. The implementation was largely coded by AI, with significant input from me about API and design constraints.
I think the most important thing for review purposes is probably the API and the tests.