Skip to content

Initial attempt at a standalone Arrow variant implementation#275

Open
CurtHagenlocher wants to merge 7 commits intoapache:mainfrom
CurtHagenlocher:ArrowVariant
Open

Initial attempt at a standalone Arrow variant implementation#275
CurtHagenlocher wants to merge 7 commits intoapache:mainfrom
CurtHagenlocher:ArrowVariant

Conversation

@CurtHagenlocher
Copy link
Contributor

@CurtHagenlocher CurtHagenlocher commented Feb 28, 2026

What's Changed

This adds a new standalone assembly for dealing with Arrow/Parquet variants. The implementation was largely coded by AI, with significant input from me about API and design constraints.

I think the most important thing for review purposes is probably the API and the tests.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new standalone Apache.Arrow.Variant assembly implementing Arrow/Parquet “Variant” encoding with zero-copy readers, a materialized value model, JSON integration, conformance tests, and benchmarks.

Changes:

  • Introduces Apache.Arrow.Variant library (encoding helpers, metadata builder/reader, array/object readers, materialized VariantValue model).
  • Adds extensive xUnit test suite plus optional conformance tests driven by the apache/parquet-testing submodule vectors.
  • Adds BenchmarkDotNet benchmarks and wires the new package into release verification.

Reviewed changes

Copilot reviewed 44 out of 44 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
test/parquet-testing Adds parquet-testing submodule pointer for conformance vectors.
test/Apache.Arrow.Variant.Tests/VariantValueTests.cs Adds unit tests for VariantValue factories/accessors/equality/materialization.
test/Apache.Arrow.Variant.Tests/VariantSqlDecimalTests.cs Adds tests covering Decimal16/SqlDecimal behaviors across encode/decode/JSON.
test/Apache.Arrow.Variant.Tests/VariantRoundTripTests.cs Adds encode→decode round-trip tests for primitives and containers.
test/Apache.Arrow.Variant.Tests/VariantReaderPrimitiveTests.cs Adds primitive-type coverage for VariantReader.
test/Apache.Arrow.Variant.Tests/VariantReaderObjectTests.cs Adds object reader tests for field lookup and error cases.
test/Apache.Arrow.Variant.Tests/VariantReaderArrayTests.cs Adds array reader tests and bounds/error cases.
test/Apache.Arrow.Variant.Tests/VariantMetadataTests.cs Adds metadata parsing/lookup tests for sorted/unsorted dictionaries.
test/Apache.Arrow.Variant.Tests/VariantJsonTests.cs Adds JSON converter/reader/writer tests and round-trips.
test/Apache.Arrow.Variant.Tests/VariantEncodingHelperTests.cs Adds tests for header packing and LE int helpers.
test/Apache.Arrow.Variant.Tests/VariantBuilderTests.cs Adds tests for builder encoding of primitives/containers.
test/Apache.Arrow.Variant.Tests/TestVectors.cs Adds hand-crafted binary vectors for reader/builder tests.
test/Apache.Arrow.Variant.Tests/ParquetTestingVectorTests.cs Adds conformance tests comparing decoded variants to parquet-testing JSON expectations.
test/Apache.Arrow.Variant.Tests/Apache.Arrow.Variant.Tests.csproj Adds new test project (multi-target on Windows).
test/Apache.Arrow.Variant.Benchmarks/StructVariantValue.cs Adds a struct-based variant model for benchmark comparisons.
test/Apache.Arrow.Variant.Benchmarks/Program.cs Adds BenchmarkDotNet entry point.
test/Apache.Arrow.Variant.Benchmarks/MixedWorkloadBenchmarks.cs Adds a mixed build/traverse workload benchmark.
test/Apache.Arrow.Variant.Benchmarks/EqualityBenchmarks.cs Adds equality benchmarks (class vs struct).
test/Apache.Arrow.Variant.Benchmarks/EncodingBenchmarks.cs Adds encode/decode/JSON benchmarks.
test/Apache.Arrow.Variant.Benchmarks/CreationBenchmarks.cs Adds creation/constructor benchmarks.
test/Apache.Arrow.Variant.Benchmarks/ArrayBenchmarks.cs Adds array fill/sum benchmarks.
test/Apache.Arrow.Variant.Benchmarks/AccessBenchmarks.cs Adds typed-accessor benchmarks.
test/Apache.Arrow.Variant.Benchmarks/Apache.Arrow.Variant.Benchmarks.csproj Adds new benchmarks project.
src/Apache.Arrow.Variant/VariantValue.cs Introduces materialized VariantValue discriminated-union struct.
src/Apache.Arrow.Variant/VariantReader.cs Introduces zero-copy VariantReader and primitive materialization.
src/Apache.Arrow.Variant/VariantPrimitiveType.cs Defines primitive type IDs.
src/Apache.Arrow.Variant/VariantObjectReader.cs Introduces zero-copy object reader with binary-search field lookup.
src/Apache.Arrow.Variant/VariantMetadataBuilder.cs Introduces metadata dictionary builder with UTF-8 sort + remap.
src/Apache.Arrow.Variant/VariantMetadata.cs Introduces zero-copy metadata reader with find-by-bytes.
src/Apache.Arrow.Variant/VariantEncodingHelper.cs Adds encoding header helpers and variable-width LE ints.
src/Apache.Arrow.Variant/VariantBasicType.cs Defines base type IDs.
src/Apache.Arrow.Variant/VariantArrayReader.cs Introduces zero-copy array reader.
src/Apache.Arrow.Variant/Properties/AssemblyInfo.cs Exposes internals to the new test assembly.
src/Apache.Arrow.Variant/Json/VariantJsonWriter.cs Adds binary→JSON writer plus VariantValue→JSON helper.
src/Apache.Arrow.Variant/Json/VariantJsonReader.cs Adds JSON→binary entry points via VariantBuilder.
src/Apache.Arrow.Variant/Json/VariantJsonConverter.cs Adds System.Text.Json converter for VariantValue.
src/Apache.Arrow.Variant/Apache.Arrow.Variant.csproj Adds new library project and package references.
dev/release/verify_rc.sh References the new package in release-candidate verification.
dev/release/rat_exclude_files.txt Updates RAT exclusions for parquet-testing vectors.
Directory.Packages.props Adds central package version for System.Data.Common.
Apache.Arrow.sln Adds Variant projects to the solution.
Apache.Arrow.Tests.slnf Adds Variant tests project to test solution filter.
.gitmodules Adds submodule configuration for parquet-testing.
Comments suppressed due to low confidence (2)

test/Apache.Arrow.Variant.Tests/ParquetTestingVectorTests.cs:1

  • actual.AsDecimal() can throw OverflowException for Decimal16 values stored as SqlDecimal (which is explicitly supported elsewhere in this PR). This will make conformance tests fail on large Decimal16 values. Use actual.AsSqlDecimal() for Decimal16 and compare via a non-throwing representation (e.g., compare string forms when the JSON is a plain integer/decimal literal, or branch on actual.IsSqlDecimalStorage).
    test/Apache.Arrow.Variant.Tests/VariantReaderPrimitiveTests.cs:1
  • This pattern (try/catch + Assert.Fail) is repeated in multiple tests and is harder to read/maintain than xUnit’s built-in Assert.Throws<T>() (which you already use elsewhere in this PR). Prefer Assert.Throws<InvalidOperationException>(() => reader.GetInt32()); to reduce boilerplate and make intent clearer.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +477 to +495
/// <summary>Gets the object fields as a read-only dictionary.</summary>
public IReadOnlyDictionary<string, VariantValue> AsObject()
{
if (_primitiveType != ObjectSentinel)
{
throw new InvalidOperationException($"Cannot read object from variant type {_primitiveType}.");
}
return (Dictionary<string, VariantValue>)_objectValue;
}

/// <summary>Gets the array elements as a read-only list.</summary>
public IReadOnlyList<VariantValue> AsArray()
{
if (_primitiveType != ArraySentinel)
{
throw new InvalidOperationException($"Cannot read array from variant type {_primitiveType}.");
}
return (List<VariantValue>)_objectValue;
}
Copy link

Copilot AI Feb 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although the return types are IReadOnlyDictionary/IReadOnlyList, the implementation returns the underlying mutable Dictionary/List instances directly, which can be downcast and mutated. This undermines the “readonly struct / discriminated union” immutability expectations and can invalidate equality/hash semantics. Consider storing truly read-only wrappers (e.g., ReadOnlyDictionary / ReadOnlyCollection) or returning wrapper views to prevent mutation.

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking about it.

Copy link
Contributor

@adamreeve adamreeve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't closely checked every line here, and I'm also not very familiar with the variant type, but FWIW this looks good to me 😄

I see there's a canonical arrow.parquet.variant extension type, is the plan to implement that as a follow-up, in the same assembly?

I guess something that we might want to add later is features for working with shredded data, like allowing VariantJsonReader and VariantBuilder to exclude values for shredded fields. But I think it should be straightforward to add that without breaking any existing APIs.

VariantReader reader = new VariantReader(metadata, value);
WriteReader(writer, reader);
}
return Encoding.UTF8.GetString(ms.GetBuffer(), 0, (int)ms.Position);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use checked here and in ToJson to catch overflows?

/// <summary>
/// Gets the primitive type when <see cref="BasicType"/> is <see cref="VariantBasicType.Primitive"/>.
/// </summary>
public VariantPrimitiveType PrimitiveType => VariantEncodingHelper.GetPrimitiveType(Header);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should throw InvalidOperationException if BasicType isn't Primitive?

@CurtHagenlocher
Copy link
Contributor Author

Yes, both shredding and an extension type are on the menu. I'm a bit torn about the dependency layering for the extension type, though. I feel like the Parquet variant could be a useful type by itself without an Arrow dependency, so perhaps the type would live in the Arrow assembly instead of this one. What's your feeling?

@CurtHagenlocher
Copy link
Contributor Author

For instance, maybe Parquet.NET could make use of this as a standalone dependency (CC @aloneguid)

@adamreeve
Copy link
Contributor

Yes I guess the extension type will be quite lightweight and won't need any dependency on the new assembly, so it seems reasonable to add it to Apache.Arrow.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants