Skip to content

feat: Add support for Run-End Encoded arrays (REE) in Arrow .NET#260

Open
JorgeCandeias wants to merge 2 commits intoapache:mainfrom
JorgeCandeias:feature/run-end-encoding
Open

feat: Add support for Run-End Encoded arrays (REE) in Arrow .NET#260
JorgeCandeias wants to merge 2 commits intoapache:mainfrom
JorgeCandeias:feature/run-end-encoding

Conversation

@JorgeCandeias
Copy link

This PR attempts to add support for Run-End Encoded arrays by following established codebase patterns.

Notably:

  • New ArrowTypeId added.
  • New array type RunEndEncodedArray added.
  • New visitor method to handle the new array type.
  • New entry in the IPC serializer field type switch.
  • New RunEndEncodedType nested type.
  • Basic feature tests.

This PR is missing targeted performance benchmarks as I could not see an established structure to add them into. Please let me know if, where and how you would like me to create these. They could be useful to observe performance variance from various data sparsity patterns and decide what to optimise.

Please let me know if this PR is missing anything else.

Introduced RunEndEncodedType and RunEndEncodedArray classes to represent run-end encoded arrays, including validation and logical length calculation. Integrated REE support into ArrowArrayFactory and IPC serialization/deserialization (ArrowStreamWriter, ArrowReaderImplementation, ArrowTypeFlatbufferBuilder, MessageSerializer). Added unit tests for REE array creation, validation, serialization, and indexing. This enables efficient handling of consecutive runs of the same value in Arrow .NET.
Copy link
Contributor

@adamreeve adamreeve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for contributing this @JorgeCandeias!

I've left a few inline suggestions and questions.

I think one major issue with this change at the moment though is that a RunEndEncodedArray won't behave correctly when it has been sliced. FindPhysicalIndex should handle when the Offset of the ArrayData is non-zero, and also when the Length is less than the last run-end value.

switch (runEnds)
{
case Int16Array int16Array:
return int16Array.GetValue(int16Array.Length - 1) ?? 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a null run-end value here should throw an exception rather than return 0.

long? lastValue = int64Array.GetValue(int64Array.Length - 1);
if (lastValue.HasValue && lastValue.Value > int.MaxValue)
{
throw new ArgumentException("Run ends value exceeds maximum supported length.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any blocker to making this work with int64 indices instead of int, or was this just chosen to align with default .NET array indexing? It seems like we could choose to use int64 indices here and allow larger lengths?

while (left < right)
{
int mid = left + (right - left) / 2;
int runEnd = runEnds.GetValue(mid) ?? 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should also throw here if we encounter a null run-end.

/// It contains two child arrays: run_ends (Int16/Int32/Int64) and values (any type).
/// The run_ends array stores the cumulative end positions of each run.
/// </summary>
public class RunEndEncodedArray : Array
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice if this had a helper method to make it easy to iterate over the expanded values. I guess this would only be possible for Values arrays that are a PrimitveArray. Eg. maybe something like a public IEnumerator<TValue> GetEnumerator<TValue>() method?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking something similar for dictionary-encoded values but hadn't yet gotten to the point of playing with implementation ideas. Ideally, a consumer ought to be able to say e.g "give me an IReadOnlyList<DateTimeOffset> for this column" and be able to get one whether it's straight values or encoded as a dictionary or as REE.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that would be nice. Maybe this is something that should be done later in a consistent way for dictionary and REE arrays rather than as part of this PR then.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants