Skip to content

fix: use spill writer's schema instead of the first batch schema for spill files#21293

Open
gruuya wants to merge 1 commit intoapache:mainfrom
splitgraph:union-nullable-spill-fix
Open

fix: use spill writer's schema instead of the first batch schema for spill files#21293
gruuya wants to merge 1 commit intoapache:mainfrom
splitgraph:union-nullable-spill-fix

Conversation

@gruuya
Copy link
Copy Markdown
Contributor

@gruuya gruuya commented Apr 1, 2026

Which issue does this PR close?

Rationale for this change

InProgressSpillFile will use the first batch schema when it has no writer, which could have different nullability from the subsequent batches (e.g. due to union-ing a literal projection and a table column projection). This can then lead to a panic in sort_batch.

InProgressSpillFile already has access to the canonical schema that the spill file should have (self.spill_writer.schema()).

What changes are included in this PR?

Single line fix: instead of using the first batch schema for the spill file schema, use the spill_writer's schema instead.

The rest of the changes are two new tests.

Are these changes tested?

Yes, two new tests are added.

Are there any user-facing changes?

@gruuya gruuya force-pushed the union-nullable-spill-fix branch from debff98 to fd58942 Compare April 1, 2026 08:58
@github-actions github-actions bot added core Core DataFusion crate physical-plan Changes to the physical-plan crate labels Apr 1, 2026
@gruuya gruuya force-pushed the union-nullable-spill-fix branch 2 times, most recently from 635b511 to d0bcb4e Compare April 1, 2026 10:58
@gruuya gruuya force-pushed the union-nullable-spill-fix branch from d0bcb4e to 2f08c97 Compare April 1, 2026 11:19
Copy link
Copy Markdown
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @gruuya -- this makes sense to me

}
}

#[cfg(test)]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if the unit tests are needed give the end to end tests?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was actually of the opposite opinion—the e2e test is slightly unwieldy to me, and convoluted in a way, since it tries to reproduce the bug using non-standard primitives and hit the right spill conditions in an unnatural way (because otherwise it's hard to repro). For instance union/repartition exec used directly instead of sql/dataframe API, special task ctx, sort_batch called directly on batches instead of wrapping with SortExec.

Also InProgressSpillFile spill file unit tests were missing anyway.

Either way, I'm fine with dropping either one (or none) of those, let me know what you think.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate physical-plan Changes to the physical-plan crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Streaming panic when spilling batches with mixed nullability

2 participants