diff --git a/docs/internals/adr/deviations/001-dual-parquet-merge-engines.md b/docs/internals/adr/deviations/001-dual-parquet-merge-engines.md
new file mode 100644
index 00000000000..b8cd0d3d1d2
--- /dev/null
+++ b/docs/internals/adr/deviations/001-dual-parquet-merge-engines.md
@@ -0,0 +1,178 @@
+# Deviation 001: Dual Parquet merge engines during streaming-engine rollout
+
+## Summary
+
+Two Parquet merge engines coexist in production behind a runtime YAML flag.
+The streaming engine (`execute_merge_operation`) matches the intent of
+[ADR-003 §4](../003-time-windowed-sorted-compaction.md) (page-granular
+streaming, bounded memory). The in-memory engine
+(`merge_sorted_parquet_files`) is retained as the runtime fallback so an
+operator can flip back via configuration if the streaming engine hits a
+production bug. The dual-engine state is intentional and time-bounded —
+it ends when the streaming engine has soaked at the new default in
+production.
+
+## Related ADR
+
+- **ADR**: [ADR-003 Time-Windowed Sorted Compaction](../003-time-windowed-sorted-compaction.md)
+- **Section**: §4 Sorted Merge, Phase 2 (column streaming)
+
+## ADR States
+
+> Phase 2: Stream columns through the merge.
+>
+> Once the global sort order is determined, each column is read from the
+> input splits and written to the output in sorted order. Columns are
+> processed one at a time (or in small groups) for memory efficiency.
+>
+> For large columns, it may be advantageous to operate at **page
+> granularity** rather than loading an entire column from each input:
+> read individual Parquet pages from inputs as needed and write
+> individual pages to the output. This bounds memory usage for columns
+> with large values (e.g., high-cardinality string tags, large attribute
+> maps) and avoids materializing an entire column across all inputs
+> simultaneously.
+
+ADR-003 §4 describes the merge as a streaming operation that bounds
+memory by reading and writing pages incrementally. The in-memory
+`merge_sorted_parquet_files` engine pre-materializes whole columns from
+all inputs simultaneously — directly contrary to the ADR's stated
+memory model.
+
+## Current Implementation
+
+Both engines live in `quickwit-parquet-engine/src/merge/`:
+
+- **Streaming engine** (`execute_merge_operation`, in `merge/mod.rs`,
+  backed by `merge/streaming.rs`). Column-major, page-bounded body cache,
+  reads inputs through `RemoteByteSource`. This is the
+  ADR-003-compliant implementation. It is the unconditional path for
+  promotion merges (the in-memory path can't handle mixed
+  `rg_partition_prefix_len`) and the opt-in path for regular merges.
+- **In-memory engine** (`merge_sorted_parquet_files`, in `merge/mod.rs`).
+  Buffers all inputs through arrow-rs into memory, runs the merge under
+  `run_cpu_intensive`. This is the original bootstrap implementation
+  retained as the runtime fallback.
+
+`ParquetMergeExecutor::handle` routes between them:
+
+```rust
+let is_promotion = scratch.merge_operation.target_prefix_len_override.is_some();
+if is_promotion || self.use_streaming_engine {
+    execute_merge_operation(&op, sources, &output_dir, &config).await
+} else {
+    run_cpu_intensive(move || {
+        merge_sorted_parquet_files(&input_paths, &output_dir_clone, &config)
+    }).await
+}
+```
+
+The `use_streaming_engine` boolean is sourced from the node-level
+`IndexerConfig::parquet_merge_use_streaming_engine` YAML field, default
+`false`.
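+
+For illustration, a minimal node-config fragment that opts regular merges
+into the streaming engine might look like the following. The key sits under
+the `indexer` section, as in the `quickwit.yaml` test fixture touched by this
+change; omitting it leaves the default (`false`) in place:
+
+```yaml
+indexer:
+  # Opt regular Parquet merges into the streaming engine.
+  # Promotion merges use the streaming engine regardless of this flag.
+  parquet_merge_use_streaming_engine: true
+```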
+
+Row-content equivalence between the two engines is enforced by parity
+tests in `quickwit-parquet-engine/src/merge/tests.rs::parity`. These
+must keep passing as long as both engines coexist.
+
+## Signal Impact
+
+Applies to **metrics** (the only signal currently using the Parquet
+pipeline). Will apply to **traces** and **logs** when those signals
+adopt Parquet splits. The deviation does not affect Tantivy-backed
+pipelines.
+
+## Impact
+
+| Aspect | ADR Target | Current Reality |
+|--------|------------|-----------------|
+| Engines in production | One streaming engine | Two (streaming + in-memory) |
+| Memory model | Page-bounded; ~constant per column | In-memory engine: O(total input column size) per merge |
+| Configuration surface | None — engine choice is internal | One YAML flag (`parquet_merge_use_streaming_engine`) |
+| Code surface to maintain | One engine | Two engines + parity tests + routing branch |
+| Operator rollback | Not applicable — only one path | Flip flag to `false`, no redeploy needed |
+
+## Why This Exists
+
+The streaming engine is new code. ADR-003 describes the target memory
+model but does not guarantee bug-free first-deployment behavior. Three
+forces produced the dual-engine state:
+
+1. **Production safety.** The in-memory engine has been the live merge
+   path during the metrics pipeline's bring-up. Replacing it wholesale
+   in a single PR, without an in-place fallback, would mean any bug in
+   the streaming engine requires a redeploy to recover. With the flag,
+   recovery is `config edit + restart`.
+2. **Staged rollout.** Production confidence is built by enabling the
+   streaming engine on a soak fleet, observing it for the soak window
+   (≥ 2 weeks per the exit criteria), then flipping the default. The
+   dual-engine state is the necessary infrastructure for that rollout.
+3. **Parity is verifiable, not certain.** The parity tests in
+   `merge::tests::parity` cover representative synthetic fixtures.
+   Production data has shapes those fixtures don't cover. The fallback
+   exists because parity is a strong-but-not-total guarantee.
+
+## Priority Assessment
+
+- **PoC / MVP**: acceptable — dual-engine is in fact the deliberate
+  state during MVP rollout.
+- **Production (current)**: acceptable — flag defaults to `false`,
+  rollout has not begun. The streaming engine is exercised only by
+  promotion merges (whose execution will start once GAP-011 is closed).
+- **Production (post-soak)**: not acceptable. Once the streaming engine
+  has soaked at default-`true` in production, the in-memory engine
+  becomes dead code that complicates the merge-executor and obscures
+  the ADR-003 memory contract. Resolve before merging additional
+  significant work into `parquet_merge_executor.rs`.
+
+## Exit Criteria
+
+The deviation resolves when **all** of the following hold:
+
+1. `IndexerConfig::default_parquet_merge_use_streaming_engine` returns
+   `true` in `quickwit-config`.
+2. At least one production fleet has run with the flag set to `true` for
+   a soak window of ≥ 2 weeks with no merge-correctness incidents (no
+   data loss, no schema mismatch, no merge-output-rows-≠-input-rows
+   alerts).
+3. No rollback to the in-memory engine has been issued during the soak.
+
+When those are met, the follow-up PR:
+
+- Deletes `merge_sorted_parquet_files` from `quickwit-parquet-engine`.
+- Deletes the in-memory branch in `ParquetMergeExecutor::handle`.
+- Deletes the `use_streaming_engine` field on `ParquetMergeExecutor` and
+  `ParquetMergePipelineParams`.
+- Deletes `IndexerConfig::parquet_merge_use_streaming_engine`.
+- Deletes `merge::tests::parity` (with a single engine left, there is
+  nothing to compare).
+- Closes this deviation.
+
+## Work Required to Match ADR
+
+| Change | Difficulty | Description |
+|--------|------------|-------------|
+| Flip default to `true` | Trivial | One-line change in `IndexerConfig::default_parquet_merge_use_streaming_engine`. Lands after soak. |
+| Production soak | Operational | Run with `true` on at least one fleet for ≥ 2 weeks, monitor merge-correctness signals. |
+| Delete in-memory engine | Moderate | Remove `merge_sorted_parquet_files`, the fallback branch, the flag, and the parity tests. Mechanically straightforward but touches several call sites. |
+
+## Recommendation
+
+**Accept for now.** The dual-engine state is the deliberate output of a
+flag-with-fallback rollout pattern (see commit history of #6441 and
+related PRs). Resolution is a known follow-up, not technical debt that
+needs to be paid down ahead of schedule.
+
+Track the exit criteria in this doc. When all three conditions hold,
+open the deletion PR and close this deviation.
+
+## References
+
+- [ADR-003 Time-Windowed Sorted Compaction](../003-time-windowed-sorted-compaction.md) §4
+- [GAP-011 No legacy promotion planner](../gaps/011-no-legacy-promotion-planner.md)
+- [GAP-012 Merge downloads instead of streaming](../gaps/012-merge-downloads-instead-of-streaming.md)
+- PR #6441 (wire-in of the YAML flag)
+
+## Date
+
+2026-05-18
diff --git a/docs/internals/adr/deviations/README.md b/docs/internals/adr/deviations/README.md
index 000c2e5e0e5..2ddb1dce09f 100644
--- a/docs/internals/adr/deviations/README.md
+++ b/docs/internals/adr/deviations/README.md
@@ -94,8 +94,7 @@ Deviation files use sequential numbering: `001-short-description.md`
 
 | Deviation | Title | Related ADR | Priority |
 |-----------|-------|-------------|----------|
-
-*No deviations recorded yet.*
+| [001](./001-dual-parquet-merge-engines.md) | Dual Parquet merge engines during streaming-engine rollout | [ADR-003](../003-time-windowed-sorted-compaction.md) | Accept until post-soak |
 
 ## Lifecycle
 
diff --git a/docs/internals/adr/gaps/011-no-legacy-promotion-planner.md b/docs/internals/adr/gaps/011-no-legacy-promotion-planner.md
new file mode 100644
index 00000000000..ec1e5642e34
--- /dev/null
+++ b/docs/internals/adr/gaps/011-no-legacy-promotion-planner.md
@@ -0,0 +1,136 @@
+# GAP-011: No Planner-Level Legacy Promotion
+
+**Status**: Open
+**Discovered**: 2026-05-18
+**Context**: Codex review on PR #6423 (`feat(merge): legacy promotion path + body-col schema evolution`) flagged that the promotion path is wired end-to-end at the library + executor layer but has no production trigger at the planner / policy level.
+
+## Problem
+
+The streaming Parquet merge stack now contains a complete *legacy promotion* pipeline:
+
+- `ParquetMergeOperation::promote_legacy(splits, target_prefix_len)` constructs an operation with
+  `target_prefix_len_override = Some(target)`.
+- `merge::execute_merge_operation` routes each input through `LegacyInputAdapter` when its
+  declared `rg_partition_prefix_len < target` and through `StreamingParquetReader` otherwise. The
+  streaming engine then sees a homogeneous stream advertising `prefix_len = target` on every
+  input.
+- `ParquetMergeExecutor` (in `quickwit-indexing`) detects `target_prefix_len_override.is_some()`
+  and routes those merges through `execute_merge_operation` (with `LocalFileByteSource`) instead
+  of the in-memory `merge_sorted_parquet_files` path.
+- `merge_parquet_split_metadata` accepts a `mixed_prefix_ok: bool` flag so the post-merge
+  aggregator skips the input-side equality check.
+
+What's missing: **nothing in the planner ever creates a `promote_legacy` operation in
+production**. `MergePolicyState::record_split` buckets each split by
+`CompactionScope::from_split`, and that scope key includes `rg_partition_prefix_len`. Legacy
+splits (`prefix_len = 0`) and aligned splits (`prefix_len > 0`) therefore land in *different*
+buckets before `ParquetMergePolicy::operations` ever runs. The production policy then iterates
+each bucket independently and emits `ParquetMergeOperation::new` (regular merge). A repo-wide
+search finds `promote_legacy` only in tests.
+
+In a mixed deployment (legacy + aligned splits coexisting), legacy splits therefore stay in
+their `prefix_len = 0` bucket forever — never gaining the prefix alignment that downstream
+locality compaction depends on. The promotion plumbing is reachable only from tests.
+
+## Evidence
+
+- `quickwit-parquet-engine/src/merge/policy/mod.rs`: `ParquetMergePolicy::operations` calls
+  `ParquetMergeOperation::new(...)` only. `promote_legacy` is constructed only by tests in the
+  same file.
+- `MergePolicyState::record_split` keys its `BTreeMap` by `CompactionScope::from_split`. The
+  scope derivation includes `rg_partition_prefix_len`, so a legacy split and a prefix-aligned
+  split with otherwise identical sort fields / window / merge level are never compared by the
+  policy.
+- The executor branch added in PR #6423 (`scratch.merge_operation.target_prefix_len_override
+  .is_some()`) routes promotion through `execute_merge_operation`. Library coverage at
+  `test_promote_legacy_executor_end_to_end` exercises a `prefix_len = 0` + `prefix_len = 1` pair
+  successfully. But that operation is only ever constructed inside the test.
+
+## State of the Art
+
+- **Iceberg**: Compaction policies inspect file-level metadata (partitioning, sort order) and
+  can rewrite files to align with the latest table partitioning even when individual files
+  pre-date the change. The compaction service treats schema-evolution-style rewrites as
+  first-class operations.
+- **Husky**: Background re-organization passes that promote files into newer storage layouts.
+  Tracked separately from the size-tiered compaction policy so cost trade-offs can be tuned.
+
+In both cases, the design separates the *trigger* (decision to promote) from the *mechanism*
+(how the promotion is performed). Quickwit currently has the mechanism but not the trigger.
+
+## Potential Solutions
+
+### Option A: Merge legacy + aligned buckets in `CompactionScope::from_split`
+
+Drop `rg_partition_prefix_len` from the scope key (or normalize it to a target value before
+bucketing). The policy then sees legacy and aligned splits as candidates for the same
+compaction operation and `ParquetMergePolicy::operations` decides whether to emit a regular
+merge or a `promote_legacy` operation based on whether the bucket contains mixed prefix
+lengths.
+
+Simplest change, but requires the policy to detect mixed-prefix buckets and choose between
+`new` and `promote_legacy` per operation.
+
+### Option B: Dedicated promotion pass
+
+Run a separate pass before the regular compaction policy that scans for legacy splits and emits
+`promote_legacy` operations for them. The regular policy then sees only aligned splits.
+
+Cleaner separation of concerns, but means legacy splits are migrated *before* any opportunity
+to coalesce them with aligned neighbors in a single multi-input merge — possibly more work
+overall.
+
+### Option C: Hybrid — bucket together, prefer single-pass promotion
+
+Keep scope bucketing as in option A. Inside the policy, when a bucket contains mixed prefix
+lengths AND has enough splits to merit a multi-input merge, emit `promote_legacy`. When only
+legacy splits exist (no aligned neighbor), emit `promote_legacy` with the same target:
+single-input promotion is still valuable because it converts the file to the new format for future
+locality compaction.
+
+Most flexible; gives the policy the freedom to amortize promotion cost when there are aligned
+neighbors AND to still promote isolated legacy splits in the background.
+
+## Signal Impact
+
+Primarily affects **metrics** in the near term: the legacy split format pre-dates the
+prefix-aligned RG layout, and only metrics has both formats in flight today. Traces and logs
+on the Parquet path will eventually reach the same state if a layout change ever happens; the
+same planner machinery would cover them.
+
+## Cost Considerations
+
+Promotion is strictly more expensive than a regular merge: the legacy adapter buffers the full
+input file in memory and re-encodes it as a single-RG stream before the merge engine sees it.
+For 50 MB metrics splits this is acceptable; for larger inputs the in-memory buffer is the
+gating cost.
+
+The planner should account for this when scheduling — promotion is best amortized into a
+multi-input merge rather than performed as a standalone file rewrite. Option C's "prefer
+multi-input promotion, fall back to single-input" structure captures this.
+
+## Impact
+
+- **Severity**: Medium. Legacy splits accumulate cost (every query against them pays the
+  prefix-less scan cost) but correctness is preserved — the locality compaction stack still
+  works on aligned splits.
+- **Frequency**: Persistent. Legacy splits never migrate without an explicit trigger.
+- **Affected Areas**: `quickwit-parquet-engine/src/merge/policy/`, `quickwit-parquet-engine/src/merge/mod.rs` (`MergePolicyState::record_split` + `CompactionScope`).
+
+## Next Steps
+
+- [ ] Decide between options A / B / C based on operational priorities and benchmark data.
+- [ ] Design the policy-level "should promote?" heuristic: how many legacy splits before
+      triggering, whether to wait for aligned neighbors, how to deprioritize promotion vs
+      regular compaction.
+- [ ] Add metrics for `legacy_splits_pending_promotion` and `promotion_operations_emitted` so we
+      can observe the policy in production.
+- [ ] Wire whichever option is chosen, with an integration test that exercises the full path
+      (legacy split → planner → executor → published prefix-aligned split).
+
+## References
+
+- PR #6423 (legacy promotion path + body-col schema evolution).
+- Codex review comment id `4311184497` (raised the gap).
+- `test_promote_legacy_executor_end_to_end` in `quickwit-parquet-engine::merge::streaming` —
+  library-level coverage of the mechanism.
diff --git a/docs/internals/adr/gaps/012-merge-downloads-instead-of-streaming.md b/docs/internals/adr/gaps/012-merge-downloads-instead-of-streaming.md
new file mode 100644
index 00000000000..871cdb84bb3
--- /dev/null
+++ b/docs/internals/adr/gaps/012-merge-downloads-instead-of-streaming.md
@@ -0,0 +1,164 @@
+# GAP-012: Parquet Merge Executor Downloads Inputs Instead of Streaming Them
+
+**Status**: Open
+**Discovered**: 2026-05-18
+**Context**: Code review of the Parquet streaming merge stack (PRs #6407–#6428) — specifically the executor wiring on #6423 — surfaced the question of why the merge actor downloads every input to local disk before merging when the streaming engine is designed around `RemoteByteSource`.
+
+## Problem
+
+The Parquet streaming merge engine in `quickwit-parquet-engine` consumes inputs through a
+minimal `RemoteByteSource` trait (`file_size`, `get_slice`, `get_slice_stream`). The trait was
+deliberately defined so the engine can pull pages column-major directly from object storage —
+two GETs per input (footer + body stream) and the merge progresses as bytes arrive, holding
+only the page-bounded engine state in memory.
+
+The actor pipeline in `quickwit-indexing` doesn't use that design. The
+`ParquetMergeSplitDownloader` actor pulls each input via `storage.copy_to_file(remote_path,
+local_path)` into a scratch directory, then hands `Vec<PathBuf>` to the
+`ParquetMergeExecutor`. The executor then either:
+
+- Calls the in-memory `merge_sorted_parquet_files(input_paths, ...)` (regular merges), which
+  reads each file fully into Arrow RecordBatches before merging, OR
+- Wraps each local path in a `LocalFileByteSource` and calls `execute_merge_operation` (added in
+  PR #6423 for promotion merges only).
+
+Either way, the streaming engine's central design benefit — overlapping the fetch with the
+merge and skipping the scratch disk entirely — is unused in production. Every merge reads each
+input twice: once over the network into scratch, once off scratch through the merger.
+
+## Evidence
+
+- `quickwit-indexing/src/actors/parquet_pipeline/parquet_merge_split_downloader.rs`: per-split
+  loop calling `self.storage.copy_to_file(...)` to materialize every input on local disk before
+  forwarding `ParquetMergeScratch` to the executor.
+- `quickwit-indexing/src/actors/parquet_pipeline/parquet_merge_executor.rs`: receives
+  `downloaded_parquet_files: Vec<PathBuf>` and chooses between the in-memory path or
+  `execute_merge_operation` with `LocalFileByteSource` wrappers — never a `RemoteByteSource`
+  that actually streams from object storage.
+- `quickwit-parquet-engine/src/storage/streaming_reader.rs:62-67`: the `RemoteByteSource` trait
+  doc explicitly notes that callers in `quickwit-indexing` "provide a thin adapter that
+  delegates to `quickwit_storage::Storage`." The adapter exists in principle but isn't wired up
+  for the merge executor.
+
+## State of the Art
+
+- **ClickHouse `MergeTree`**: parts are accessed via the same storage abstraction whether the
+  merge runs locally or against tiered/object storage. There's no separate "download then
+  merge" actor pair — the merger reads parts where they live.
+- **Iceberg compaction**: data files are read directly from object storage by the compaction
+  job. Local scratch is used only for the output file before commit.
+- **Husky**: column-major streaming merge reads directly from blob storage. Designed around the
+  "two GETs per input" model the Quickwit streaming engine inherits.
+
+Across these systems, downloading inputs before merging is treated as a fallback for
+operational reasons (unreliable network, kernel page-cache effects), not the default.
+
+## Trade-offs
+
+### Why download-first is the current default
+- **Retry locality**: the downloader actor centralizes retry/backoff/timeout for one file at a
+  time. A mid-fetch S3 hiccup retries the download alone; the merger sees only successful
+  downloads.
+- **Pure-compute executor**: once files are on disk the executor has no network dependency.
+  Mid-merge failures are restricted to disk I/O and compute errors.
+- **Predictable disk budget**: scratch usage is bounded by `Σ input_sizes` per concurrent
+  merge. Easy to reason about; easy to cap.
+- **Legacy in-memory path**: `merge_sorted_parquet_files` predates the streaming engine and
+  requires local file paths. The download-first pattern existed before there was a streaming
+  alternative.
+
+### What download-first costs
+- **2× I/O per merge**: each input is transferred over the network into scratch AND read off
+  scratch into the merger. The kernel page cache mitigates the disk-read pass to some extent but
+  doesn't fully erase it.
+- **Serialized phases**: the merge can't start until *all* inputs are downloaded. First-byte
+  latency on the merger is `max(input download time)` instead of `min(input first-byte time)`.
+- **Scratch disk usage**: a typical compaction merging 8× 50 MB splits holds 400 MB of scratch
+  per merge, multiplied by the concurrent merge count. On lightweight indexer pods this caps
+  parallelism.
+- **Underused design**: the streaming engine's single-body-GET model + page-bounded memory was
+  built specifically for the no-scratch-disk case. Wiring through `LocalFileByteSource` works
+  but bypasses the property the design was built around.
+
+### What streaming-directly would cost
+- **Mid-merge retry surface**: a connection failure mid-body-GET kills the merge attempt
+  entirely. Single-body-GET is forward-only — no partial recovery. The retry surface becomes
+  "the merge failed after 30 % of work," not "the download failed, retry the file."
+- **Per-merge S3 connection count**: an N-way merge holds N concurrent body streams plus N
+  footer connections. On dense merger nodes this multiplies.
+- **Tail latency**: the merge progresses at the speed of the slowest input. With downloads,
+  parallel fetches average out; with streaming a slow input throttles the whole merge.
+
+## Potential Solutions
+
+### Option A: Streaming-directly when the input is reachable, download as fallback
+
+The executor receives a hint from the storage layer (or detects mid-merge failure rates) and
+chooses per merge. Splits stored on reliable, low-latency backends go through `RemoteByteSource`
+adapters that talk directly to `quickwit_storage::Storage`; on flaky or high-latency backends
+the downloader actor still materializes files first.
+
+Largest design lift but matches what mature compaction systems do.
+
+### Option B: Stream-directly by default, fall back to download on persistent failures
+
+Default to streaming; a circuit-breaker on per-merge failure rate routes the next attempt
+through download-first. Operationally simpler than Option A; tail latency is bounded by the
+circuit's reaction time.
+
+### Option C: Keep download-first but eliminate the in-memory merge path
+
+Make every merge go through `execute_merge_operation` with `LocalFileByteSource`. This doesn't
+recover the streaming engine's "no scratch disk" benefit but does remove the legacy in-memory
+codepath, simplifying the executor to a single path.
+
+Smallest change, smallest gain. Worth doing regardless of A/B as a stepping stone.
+
+### Option D: Streaming-directly only for promotion merges
+
+Promotion already routes through `execute_merge_operation`; extend it to skip the download
+phase entirely for those operations and let the regular path stay as-is. Gains: promotion
+merges backed by the legacy adapter (the heaviest in-memory-buffering case in the pipeline)
+avoid double I/O. Costs: executor logic splits into "promotion = stream" vs "regular = download."
+
+## Signal Impact
+
+All Parquet-backed signals. Metrics is the first product to ship, so the impact lands on
+metrics first; traces and logs (when they migrate to Parquet storage) will pay the same cost
+unless this is addressed by then.
+
+## Cost Considerations
+
+The streaming engine's body-col page cache is already designed for backpressure: pages stream
+in column-major order as bytes arrive, and the engine processes them as quickly as it can. The
+wall-clock cost of streaming-directly is then set by the slowest input's transfer rate rather
+than by the total input size: usually a shorter wait, but with a longer tail.
+
+## Impact
+
+- **Severity**: Medium. Correctness is unaffected; the streaming engine works equivalently
+  whether the source is local or remote. The cost is bandwidth, disk, and wall-clock latency.
+- **Frequency**: Every merge in production today pays the download cost.
+- **Affected Areas**: `quickwit-indexing/src/actors/parquet_pipeline/parquet_merge_split_downloader.rs`,
+  `quickwit-indexing/src/actors/parquet_pipeline/parquet_merge_executor.rs`,
+  `quickwit-parquet-engine::merge::execute_merge_operation` callers.
+
+## Next Steps
+
+- [ ] Measure the current download-vs-merge phase breakdown on a representative production
+      merge load (wall-clock + bytes-read + disk-write).
+- [ ] Build a `RemoteByteSource` adapter over `quickwit_storage::Storage` and prototype
+      streaming-directly for promotion merges (Option D) to validate the engine's behavior
+      against the existing storage backends.
+- [ ] Decide between options A / B based on observed mid-merge failure rates under real S3
+      conditions.
+- [ ] Even if the default stays download-first, consider Option C as a simplification — the
+      in-memory merge path is dead weight once the streaming engine handles every case.
+
+## References
+
+- PRs #6407–#6428 (Parquet streaming merge stack).
+- [PR #6423 discussion](https://github.com/quickwit-oss/quickwit/pull/6423) — surfaced the
+  question while wiring promotion through `execute_merge_operation`.
+- `quickwit-parquet-engine/src/storage/streaming_reader.rs` (`RemoteByteSource` trait).
+- `quickwit-indexing/src/actors/parquet_pipeline/parquet_merge_executor.rs::LocalFileByteSource`.
diff --git a/docs/internals/adr/gaps/README.md b/docs/internals/adr/gaps/README.md
index 2e69c27cdbd..e8ef3de02e3 100644
--- a/docs/internals/adr/gaps/README.md
+++ b/docs/internals/adr/gaps/README.md
@@ -115,3 +115,5 @@ Gap files use sequential numbering: `001-short-description.md`
 | [008](./008-no-high-query-rate-optimization.md) | No High Query Rate Optimization | Open | High |
 | [009](./009-no-leading-edge-prioritization.md) | No Leading Edge Prioritization | Open | High |
 | [010](./010-no-data-caching-or-query-affinity.md) | No Multi-Level Data Caching or Query Affinity Optimization | Open | High |
+| [011](./011-no-legacy-promotion-planner.md) | No Planner-Level Legacy Promotion | Open | Medium |
+| [012](./012-merge-downloads-instead-of-streaming.md) | Parquet Merge Executor Downloads Inputs Instead of Streaming Them | Open | Medium |
diff --git a/quickwit/Cargo.toml b/quickwit/Cargo.toml
index 5d9e5ce401d..83843be47ae 100644
--- a/quickwit/Cargo.toml
+++ b/quickwit/Cargo.toml
@@ -188,7 +188,7 @@ opentelemetry-appender-tracing = "0.31"
 opentelemetry_sdk = { version = "0.31", features = ["rt-tokio"] }
 opentelemetry-otlp = { version = "0.31", features = ["grpc-tonic", "http-json"] }
 ouroboros = "0.18"
-parquet = { version = "58", default-features = false, features = ["arrow", "snap", "variant_experimental", "zstd"] }
+parquet = { version = "58", default-features = false, features = ["arrow", "experimental", "snap", "variant_experimental", "zstd"] }
 percent-encoding = "2.3"
 pin-project = "1.1"
 pnet = { version = "0.35", features = ["std"] }
diff --git a/quickwit/quickwit-config/resources/tests/node_config/quickwit.json b/quickwit/quickwit-config/resources/tests/node_config/quickwit.json
index 7269b37ae22..6909594396b 100644
--- a/quickwit/quickwit-config/resources/tests/node_config/quickwit.json
+++ b/quickwit/quickwit-config/resources/tests/node_config/quickwit.json
@@ -54,7 +54,8 @@
         "split_store_max_num_splits": 10000,
         "max_concurrent_split_uploads": 8,
         "max_merge_write_throughput": "100mb",
-        "merge_concurrency": 2
+        "merge_concurrency": 2,
+        "parquet_merge_use_streaming_engine": true
     },
     "ingest_api": {
         "replication_factor": 2
diff --git a/quickwit/quickwit-config/resources/tests/node_config/quickwit.toml b/quickwit/quickwit-config/resources/tests/node_config/quickwit.toml
index ea715dcffe0..80d3ec28867 100644
--- a/quickwit/quickwit-config/resources/tests/node_config/quickwit.toml
+++ b/quickwit/quickwit-config/resources/tests/node_config/quickwit.toml
@@ -45,6 +45,7 @@ split_store_max_num_splits = 10_000
 max_concurrent_split_uploads = 8
 max_merge_write_throughput = "100mb"
 merge_concurrency = 2
+parquet_merge_use_streaming_engine = true
 
 [ingest_api]
 replication_factor = 2
diff --git a/quickwit/quickwit-config/resources/tests/node_config/quickwit.yaml b/quickwit/quickwit-config/resources/tests/node_config/quickwit.yaml
index face0852972..2c33cc367c0 100644
--- a/quickwit/quickwit-config/resources/tests/node_config/quickwit.yaml
+++ b/quickwit/quickwit-config/resources/tests/node_config/quickwit.yaml
@@ -49,6 +49,7 @@ indexer:
   max_concurrent_split_uploads: 8
   max_merge_write_throughput: 100mb
   merge_concurrency: 2
+  parquet_merge_use_streaming_engine: true
 
 ingest_api:
   replication_factor: 2
diff --git a/quickwit/quickwit-config/src/node_config/mod.rs b/quickwit/quickwit-config/src/node_config/mod.rs
index cf67768966d..fff19960e57 100644
--- a/quickwit/quickwit-config/src/node_config/mod.rs
+++ b/quickwit/quickwit-config/src/node_config/mod.rs
@@ -163,6 +163,16 @@ pub struct IndexerConfig {
     pub enable_cooperative_indexing: bool,
     #[serde(default = "IndexerConfig::default_cpu_capacity")]
     pub cpu_capacity: CpuCapacity,
+    /// If true, run Parquet merges through the streaming column-major engine
+    /// (`execute_merge_operation`). If false (default), use the in-memory
+    /// `merge_sorted_parquet_files` engine. The legacy in-memory engine is
+    /// kept as the runtime fallback so production can flip back to it
+    /// without redeploying if the streaming engine hits a bug. Promotion
+    /// merges (those with `target_prefix_len_override`) always go through
+    /// the streaming engine regardless of this flag — the in-memory path
+    /// can't handle mixed prefix lengths.
+    #[serde(default = "IndexerConfig::default_parquet_merge_use_streaming_engine")]
+    pub parquet_merge_use_streaming_engine: bool,
 }
 
 impl IndexerConfig {
@@ -201,6 +211,10 @@ impl IndexerConfig {
         CpuCapacity::one_cpu_thread() * (quickwit_common::num_cpus() as u32)
     }
 
+    fn default_parquet_merge_use_streaming_engine() -> bool {
+        false
+    }
+
     #[cfg(any(test, feature = "testsuite"))]
     pub fn for_test() -> anyhow::Result<Self> {
         use quickwit_proto::indexing::PIPELINE_FULL_CAPACITY;
@@ -213,6 +227,7 @@ impl IndexerConfig {
             cpu_capacity: PIPELINE_FULL_CAPACITY * 4u32,
             max_merge_write_throughput: None,
             merge_concurrency: NonZeroUsize::new(3).unwrap(),
+            parquet_merge_use_streaming_engine: Self::default_parquet_merge_use_streaming_engine(),
         };
         Ok(indexer_config)
     }
@@ -229,6 +244,7 @@ impl Default for IndexerConfig {
             cpu_capacity: Self::default_cpu_capacity(),
             merge_concurrency: Self::default_merge_concurrency(),
             max_merge_write_throughput: None,
+            parquet_merge_use_streaming_engine: Self::default_parquet_merge_use_streaming_engine(),
         }
     }
 }
diff --git a/quickwit/quickwit-config/src/node_config/serialize.rs b/quickwit/quickwit-config/src/node_config/serialize.rs
index ae19a92c45f..c7740e7146e 100644
--- a/quickwit/quickwit-config/src/node_config/serialize.rs
+++ b/quickwit/quickwit-config/src/node_config/serialize.rs
@@ -657,6 +657,7 @@ mod tests {
                 cpu_capacity: IndexerConfig::default_cpu_capacity(),
                 enable_cooperative_indexing: false,
                 max_merge_write_throughput: Some(ByteSize::mb(100)),
+                parquet_merge_use_streaming_engine: true,
             }
         );
         assert_eq!(
diff --git a/quickwit/quickwit-indexing/src/actors/indexing_service.rs b/quickwit/quickwit-indexing/src/actors/indexing_service.rs
index 3f0bd91c301..4453ffbfe00 100644
--- a/quickwit/quickwit-indexing/src/actors/indexing_service.rs
+++ b/quickwit/quickwit-indexing/src/actors/indexing_service.rs
@@ -113,6 +113,12 @@ pub struct IndexingService {
     counters: IndexingServiceCounters,
     local_split_store: Arc<IndexingSplitCache>,
     pub(crate) max_concurrent_split_uploads: usize,
+    /// Cached from `IndexerConfig`. Selects whether new Parquet merge
+    /// pipelines route regular merges through the streaming engine or
+    /// the in-memory fallback. Promotion merges always use the
+    /// streaming engine regardless of this flag.
+    #[cfg(feature = "metrics")]
+    pub(crate) parquet_merge_use_streaming_engine: bool,
     merge_pipeline_handles: HashMap<MergePipelineId, MergePipelineHandle>,
     #[cfg(feature = "metrics")]
     parquet_merge_pipeline_handles: HashMap<IndexUid, ParquetMergePipelineHandle>,
@@ -178,6 +184,8 @@ impl IndexingService {
             indexing_pipelines: Default::default(),
             counters: Default::default(),
             max_concurrent_split_uploads: indexer_config.max_concurrent_split_uploads,
+            #[cfg(feature = "metrics")]
+            parquet_merge_use_streaming_engine: indexer_config.parquet_merge_use_streaming_engine,
             merge_pipeline_handles: HashMap::new(),
             #[cfg(feature = "metrics")]
             parquet_merge_pipeline_handles: HashMap::new(),
@@ -723,6 +731,8 @@ impl IndexingService {
             max_concurrent_split_uploads: self.max_concurrent_split_uploads,
             event_broker: self.event_broker.clone(),
             writer_config,
+            use_streaming_engine: self.parquet_merge_use_streaming_engine,
+            target_split_size_bytes: cfg.target_split_size_bytes,
         };
 
         let pipeline = super::parquet_pipeline::ParquetMergePipeline::new(
diff --git a/quickwit/quickwit-indexing/src/actors/parquet_pipeline/mod.rs b/quickwit/quickwit-indexing/src/actors/parquet_pipeline/mod.rs
index 65ebe3a5333..fc598889d81 100644
--- a/quickwit/quickwit-indexing/src/actors/parquet_pipeline/mod.rs
+++ b/quickwit/quickwit-indexing/src/actors/parquet_pipeline/mod.rs
@@ -61,6 +61,10 @@ mod parquet_merge_pipeline_trace_conformance_test;
 #[allow(clippy::disallowed_methods)]
 mod parquet_merge_pipeline_sketch_test;
 
+#[cfg(test)]
+#[allow(clippy::disallowed_methods)]
+mod parquet_merge_pipeline_multi_metric_test;
+
 pub use parquet_doc_processor::{
     ParquetDocProcessor, ParquetDocProcessorCounters, ParquetDocProcessorError, is_arrow_ipc,
 };
diff --git a/quickwit/quickwit-indexing/src/actors/parquet_pipeline/parquet_merge_executor.rs b/quickwit/quickwit-indexing/src/actors/parquet_pipeline/parquet_merge_executor.rs
index 8d2649973a2..ffe27b09a0e 100644
--- a/quickwit/quickwit-indexing/src/actors/parquet_pipeline/parquet_merge_executor.rs
+++ b/quickwit/quickwit-indexing/src/actors/parquet_pipeline/parquet_merge_executor.rs
@@ -14,12 +14,36 @@
 
 //! Parquet merge executor actor.
 //!
-//! Calls the Phase 1 merge engine (`merge_sorted_parquet_files`) via
-//! `run_cpu_intensive()`, builds output split metadata using
-//! `merge_parquet_split_metadata()`, and sends the result to the uploader.
+//! Receives a `ParquetMergeScratch` from the downloader. Two engines are available:
+//!
+//! - **Streaming engine** (`execute_merge_operation`): column-major, page-bounded body cache. Used
+//!   unconditionally for promotion merges (the in-memory path can't handle mixed prefix lengths).
+//!   Optionally used for regular merges when the node-level
+//!   `IndexerConfig::parquet_merge_use_streaming_engine` flag is true.
+//! - **In-memory engine** (`merge_sorted_parquet_files`): buffers all inputs in memory and runs
+//!   inside `run_cpu_intensive`. Kept as the runtime fallback so production can flip back via YAML
+//!   config if the streaming engine hits a bug. To be removed once the streaming path has soaked.
+//!
+//! Routing in `handle()`:
+//!
+//! - `target_prefix_len_override.is_some()` → streaming engine. Promotion is the whole point of
+//!   `target_prefix_len_override`, and the in-memory path's `extract_and_validate_input_metadata`
+//!   would bail on the mixed `rg_partition_prefix_len` before any output is produced.
+//! - Else `use_streaming_engine == true` → streaming engine (the new default once soaked).
+//! - Else → in-memory engine (the runtime fallback).
+//!
+//! `mixed_prefix_ok` is passed to `merge_parquet_split_metadata` only for promotion merges so
+//! the post-merge aggregator's strict input-side equality check stays on for ordinary
+//! same-prefix merges.
+
+use std::io;
+use std::ops::Range;
+use std::path::{Path, PathBuf};
+use std::sync::Arc;
 
 use anyhow::Context;
 use async_trait::async_trait;
+use bytes::Bytes;
 use quickwit_actors::{Actor, ActorContext, ActorExitStatus, Handler, Mailbox, QueueCapacity};
 use quickwit_common::thread_pool::run_cpu_intensive;
 use quickwit_dst::check_invariant;
@@ -28,9 +52,12 @@ use quickwit_dst::invariants::merge_policy::{
     all_same_compaction_scope, all_same_merge_level, has_minimum_splits,
 };
 use quickwit_parquet_engine::merge::metadata_aggregation::merge_parquet_split_metadata;
-use quickwit_parquet_engine::merge::{MergeConfig, MergeOutputFile, merge_sorted_parquet_files};
-use quickwit_parquet_engine::storage::ParquetWriterConfig;
+use quickwit_parquet_engine::merge::{
+    MergeConfig, MergeOutputFile, execute_merge_operation, merge_sorted_parquet_files,
+};
+use quickwit_parquet_engine::storage::{ParquetWriterConfig, RemoteByteSource};
 use quickwit_proto::types::IndexUid;
+use tokio::io::{AsyncRead, AsyncReadExt, AsyncSeekExt};
 use tracing::{info, instrument, warn};
 
 use super::ParquetUploader;
@@ -38,6 +65,53 @@ use super::parquet_indexer::ParquetSplitBatch;
 use super::parquet_merge_messages::{ParquetMergeScratch, ParquetMergeTask};
 use crate::models::PublishLock;
 
+/// `RemoteByteSource` adapter over a single local file. Used by the streaming-engine
+/// path (all promotion merges, plus regular merges when `use_streaming_engine` is
+/// true) to feed downloaded scratch-directory files into `execute_merge_operation`
+/// (which composes them with `LegacyInputAdapter` as needed).
+///
+/// Each instance is bound to one absolute path at construction time
+/// and ignores the `path` argument from the trait methods — the
+/// trait surface assumes a remote backend keyed by path, but the
+/// downloader has already resolved each split to a concrete local
+/// file before the executor runs.
+struct LocalFileByteSource {
+    path: PathBuf,
+}
+
+impl LocalFileByteSource {
+    fn new(path: PathBuf) -> Self {
+        Self { path }
+    }
+}
+
+#[async_trait]
+impl RemoteByteSource for LocalFileByteSource {
+    async fn file_size(&self, _path: &Path) -> io::Result<u64> {
+        tokio::fs::metadata(&self.path).await.map(|m| m.len())
+    }
+
+    async fn get_slice(&self, _path: &Path, range: Range<u64>) -> io::Result<Bytes> {
+        let mut file = tokio::fs::File::open(&self.path).await?;
+        file.seek(io::SeekFrom::Start(range.start)).await?;
+        let len = (range.end - range.start) as usize;
+        let mut buf = vec![0u8; len];
+        file.read_exact(&mut buf).await?;
+        Ok(Bytes::from(buf))
+    }
+
+    async fn get_slice_stream(
+        &self,
+        _path: &Path,
+        range: Range<u64>,
+    ) -> io::Result<Box<dyn AsyncRead + Send + Unpin>> {
+        let mut file = tokio::fs::File::open(&self.path).await?;
+        file.seek(io::SeekFrom::Start(range.start)).await?;
+        let len = range.end - range.start;
+        Ok(Box::new(file.take(len)))
+    }
+}
+
 /// Executes Parquet merge operations using the Phase 1 k-way merge engine.
 ///
 /// Receives `ParquetMergeScratch` from the downloader, runs the merge as a
@@ -49,18 +123,58 @@ use crate::models::PublishLock;
 pub struct ParquetMergeExecutor {
     uploader_mailbox: Mailbox<ParquetUploader>,
     writer_config: ParquetWriterConfig,
+    /// When true, regular merges run through the streaming engine. When
+    /// false, they run through the in-memory `merge_sorted_parquet_files`
+    /// fallback. Promotion merges always use the streaming engine
+    /// regardless of this flag.
+    use_streaming_engine: bool,
+    /// Target output split size, sourced from the merge policy. Drives
+    /// `num_outputs`: the executor asks the merge engine for
+    /// `ceil(total_input_bytes / target_split_size_bytes)` outputs so a
+    /// merge that ingests more than one target's worth of data spreads
+    /// across multiple output files. The engine clamps the request to
+    /// the number of `sorted_series` boundaries actually available, so
+    /// the result is an upper bound rather than an exact count.
+    /// Operations whose inputs already fit in one target naturally get
+    /// `num_outputs = 1`.
+    target_split_size_bytes: u64,
 }
 
 impl ParquetMergeExecutor {
     pub fn new(
         uploader_mailbox: Mailbox<ParquetUploader>,
         writer_config: ParquetWriterConfig,
+        use_streaming_engine: bool,
+        target_split_size_bytes: u64,
     ) -> Self {
         Self {
             uploader_mailbox,
             writer_config,
+            use_streaming_engine,
+            target_split_size_bytes,
         }
     }
+
+    /// Compute the requested `num_outputs` for a merge. Returns at
+    /// least 1 (the merge always produces at least one output unless
+    /// every input is empty). Guards against the
+    /// `target_split_size_bytes = 0` misconfiguration — falling back to
+    /// 1 rather than dividing by zero.
+    fn compute_num_outputs(&self, total_input_bytes: u64) -> usize {
+        if self.target_split_size_bytes == 0 {
+            return 1;
+        }
+        // Ceiling division without overflow: `(a + b - 1) / b` can
+        // overflow when `a` is near u64::MAX, so divide first and add
+        // one only when the remainder is non-zero.
+        let quot = total_input_bytes / self.target_split_size_bytes;
+        let extra = if total_input_bytes.is_multiple_of(self.target_split_size_bytes) {
+            0
+        } else {
+            1
+        };
+        ((quot + extra) as usize).max(1)
+    }
 }
 
 #[async_trait]
@@ -108,18 +222,59 @@ impl Handler<ParquetMergeScratch> for ParquetMergeExecutor {
             .context("failed to create merge output directory")
             .map_err(|e| ActorExitStatus::from(anyhow::anyhow!(e)))?;
 
-        // Run the CPU-intensive merge on the dedicated thread pool.
-        let input_paths = scratch.downloaded_parquet_files.clone();
-        let output_dir_clone = output_dir.clone();
-        let writer_config = self.writer_config.clone();
-        let merge_result = run_cpu_intensive(move || {
+        // Promotion merges (`target_prefix_len_override.is_some()`) must
+        // use the streaming engine; the in-memory path's
+        // `extract_and_validate_input_metadata` would bail on mixed
+        // `rg_partition_prefix_len` before producing any output. Regular
+        // merges follow the operator-controlled `use_streaming_engine`
+        // flag: true means the streaming engine, false means the
+        // in-memory fallback. Keeping the in-memory branch lets
+        // production flip back at runtime if the streaming engine hits a
+        // bug; once the streaming path has soaked, the in-memory branch
+        // and `merge_sorted_parquet_files` itself can be removed.
+        let total_input_bytes = scratch.merge_operation.total_size_bytes();
+        let num_outputs = self.compute_num_outputs(total_input_bytes);
+        info!(
+            merge_split_id = %merge_split_id,
+            total_input_bytes,
+            target_split_size_bytes = self.target_split_size_bytes,
+            num_outputs,
+            "computed num_outputs from total input bytes / target split size"
+        );
+
+        let is_promotion = scratch.merge_operation.target_prefix_len_override.is_some();
+        let merge_result: Result<Result<Vec<MergeOutputFile>, _>, _> = if is_promotion
+            || self.use_streaming_engine
+        {
+            let sources: Vec<Arc<dyn RemoteByteSource>> = scratch
+                .downloaded_parquet_files
+                .iter()
+                .map(|path| Arc::new(LocalFileByteSource::new(path.clone())) as _)
+                .collect();
             let config = MergeConfig {
-                num_outputs: 1,
-                writer_config,
+                num_outputs,
+                writer_config: self.writer_config.clone(),
             };
-            merge_sorted_parquet_files(&input_paths, &output_dir_clone, &config)
-        })
-        .await;
+            Ok(
+                execute_merge_operation(&scratch.merge_operation, sources, &output_dir, &config)
+                    .await,
+            )
+        } else {
+            // Fallback: in-memory engine under `run_cpu_intensive`.
+            // Kept as the runtime rollback target while the streaming
+            // engine soaks in production.
+            let input_paths = scratch.downloaded_parquet_files.clone();
+            let output_dir_clone = output_dir.clone();
+            let writer_config = self.writer_config.clone();
+            run_cpu_intensive(move || {
+                let config = MergeConfig {
+                    num_outputs,
+                    writer_config,
+                };
+                merge_sorted_parquet_files(&input_paths, &output_dir_clone, &config)
+            })
+            .await
+        };
 
         let outputs: Vec<MergeOutputFile> = match merge_result {
             Ok(Ok(outputs)) => outputs,
@@ -231,17 +386,30 @@ impl Handler<ParquetMergeScratch> for ParquetMergeExecutor {
             return Ok(());
         }
 
+        // `mixed_prefix_ok` matches the operation's promotion mode:
+        // promote-legacy operations bundle inputs from different
+        // `rg_partition_prefix_len` buckets (the adapter normalizes
+        // them at read time), so the input-side equality check in
+        // `merge_parquet_split_metadata` would spuriously fail. Regular
+        // merges keep the strict check.
+        let mixed_prefix_ok = scratch.merge_operation.target_prefix_len_override.is_some();
+
         let mut merged_splits = Vec::with_capacity(outputs.len());
-        for output in &outputs {
-            let mut metadata = merge_parquet_split_metadata(input_splits, output)
+        // First output keeps the planner-assigned `merge_split_id` so
+        // existing observability paths (logs, metrics, traces keyed on
+        // this ID) continue to see the planned ID at execute time. For
+        // n>1 the subsequent outputs get fresh IDs generated by
+        // `merge_parquet_split_metadata`. Assigning the same ID to
+        // multiple outputs would collide on the rename below and
+        // overwrite earlier files.
+        for (output_idx, output) in outputs.iter().enumerate() {
+            let mut metadata = merge_parquet_split_metadata(input_splits, output, mixed_prefix_ok)
                 .context("failed to build merge output metadata")
                 .map_err(|e| ActorExitStatus::from(anyhow::anyhow!(e)))?;
 
-            // Use the split ID that was assigned when the merge operation was
-            // planned, rather than the one generated inside
-            // merge_parquet_split_metadata(). This keeps the ID consistent
-            // across scheduling, tracing, and the final published split.
-            metadata.split_id = scratch.merge_operation.merge_split_id.clone();
+            if output_idx == 0 {
+                metadata.split_id = scratch.merge_operation.merge_split_id.clone();
+            }
             metadata.parquet_file = metadata.split_id.to_string() + ".parquet";
 
             // Rename the output file to match the split ID.
diff --git a/quickwit/quickwit-indexing/src/actors/parquet_pipeline/parquet_merge_pipeline.rs b/quickwit/quickwit-indexing/src/actors/parquet_pipeline/parquet_merge_pipeline.rs
index 7794e3f14ed..e028dad0f3a 100644
--- a/quickwit/quickwit-indexing/src/actors/parquet_pipeline/parquet_merge_pipeline.rs
+++ b/quickwit/quickwit-indexing/src/actors/parquet_pipeline/parquet_merge_pipeline.rs
@@ -309,8 +309,12 @@ impl ParquetMergePipeline {
             .spawn(merge_uploader);
 
         // 4. Merge executor
-        let merge_executor =
-            ParquetMergeExecutor::new(merge_uploader_mailbox, self.params.writer_config.clone());
+        let merge_executor = ParquetMergeExecutor::new(
+            merge_uploader_mailbox,
+            self.params.writer_config.clone(),
+            self.params.use_streaming_engine,
+            self.params.target_split_size_bytes,
+        );
         let (merge_executor_mailbox, merge_executor_handle) = ctx
             .spawn_actor()
             .set_kill_switch(self.kill_switch.clone())
@@ -601,6 +605,20 @@ pub struct ParquetMergePipelineParams {
     /// Should match the ingest pipeline's writer config so merged files have
     /// consistent compression.
     pub writer_config: quickwit_parquet_engine::storage::ParquetWriterConfig,
+    /// When true, regular merges run through the streaming engine
+    /// (`execute_merge_operation`); when false, they run through the
+    /// in-memory `merge_sorted_parquet_files` fallback. Promotion
+    /// merges always use the streaming engine. Sourced from
+    /// `IndexerConfig::parquet_merge_use_streaming_engine`.
+    pub use_streaming_engine: bool,
+    /// Target output split size in bytes, sourced from the merge
+    /// policy. The executor uses this to compute `num_outputs` from
+    /// total input size, so a merge that ingests more than one
+    /// target's worth of data is allowed to spread across multiple
+    /// output files. Smaller values increase output split count
+    /// (subject to the number of `sorted_series` boundaries actually
+    /// available in the input).
+    pub target_split_size_bytes: u64,
 }
 
 #[cfg(test)]
@@ -646,6 +664,8 @@ mod tests {
             max_concurrent_split_uploads: 4,
             event_broker: EventBroker::default(),
             writer_config: quickwit_parquet_engine::storage::ParquetWriterConfig::default(),
+            use_streaming_engine: false,
+            target_split_size_bytes: 256 * 1024 * 1024,
         }
     }
 
diff --git a/quickwit/quickwit-indexing/src/actors/parquet_pipeline/parquet_merge_pipeline_crash_test.rs b/quickwit/quickwit-indexing/src/actors/parquet_pipeline/parquet_merge_pipeline_crash_test.rs
index 050c922e8d6..b5f39ec4733 100644
--- a/quickwit/quickwit-indexing/src/actors/parquet_pipeline/parquet_merge_pipeline_crash_test.rs
+++ b/quickwit/quickwit-indexing/src/actors/parquet_pipeline/parquet_merge_pipeline_crash_test.rs
@@ -203,6 +203,8 @@ async fn test_merge_pipeline_crash_and_restart() {
         max_concurrent_split_uploads: 4,
         event_broker: EventBroker::default(),
         writer_config: ParquetWriterConfig::default(),
+        use_streaming_engine: false,
+        target_split_size_bytes: 256 * 1024 * 1024,
     };
 
     let pipeline = ParquetMergePipeline::new(params, Some(initial_splits), universe.spawn_ctx());
@@ -326,6 +328,8 @@ async fn test_merge_pipeline_multi_round() {
         max_concurrent_split_uploads: 4,
         event_broker: EventBroker::default(),
         writer_config: ParquetWriterConfig::default(),
+        use_streaming_engine: false,
+        target_split_size_bytes: 256 * 1024 * 1024,
     };
 
     let pipeline = ParquetMergePipeline::new(params, Some(initial_splits), universe.spawn_ctx());
diff --git a/quickwit/quickwit-indexing/src/actors/parquet_pipeline/parquet_merge_pipeline_multi_metric_test.rs b/quickwit/quickwit-indexing/src/actors/parquet_pipeline/parquet_merge_pipeline_multi_metric_test.rs
new file mode 100644
index 00000000000..676eb578965
--- /dev/null
+++ b/quickwit/quickwit-indexing/src/actors/parquet_pipeline/parquet_merge_pipeline_multi_metric_test.rs
@@ -0,0 +1,1061 @@
+// Copyright 2021-Present Datadog, Inc.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+//! End-to-end pipeline tests exercising the merge engines on multi-input,
+//! multi-metric, multi-row-group fixtures, in both n=1 and m:n (n > 1)
+//! output configurations. Complements `parquet_merge_pipeline_test.rs`
+//! (which covers the simpler two-input, one-metric-per-input case) with
+//! the harder scenarios:
+//!
+//! - **Three inputs**, each carrying **three metrics** (`aaa.alpha`, `bbb.beta`, `ccc.gamma`).
+//!   Across inputs, the metrics overlap and the per-metric timeseries IDs collide (each row's
+//!   `timeseries_id` is derived from the metric name, so input-x, input-y, input-z all share the
+//!   same set of IDs per metric). Timestamps within each (metric, timeseries) overlap across inputs
+//!   but are unique — the merge must interleave rows from all three inputs heavily, not concatenate
+//!   them.
+//! - **Multi-row-group output** via `ParquetWriterConfig::row_group_size = 50` on the n=1 tests, so
+//!   the 180-row merge output breaks into 4 row groups. Exercises the writer's multi-RG path in
+//!   both engines.
+//! - **Multi-row-group inputs with `rg_partition_prefix_len = 1`** in the bonus tests
+//!   (`write_prefix_aligned_input`): the writer flushes one row group per distinct `metric_name`,
+//!   so each input file carries three row groups in alignment with the sort prefix. The streaming
+//!   engine reads these through its prefix-aware fast path.
+//! - **m:n merges** in the bonus tests: a small
+//!   `ParquetMergePipelineParams::target_split_size_bytes` forces the executor to ask the engine
+//!   for `num_outputs > 1`. The bonus assertions cover the multi-output contract —
+//!   sum-equals-total, internal monotonicity, inter-output `sorted_series` disjointness, and
+//!   union-equals-full-set on metrics/services.
+//!
+//! Both `ParquetMergePipelineParams::use_streaming_engine = false` (the
+//! in-memory engine) and `= true` (the streaming engine) are exercised on
+//! both fixture shapes (n=1 with prefix_len=0, n>1 with prefix_len=1): four scenarios.
+//! Streaming-engine variants additionally assert
+//! `PEAK_BODY_COL_PAGE_CACHE_LEN > 0` to confirm the flag routed through
+//! the streaming path and not the in-memory fallback.
+
+use std::collections::HashSet;
+use std::path::Path;
+use std::sync::Arc;
+use std::sync::atomic::Ordering;
+use std::time::Duration;
+
+use arrow::record_batch::RecordBatch;
+use quickwit_actors::Universe;
+use quickwit_common::pubsub::EventBroker;
+use quickwit_common::temp_dir::TempDirectory;
+use quickwit_common::test_utils::wait_until_predicate;
+use quickwit_metastore::StageParquetSplitsRequestExt;
+use quickwit_parquet_engine::merge::policy::{
+    ConstWriteAmplificationParquetMergePolicy, ParquetMergePolicyConfig,
+};
+use quickwit_parquet_engine::sorted_series::SORTED_SERIES_COLUMN;
+use quickwit_parquet_engine::split::{ParquetSplitId, ParquetSplitMetadata, TimeRange};
+use quickwit_parquet_engine::storage::{ParquetWriter, ParquetWriterConfig};
+use quickwit_parquet_engine::table_config::TableConfig;
+use quickwit_proto::metastore::{
+    EmptyResponse, MetastoreServiceClient, MockMetastoreService, StageMetricsSplitsRequest,
+};
+use quickwit_storage::{RamStorage, Storage};
+
+use super::parquet_merge_pipeline::{ParquetMergePipeline, ParquetMergePipelineParams};
+use super::parquet_merge_pipeline_test::{
+    create_custom_test_batch, extract_binary_column, extract_string_column, extract_u64_column,
+    read_parquet_from_bytes, write_test_parquet_file,
+};
+
+// ---------------------------------------------------------------------------
+// Multi-metric fixture
+// ---------------------------------------------------------------------------
+
+/// Per-metric specification for one input file: `(metric_name, ts_start, num_rows)`.
+/// All rows for a given metric share the same `service` / `host` tag values
+/// across the file.
+type MetricSpec<'a> = (&'a str, u64, usize);
+
+/// Concatenate per-metric `RecordBatch`es into a single batch covering several
+/// metrics. The writer downstream sorts the rows before writing, so the
+/// caller does not need to sort.
+fn create_multi_metric_batch(
+    metrics: &[MetricSpec<'_>],
+    service_val: &str,
+    host_val: &str,
+) -> RecordBatch {
+    let batches: Vec<RecordBatch> = metrics
+        .iter()
+        .map(|(metric, ts_start, num_rows)| {
+            create_custom_test_batch(metric, *ts_start, *num_rows, service_val, host_val)
+        })
+        .collect();
+    let schema = batches[0].schema();
+    arrow::compute::concat_batches(&schema, &batches).expect("concat multi-metric batches")
+}
+
+/// Build a `ParquetSplitMetadata` advertising multiple metric names and a
+/// caller-supplied row count + time range. `prefix_len` controls the
+/// `rg_partition_prefix_len` stamped into the input file — 0 for legacy
+/// inputs, 1 for prefix-aligned inputs (each row group covers exactly one
+/// `metric_name`).
+fn make_multi_metric_split_metadata(
+    split_id: &str,
+    num_rows: u64,
+    size_bytes: u64,
+    ts_start: u64,
+    ts_end: u64,
+    metric_names: &[&str],
+    prefix_len: u32,
+) -> ParquetSplitMetadata {
+    let table_config = TableConfig::default();
+    let mut builder = ParquetSplitMetadata::metrics_builder()
+        .split_id(ParquetSplitId::new(split_id))
+        .index_uid("test-merge-index-multi:00000000000000000000000001")
+        .partition_id(0)
+        .time_range(TimeRange::new(ts_start, ts_end))
+        .num_rows(num_rows)
+        .size_bytes(size_bytes)
+        .sort_fields(table_config.effective_sort_fields())
+        .window_start_secs(0)
+        .window_duration_secs(900)
+        .rg_partition_prefix_len(prefix_len);
+    for metric in metric_names {
+        builder = builder.add_metric_name(*metric);
+    }
+    builder.build()
+}
+
+/// Write a multi-metric input file with `rg_partition_prefix_len = 1` and
+/// one row group per distinct metric. Picks `row_group_size = per-metric row
+/// count` so the writer naturally flushes at every metric boundary after
+/// sorting — each row group ends up containing exactly one distinct
+/// `metric_name`, satisfying the prefix-alignment invariant the writer
+/// validates on close.
+///
+/// Returns the file size in bytes (the caller stamps this back into the
+/// `ParquetSplitMetadata.size_bytes` field for the planner / executor).
+fn write_prefix_aligned_input(
+    dir: &Path,
+    filename: &str,
+    batch: &RecordBatch,
+    split_metadata: &ParquetSplitMetadata,
+    rows_per_metric: usize,
+) -> u64 {
+    let table_config = TableConfig::default();
+    let writer_config = ParquetWriterConfig::default().with_row_group_size(rows_per_metric);
+    let writer = ParquetWriter::new(writer_config, &table_config)
+        .expect("test ParquetWriter (prefix-aligned)");
+    let path = dir.join(filename);
+    let (file_size, _write_metadata) = writer
+        .write_to_file_with_metadata(batch, &path, Some(split_metadata))
+        .expect("write_to_file_with_metadata for prefix-aligned input");
+    file_size
+}
+
+/// Three canonical metric names sorted alphabetically. Picked so the sort
+/// order is unambiguous (every byte-comparison resolves on the first
+/// distinguishing character of the metric name).
+const METRIC_AAA: &str = "aaa.alpha";
+const METRIC_BBB: &str = "bbb.beta";
+const METRIC_CCC: &str = "ccc.gamma";
+
+/// Per-metric row count in each input. With three inputs and three metrics,
+/// the total input/output row count is `3 * 3 * ROWS_PER_METRIC_PER_INPUT = 180`.
+const ROWS_PER_METRIC_PER_INPUT: usize = 20;
+
+/// Total rows across all three inputs. Held constant across all tests in this
+/// module so per-test expectations stay self-consistent.
+const TOTAL_INPUT_ROWS: u64 = 3 * 3 * ROWS_PER_METRIC_PER_INPUT as u64;
+
+/// Output writer's row-group size. Picked small enough that the n=1 merge
+/// (180 rows) produces 4 row groups and each n=3 output (~60 rows) produces
+/// 2. Exercises the multi-row-group write path in both engines without
+/// triggering prefix-alignment validation (we do not set
+/// `rg_partition_prefix_len` on inputs).
+const TEST_OUTPUT_ROW_GROUP_SIZE: usize = 50;
+
+/// Three inputs that share metric names and per-metric `timeseries_id`
+/// ranges (every row's `timeseries_id` is derived from the metric name
+/// hash + per-input row index, so cross-input collisions for the same
+/// `(metric_name, row_index_within_metric)` are intentional). Timestamps
+/// are picked so each (metric, timeseries) appears in all three inputs at
+/// three distinct overlapping timestamps — the merge must interleave row
+/// streams from all three inputs, not concatenate them.
+///
+/// Returns `(file_paths, splits_metadata)` for direct use by the pipeline
+/// (one entry per input, paths uploaded to the supplied storage).
+async fn build_three_multi_metric_inputs(
+    temp_dir: &Path,
+    ram_storage: &Arc<dyn Storage>,
+) -> (Vec<std::path::PathBuf>, Vec<ParquetSplitMetadata>) {
+    // input-x: every metric at ts 100..120 with service=web, host=h1.
+    let batch_x = create_multi_metric_batch(
+        &[
+            (METRIC_AAA, 100, ROWS_PER_METRIC_PER_INPUT),
+            (METRIC_BBB, 100, ROWS_PER_METRIC_PER_INPUT),
+            (METRIC_CCC, 100, ROWS_PER_METRIC_PER_INPUT),
+        ],
+        "web",
+        "h1",
+    );
+
+    // input-y: every metric at ts 110..130 with service=api, host=h2.
+    let batch_y = create_multi_metric_batch(
+        &[
+            (METRIC_AAA, 110, ROWS_PER_METRIC_PER_INPUT),
+            (METRIC_BBB, 110, ROWS_PER_METRIC_PER_INPUT),
+            (METRIC_CCC, 110, ROWS_PER_METRIC_PER_INPUT),
+        ],
+        "api",
+        "h2",
+    );
+
+    // input-z: every metric at ts 120..140 with service=db, host=h3.
+    let batch_z = create_multi_metric_batch(
+        &[
+            (METRIC_AAA, 120, ROWS_PER_METRIC_PER_INPUT),
+            (METRIC_BBB, 120, ROWS_PER_METRIC_PER_INPUT),
+            (METRIC_CCC, 120, ROWS_PER_METRIC_PER_INPUT),
+        ],
+        "db",
+        "h3",
+    );
+
+    let mut paths = Vec::new();
+    let mut splits = Vec::new();
+    for (split_id, batch, ts_start, ts_end, service, host) in [
+        ("split-x", &batch_x, 100, 120, "web", "h1"),
+        ("split-y", &batch_y, 110, 130, "api", "h2"),
+        ("split-z", &batch_z, 120, 140, "db", "h3"),
+    ] {
+        let _ = (service, host); // listed for readability; tags are already set in the batch
+        let filename = format!("{split_id}.parquet");
+        let num_rows = (3 * ROWS_PER_METRIC_PER_INPUT) as u64;
+        let meta = make_multi_metric_split_metadata(
+            split_id,
+            num_rows,
+            0, // size_bytes filled in below
+            ts_start,
+            ts_end,
+            &[METRIC_AAA, METRIC_BBB, METRIC_CCC],
+            0, // prefix_len = 0: legacy default, no per-RG alignment claim
+        );
+        let size_bytes = write_test_parquet_file(temp_dir, &filename, batch, &meta);
+        // Re-build metadata now that size_bytes is known. Mirrors what the
+        // simpler test does — keep both fields self-consistent.
+        let meta = {
+            let mut m = make_multi_metric_split_metadata(
+                split_id,
+                num_rows,
+                size_bytes,
+                ts_start,
+                ts_end,
+                &[METRIC_AAA, METRIC_BBB, METRIC_CCC],
+                0,
+            );
+            m.parquet_file = filename.clone();
+            m
+        };
+        let bytes_on_disk = std::fs::read(temp_dir.join(&filename)).unwrap();
+        ram_storage
+            .put(Path::new(&filename), Box::new(bytes_on_disk))
+            .await
+            .unwrap();
+        paths.push(temp_dir.join(&filename));
+        splits.push(meta);
+    }
+    (paths, splits)
+}
+
+/// Same shape and content as `build_three_multi_metric_inputs`, but each
+/// input is written with `rg_partition_prefix_len = 1` and one row group
+/// per distinct metric. With sort schema `metric_name | ...` and
+/// `row_group_size = ROWS_PER_METRIC_PER_INPUT`, the writer naturally
+/// flushes a row group every `ROWS_PER_METRIC_PER_INPUT` rows after
+/// sorting — those flush boundaries align with metric_name transitions,
+/// so each row group contains rows for exactly one distinct
+/// `metric_name`. The writer's prefix-alignment self-check passes, and
+/// the streaming engine reads the inputs as prefix_len=1 multi-row-group
+/// files.
+async fn build_three_prefix_aligned_multi_metric_inputs(
+    temp_dir: &Path,
+    ram_storage: &Arc<dyn Storage>,
+) -> (Vec<std::path::PathBuf>, Vec<ParquetSplitMetadata>) {
+    let batch_x = create_multi_metric_batch(
+        &[
+            (METRIC_AAA, 100, ROWS_PER_METRIC_PER_INPUT),
+            (METRIC_BBB, 100, ROWS_PER_METRIC_PER_INPUT),
+            (METRIC_CCC, 100, ROWS_PER_METRIC_PER_INPUT),
+        ],
+        "web",
+        "h1",
+    );
+    let batch_y = create_multi_metric_batch(
+        &[
+            (METRIC_AAA, 110, ROWS_PER_METRIC_PER_INPUT),
+            (METRIC_BBB, 110, ROWS_PER_METRIC_PER_INPUT),
+            (METRIC_CCC, 110, ROWS_PER_METRIC_PER_INPUT),
+        ],
+        "api",
+        "h2",
+    );
+    let batch_z = create_multi_metric_batch(
+        &[
+            (METRIC_AAA, 120, ROWS_PER_METRIC_PER_INPUT),
+            (METRIC_BBB, 120, ROWS_PER_METRIC_PER_INPUT),
+            (METRIC_CCC, 120, ROWS_PER_METRIC_PER_INPUT),
+        ],
+        "db",
+        "h3",
+    );
+
+    let mut paths = Vec::new();
+    let mut splits = Vec::new();
+    for (split_id, batch, ts_start, ts_end) in [
+        ("split-px", &batch_x, 100, 120),
+        ("split-py", &batch_y, 110, 130),
+        ("split-pz", &batch_z, 120, 140),
+    ] {
+        let filename = format!("{split_id}.parquet");
+        let num_rows = (3 * ROWS_PER_METRIC_PER_INPUT) as u64;
+        let meta = make_multi_metric_split_metadata(
+            split_id,
+            num_rows,
+            0,
+            ts_start,
+            ts_end,
+            &[METRIC_AAA, METRIC_BBB, METRIC_CCC],
+            1, // prefix_len = 1: one row group per metric_name.
+        );
+        let size_bytes = write_prefix_aligned_input(
+            temp_dir,
+            &filename,
+            batch,
+            &meta,
+            ROWS_PER_METRIC_PER_INPUT,
+        );
+        let meta = {
+            let mut m = make_multi_metric_split_metadata(
+                split_id,
+                num_rows,
+                size_bytes,
+                ts_start,
+                ts_end,
+                &[METRIC_AAA, METRIC_BBB, METRIC_CCC],
+                1,
+            );
+            m.parquet_file = filename.clone();
+            m
+        };
+        let bytes_on_disk = std::fs::read(temp_dir.join(&filename)).unwrap();
+        ram_storage
+            .put(Path::new(&filename), Box::new(bytes_on_disk))
+            .await
+            .unwrap();
+        paths.push(temp_dir.join(&filename));
+        splits.push(meta);
+    }
+    (paths, splits)
+}
+
+// ---------------------------------------------------------------------------
+// Mock metastore plumbing (captures staged + published state)
+// ---------------------------------------------------------------------------
+
+/// Handles returned by `mount_capturing_metastore`: caller waits on
+/// `publish_called`, then reads `staged_metadata` / `replaced_ids`.
+struct MetastoreCapture {
+    metastore: MetastoreServiceClient,
+    staged_metadata: Arc<std::sync::Mutex<Vec<ParquetSplitMetadata>>>,
+    replaced_ids: Arc<std::sync::Mutex<Vec<String>>>,
+    publish_called: Arc<std::sync::atomic::AtomicBool>,
+}
+
+fn mount_capturing_metastore() -> MetastoreCapture {
+    let mut mock_metastore = MockMetastoreService::new();
+
+    let staged_metadata: Arc<std::sync::Mutex<Vec<ParquetSplitMetadata>>> =
+        Arc::new(std::sync::Mutex::new(Vec::new()));
+    let staged_metadata_clone = staged_metadata.clone();
+    mock_metastore.expect_stage_metrics_splits().returning(
+        move |request: StageMetricsSplitsRequest| {
+            let splits = request
+                .deserialize_splits_metadata()
+                .expect("failed to deserialize staged metadata");
+            staged_metadata_clone.lock().unwrap().extend(splits);
+            Ok(EmptyResponse {})
+        },
+    );
+
+    let publish_called = Arc::new(std::sync::atomic::AtomicBool::new(false));
+    let publish_called_clone = publish_called.clone();
+    let replaced_ids = Arc::new(std::sync::Mutex::new(Vec::<String>::new()));
+    let replaced_ids_clone = replaced_ids.clone();
+    mock_metastore
+        .expect_publish_metrics_splits()
+        .returning(move |request| {
+            replaced_ids_clone
+                .lock()
+                .unwrap()
+                .extend(request.replaced_split_ids.clone());
+            publish_called_clone.store(true, Ordering::SeqCst);
+            Ok(EmptyResponse {})
+        });
+
+    MetastoreCapture {
+        metastore: MetastoreServiceClient::from_mock(mock_metastore),
+        staged_metadata,
+        replaced_ids,
+        publish_called,
+    }
+}
+
+// ---------------------------------------------------------------------------
+// Pipeline-params builder
+// ---------------------------------------------------------------------------
+
+fn make_pipeline_params(
+    universe: &Universe,
+    metastore: MetastoreServiceClient,
+    ram_storage: Arc<dyn Storage>,
+    use_streaming_engine: bool,
+    target_split_size_bytes: u64,
+    max_merge_ops: u32,
+) -> ParquetMergePipelineParams {
+    // merge_factor = max_merge_factor = 3 lets the planner pick up all
+    // three inputs in a single operation. `target_split_size_bytes` on
+    // the policy controls when an additional merge-up is scheduled;
+    // `target_split_size_bytes` on the pipeline params controls how the
+    // executor splits the merge output. They are the same value in
+    // production but kept independent in tests so the bonus test can
+    // ask for a small output target without disturbing the planner.
+    //
+    // `max_merge_ops` bounds the cascade depth. n=1 tests use 5 (plenty
+    // of headroom — they produce one output that doesn't re-trigger
+    // the planner anyway). m:n tests use 1 to keep the test fixed at
+    // one merge: outputs from the first merge land at
+    // `num_merge_ops = 1` and the planner refuses to merge them again
+    // because they've hit the policy's ceiling.
+    let merge_policy = Arc::new(ConstWriteAmplificationParquetMergePolicy::new(
+        ParquetMergePolicyConfig {
+            merge_factor: 3,
+            max_merge_factor: 3,
+            max_merge_ops,
+            target_split_size_bytes: 256 * 1024 * 1024,
+            maturation_period: Duration::from_secs(3600),
+            max_finalize_merge_operations: 3,
+        },
+    ));
+
+    let writer_config =
+        ParquetWriterConfig::default().with_row_group_size(TEST_OUTPUT_ROW_GROUP_SIZE);
+
+    ParquetMergePipelineParams {
+        index_uid: quickwit_proto::types::IndexUid::for_test("test-merge-index-multi", 0),
+        indexing_directory: TempDirectory::for_test(),
+        metastore,
+        storage: ram_storage,
+        merge_policy,
+        merge_scheduler_service: universe.get_or_spawn_one(),
+        max_concurrent_split_uploads: 4,
+        event_broker: EventBroker::default(),
+        writer_config,
+        use_streaming_engine,
+        target_split_size_bytes,
+    }
+}
+
+// ---------------------------------------------------------------------------
+// Shared assertions: rich n=1 case
+// ---------------------------------------------------------------------------
+
+/// Asserts the post-merge state for the canonical three-input,
+/// three-metric, single-output fixture used by this module. Both engines
+/// must produce a merged split that passes every check below.
+async fn assert_three_input_three_metric_single_output_correct(
+    staged_metadata: &Arc<std::sync::Mutex<Vec<ParquetSplitMetadata>>>,
+    replaced_ids: &Arc<std::sync::Mutex<Vec<String>>>,
+    ram_storage: &Arc<dyn Storage>,
+) {
+    // Publisher should replace exactly the three input splits.
+    let mut replaced_sorted: Vec<String> = replaced_ids.lock().unwrap().clone();
+    replaced_sorted.sort();
+    assert_eq!(
+        replaced_sorted,
+        vec![
+            "split-x".to_string(),
+            "split-y".to_string(),
+            "split-z".to_string(),
+        ],
+        "publish must replace all three input splits"
+    );
+
+    let staged = staged_metadata.lock().unwrap().clone();
+    assert_eq!(
+        staged.len(),
+        1,
+        "exactly one merged split should be staged for n=1 merge"
+    );
+    let merged_meta = &staged[0];
+
+    // MC-1: every input row is in the output.
+    assert_eq!(
+        merged_meta.num_rows, TOTAL_INPUT_ROWS,
+        "merged split must contain all {TOTAL_INPUT_ROWS} input rows"
+    );
+
+    // Time range is the union of inputs: ts_min = 100 (input-x's first row),
+    // ts_max = 139 (input-z's last row), end_secs is max + 1.
+    assert_eq!(
+        merged_meta.time_range.start_secs, 100,
+        "time_range.start should be the min timestamp across all inputs"
+    );
+    assert_eq!(
+        merged_meta.time_range.end_secs, 140,
+        "time_range.end should be max timestamp + 1"
+    );
+
+    // All three metric names must survive.
+    let expected_metrics: HashSet<String> = [METRIC_AAA, METRIC_BBB, METRIC_CCC]
+        .iter()
+        .map(|s| s.to_string())
+        .collect();
+    assert_eq!(
+        merged_meta.metric_names, expected_metrics,
+        "merged split must contain all three metric names from the input set"
+    );
+
+    // First merge over level-0 inputs.
+    assert_eq!(
+        merged_meta.num_merge_ops, 1,
+        "first merge must set num_merge_ops to 1"
+    );
+
+    let table_config = TableConfig::default();
+    assert_eq!(
+        merged_meta.sort_fields,
+        table_config.effective_sort_fields(),
+        "sort_fields must be preserved through merge"
+    );
+
+    // Row keys + zonemaps must be populated.
+    assert!(
+        merged_meta
+            .row_keys_proto
+            .as_ref()
+            .is_some_and(|b| !b.is_empty()),
+        "row_keys_proto must be present and non-empty"
+    );
+    assert!(
+        merged_meta.zonemap_regexes.contains_key("metric_name"),
+        "zonemap_regexes must include metric_name; got keys: {:?}",
+        merged_meta.zonemap_regexes.keys().collect::<Vec<_>>()
+    );
+
+    // Services across all three inputs must surface in low_cardinality_tags.
+    let service_tags = &merged_meta.low_cardinality_tags["service"];
+    let expected_services: HashSet<String> =
+        ["web", "api", "db"].iter().map(|s| s.to_string()).collect();
+    assert_eq!(
+        *service_tags, expected_services,
+        "service tag values must include all three inputs"
+    );
+
+    // Read the merged file back and verify content.
+    let merged_bytes = ram_storage
+        .get_all(Path::new(&merged_meta.parquet_file))
+        .await
+        .expect("merged parquet file must exist in storage");
+    let merged_batch = read_parquet_from_bytes(&merged_bytes);
+
+    assert_eq!(
+        merged_batch.num_rows() as u64,
+        TOTAL_INPUT_ROWS,
+        "merged Parquet file row count must match expected total"
+    );
+
+    // sorted_series is monotonically non-decreasing — the fundamental
+    // post-merge invariant.
+    let sorted_series = extract_binary_column(&merged_batch, SORTED_SERIES_COLUMN);
+    for i in 1..sorted_series.len() {
+        assert!(
+            sorted_series[i] >= sorted_series[i - 1],
+            "sorted_series must be non-decreasing at row {i}: {:?} < {:?}",
+            sorted_series[i],
+            sorted_series[i - 1],
+        );
+    }
+
+    // Sort-order semantics: with `metric_name` as the leading sort column
+    // (ascending), every `aaa.alpha` row precedes every `bbb.beta` row,
+    // and every `bbb.beta` row precedes every `ccc.gamma` row.
+    let metric_name_vec = extract_string_column(&merged_batch, "metric_name");
+    let last_aaa = metric_name_vec
+        .iter()
+        .rposition(|n| n == METRIC_AAA)
+        .expect("aaa rows must be present");
+    let first_bbb = metric_name_vec
+        .iter()
+        .position(|n| n == METRIC_BBB)
+        .expect("bbb rows must be present");
+    let last_bbb = metric_name_vec
+        .iter()
+        .rposition(|n| n == METRIC_BBB)
+        .expect("bbb rows must be present");
+    let first_ccc = metric_name_vec
+        .iter()
+        .position(|n| n == METRIC_CCC)
+        .expect("ccc rows must be present");
+    assert!(
+        last_aaa < first_bbb,
+        "all aaa rows must precede all bbb rows ({last_aaa} >= {first_bbb})",
+    );
+    assert!(
+        last_bbb < first_ccc,
+        "all bbb rows must precede all ccc rows ({last_bbb} >= {first_ccc})",
+    );
+
+    // Per-metric row count = 3 inputs * ROWS_PER_METRIC_PER_INPUT each.
+    let expected_per_metric = (3 * ROWS_PER_METRIC_PER_INPUT) as u64;
+    for metric in [METRIC_AAA, METRIC_BBB, METRIC_CCC] {
+        let count = metric_name_vec.iter().filter(|n| *n == metric).count() as u64;
+        assert_eq!(
+            count, expected_per_metric,
+            "metric {metric} must have exactly {expected_per_metric} rows; got {count}",
+        );
+    }
+
+    // All timestamps from the union [100, 140) must appear, and the metadata
+    // time_range must match the actual extrema.
+    let timestamps_in_file = extract_u64_column(&merged_batch, "timestamp_secs");
+    assert_eq!(
+        timestamps_in_file.len() as u64,
+        TOTAL_INPUT_ROWS,
+        "timestamp column must have one entry per row"
+    );
+    let ts_min = *timestamps_in_file.iter().min().unwrap();
+    let ts_max = *timestamps_in_file.iter().max().unwrap();
+    assert_eq!(ts_min, 100, "min timestamp must be 100");
+    assert_eq!(ts_max, 139, "max timestamp must be 139");
+    assert_eq!(
+        merged_meta.time_range.start_secs, ts_min,
+        "metadata time_range.start must match actual min timestamp"
+    );
+    assert_eq!(
+        merged_meta.time_range.end_secs,
+        ts_max + 1,
+        "metadata time_range.end must be max timestamp + 1"
+    );
+
+    // Service / host tag values from every input must appear in the file.
+    let services_in_file: HashSet<String> = extract_string_column(&merged_batch, "service")
+        .into_iter()
+        .collect();
+    assert_eq!(
+        services_in_file, expected_services,
+        "service column must contain all three input values"
+    );
+    let hosts_in_file: HashSet<String> = extract_string_column(&merged_batch, "host")
+        .into_iter()
+        .collect();
+    let expected_hosts: HashSet<String> =
+        ["h1", "h2", "h3"].iter().map(|s| s.to_string()).collect();
+    assert_eq!(
+        hosts_in_file, expected_hosts,
+        "host column must contain all three input values"
+    );
+}
+
+// ---------------------------------------------------------------------------
+// Shared assertions: m:n case (n > 1)
+// ---------------------------------------------------------------------------
+
+/// Asserts the post-merge state for the canonical three-input fixture when
+/// the merge produced **more than one output**. Both engines must satisfy
+/// the m:n contract:
+///
+/// - Replacement covers all three inputs.
+/// - The pipeline staged at least two output splits (proves splitting happened).
+/// - The sum of per-output row counts equals the total input row count.
+/// - Each output is internally sorted on `sorted_series`.
+/// - Across outputs, the `sorted_series` partition is **disjoint** (no two outputs share any
+///   `sorted_series` value — the merge engine splits at series boundaries, never inside).
+/// - The union of metric_names / services across outputs covers the full input set.
+/// - Every output declares `num_merge_ops = 1` (first merge over level-0 inputs) and has
+///   `row_keys_proto` + `metric_name` zonemap regex populated.
+async fn assert_three_input_three_metric_multi_output_correct(
+    staged_metadata: &Arc<std::sync::Mutex<Vec<ParquetSplitMetadata>>>,
+    replaced_ids: &Arc<std::sync::Mutex<Vec<String>>>,
+    ram_storage: &Arc<dyn Storage>,
+    expected_input_split_ids: &[&str],
+) {
+    let mut replaced_sorted: Vec<String> = replaced_ids.lock().unwrap().clone();
+    replaced_sorted.sort();
+    let mut expected_sorted: Vec<String> = expected_input_split_ids
+        .iter()
+        .map(|s| s.to_string())
+        .collect();
+    expected_sorted.sort();
+    assert_eq!(
+        replaced_sorted, expected_sorted,
+        "publish must replace all three input splits",
+    );
+
+    let staged = staged_metadata.lock().unwrap().clone();
+    assert!(
+        staged.len() >= 2,
+        "m:n merge must produce at least two outputs; got {}",
+        staged.len()
+    );
+
+    let total_output_rows: u64 = staged.iter().map(|s| s.num_rows).sum();
+    assert_eq!(
+        total_output_rows, TOTAL_INPUT_ROWS,
+        "sum of output row counts must equal total input rows",
+    );
+
+    // Each output internally sorted on sorted_series; collect ranges for
+    // the disjointness check across outputs.
+    let mut output_series_ranges: Vec<(Vec<u8>, Vec<u8>, String)> =
+        Vec::with_capacity(staged.len());
+    for meta in &staged {
+        let bytes = ram_storage
+            .get_all(Path::new(&meta.parquet_file))
+            .await
+            .expect("output parquet file must exist in storage");
+        let batch = read_parquet_from_bytes(&bytes);
+        assert_eq!(
+            batch.num_rows() as u64,
+            meta.num_rows,
+            "output {} row count {} disagrees with metadata num_rows {}",
+            meta.parquet_file,
+            batch.num_rows(),
+            meta.num_rows,
+        );
+        let series = extract_binary_column(&batch, SORTED_SERIES_COLUMN);
+        assert!(
+            !series.is_empty(),
+            "every output must have at least one row (empty outputs should be dropped by the \
+             engine)"
+        );
+        for i in 1..series.len() {
+            assert!(
+                series[i] >= series[i - 1],
+                "output {} sorted_series not monotone at row {i}",
+                meta.parquet_file,
+            );
+        }
+        output_series_ranges.push((
+            series.first().unwrap().clone(),
+            series.last().unwrap().clone(),
+            meta.parquet_file.clone(),
+        ));
+    }
+
+    // Sort outputs by min_series for pairwise disjointness comparison.
+    output_series_ranges.sort_by(|a, b| a.0.cmp(&b.0));
+    for window in output_series_ranges.windows(2) {
+        let (_, left_max, left_file) = &window[0];
+        let (right_min, _, right_file) = &window[1];
+        assert!(
+            left_max < right_min,
+            "outputs {} and {} overlap on sorted_series: left max = {:?}, right min = {:?}",
+            left_file,
+            right_file,
+            left_max,
+            right_min,
+        );
+    }
+
+    let union_metrics: HashSet<String> = staged
+        .iter()
+        .flat_map(|s| s.metric_names.iter().cloned())
+        .collect();
+    let expected_metrics: HashSet<String> = [METRIC_AAA, METRIC_BBB, METRIC_CCC]
+        .iter()
+        .map(|s| s.to_string())
+        .collect();
+    assert_eq!(
+        union_metrics, expected_metrics,
+        "union of output metric_names must equal the full input set",
+    );
+
+    let union_services: HashSet<String> = staged
+        .iter()
+        .flat_map(|s| {
+            s.low_cardinality_tags
+                .get("service")
+                .into_iter()
+                .flat_map(|set| set.iter().cloned())
+        })
+        .collect();
+    let expected_services: HashSet<String> =
+        ["web", "api", "db"].iter().map(|s| s.to_string()).collect();
+    assert_eq!(
+        union_services, expected_services,
+        "union of output services must equal the full input set",
+    );
+
+    for meta in &staged {
+        assert_eq!(
+            meta.num_merge_ops, 1,
+            "output {} num_merge_ops must be 1 for the first merge",
+            meta.parquet_file,
+        );
+        assert!(
+            meta.row_keys_proto.as_ref().is_some_and(|b| !b.is_empty()),
+            "output {} missing row_keys_proto",
+            meta.parquet_file,
+        );
+        assert!(
+            meta.zonemap_regexes.contains_key("metric_name"),
+            "output {} missing metric_name zonemap regex",
+            meta.parquet_file,
+        );
+    }
+}
+
+// ---------------------------------------------------------------------------
+// Tests
+// ---------------------------------------------------------------------------
+
+/// Run a merge pipeline over the canonical three-input multi-metric fixture
+/// and apply the supplied assertion. Shared between the two n=1 engine
+/// variants below. `target_split_size_bytes` drives the executor's
+/// `num_outputs` calculation; pass a value larger than the total input size
+/// to get n=1. The m:n tests use `run_three_input_prefix_aligned_merge`.
+async fn run_three_input_multi_metric_merge<F, Fut>(
+    use_streaming_engine: bool,
+    target_split_size_bytes: u64,
+    assertions: F,
+) where
+    F: FnOnce(
+        Arc<std::sync::Mutex<Vec<ParquetSplitMetadata>>>,
+        Arc<std::sync::Mutex<Vec<String>>>,
+        Arc<dyn Storage>,
+    ) -> Fut,
+    Fut: std::future::Future<Output = ()>,
+{
+    quickwit_common::setup_logging_for_tests();
+
+    let universe = Universe::with_accelerated_time();
+    let temp_dir = tempfile::tempdir().unwrap();
+    let ram_storage: Arc<dyn Storage> = Arc::new(RamStorage::default());
+
+    let (_paths, splits) = build_three_multi_metric_inputs(temp_dir.path(), &ram_storage).await;
+
+    let capture = mount_capturing_metastore();
+    let params = make_pipeline_params(
+        &universe,
+        capture.metastore.clone(),
+        ram_storage.clone(),
+        use_streaming_engine,
+        target_split_size_bytes,
+        5, // max_merge_ops: n=1 tests don't cascade, so leave plenty of headroom
+    );
+
+    let pipeline = ParquetMergePipeline::new(params, Some(splits), universe.spawn_ctx());
+    let (_pipeline_mailbox, _pipeline_handle) = universe.spawn_builder().spawn(pipeline);
+
+    wait_until_predicate(
+        || {
+            let publish_called = capture.publish_called.clone();
+            async move { publish_called.load(Ordering::SeqCst) }
+        },
+        Duration::from_secs(60),
+        Duration::from_millis(100),
+    )
+    .await
+    .expect("timed out waiting for merge publish");
+
+    assertions(capture.staged_metadata, capture.replaced_ids, ram_storage).await;
+
+    universe.assert_quit().await;
+}
+
+/// In-memory engine, n=1 output, three multi-metric inputs with overlapping
+/// timestamps and timeseries IDs. Verifies the merge correctly interleaves
+/// rows across all three inputs and produces a single sorted output with
+/// the full input rowset.
+#[tokio::test]
+async fn test_multi_metric_three_input_single_output_in_memory_engine() {
+    run_three_input_multi_metric_merge(
+        false, // use_streaming_engine
+        // Larger than total input size — forces num_outputs = 1.
+        256 * 1024 * 1024,
+        |staged, replaced, storage| async move {
+            assert_three_input_three_metric_single_output_correct(&staged, &replaced, &storage)
+                .await;
+        },
+    )
+    .await;
+}
+
+/// Streaming engine, n=1 output. Same fixture as the in-memory variant;
+/// must produce a row-content-equivalent output. Additionally asserts
+/// `PEAK_BODY_COL_PAGE_CACHE_LEN > 0` to confirm the streaming engine
+/// actually ran (the in-memory engine never writes to that atomic).
+#[allow(
+    clippy::await_holding_lock,
+    reason = "see ms7_serial_lock rationale: std::sync::Mutex on a single-threaded tokio runtime"
+)]
+#[tokio::test]
+async fn test_multi_metric_three_input_single_output_streaming_engine() {
+    use quickwit_parquet_engine::merge::streaming::{
+        PEAK_BODY_COL_PAGE_CACHE_LEN, ms7_serial_lock,
+    };
+
+    let _ms7_guard = ms7_serial_lock();
+    PEAK_BODY_COL_PAGE_CACHE_LEN.store(0, Ordering::Relaxed);
+
+    run_three_input_multi_metric_merge(
+        true, // use_streaming_engine
+        256 * 1024 * 1024,
+        |staged, replaced, storage| async move {
+            assert!(
+                PEAK_BODY_COL_PAGE_CACHE_LEN.load(Ordering::Relaxed) > 0,
+                "streaming engine did not write to PEAK_BODY_COL_PAGE_CACHE_LEN — routing may \
+                 have silently fallen back to the in-memory engine",
+            );
+            assert_three_input_three_metric_single_output_correct(&staged, &replaced, &storage)
+                .await;
+        },
+    )
+    .await;
+}
+
+/// Run a merge pipeline over the canonical three-input fixture with
+/// **prefix-aligned multi-row-group inputs** (`rg_partition_prefix_len = 1`,
+/// one row group per metric_name). Drives the m:n bonus tests.
+async fn run_three_input_prefix_aligned_merge<F, Fut>(
+    use_streaming_engine: bool,
+    target_split_size_bytes: u64,
+    assertions: F,
+) where
+    F: FnOnce(
+        Arc<std::sync::Mutex<Vec<ParquetSplitMetadata>>>,
+        Arc<std::sync::Mutex<Vec<String>>>,
+        Arc<dyn Storage>,
+    ) -> Fut,
+    Fut: std::future::Future<Output = ()>,
+{
+    quickwit_common::setup_logging_for_tests();
+
+    let universe = Universe::with_accelerated_time();
+    let temp_dir = tempfile::tempdir().unwrap();
+    let ram_storage: Arc<dyn Storage> = Arc::new(RamStorage::default());
+
+    let (_paths, splits) =
+        build_three_prefix_aligned_multi_metric_inputs(temp_dir.path(), &ram_storage).await;
+
+    let capture = mount_capturing_metastore();
+    let params = make_pipeline_params(
+        &universe,
+        capture.metastore.clone(),
+        ram_storage.clone(),
+        use_streaming_engine,
+        target_split_size_bytes,
+        // max_merge_ops = 1: outputs from the first (and only) merge land
+        // at num_merge_ops = 1 and the planner refuses to merge them again,
+        // pinning this test to exactly one merge regardless of how many
+        // outputs the engine chose to produce.
+        1,
+    );
+
+    let pipeline = ParquetMergePipeline::new(params, Some(splits), universe.spawn_ctx());
+    let (_pipeline_mailbox, _pipeline_handle) = universe.spawn_builder().spawn(pipeline);
+
+    wait_until_predicate(
+        || {
+            let publish_called = capture.publish_called.clone();
+            async move { publish_called.load(Ordering::SeqCst) }
+        },
+        Duration::from_secs(60),
+        Duration::from_millis(100),
+    )
+    .await
+    .expect("timed out waiting for merge publish");
+
+    assertions(capture.staged_metadata, capture.replaced_ids, ram_storage).await;
+
+    universe.assert_quit().await;
+}
+
+/// **Bonus** test, in-memory engine: three multi-metric inputs each with
+/// `rg_partition_prefix_len = 1` (one row group per metric_name), merged
+/// with a small `target_split_size_bytes` that forces the executor to ask
+/// the engine for `num_outputs > 1`. Exercises the previously-impossible
+/// pipeline-level m:n merge path now that the executor's hardcoded
+/// `num_outputs = 1` is gone. Verifies the multi-output contract:
+/// sum-equals-total, internal monotonicity, inter-output disjointness on
+/// `sorted_series`, and union-equals-full-set on metrics/services.
+#[tokio::test]
+async fn test_prefix_aligned_multi_metric_three_input_multi_output_in_memory_engine() {
+    run_three_input_prefix_aligned_merge(
+        false, // use_streaming_engine
+        // Smaller than per-input size — guarantees num_outputs ≥ 2. The
+        // engine clamps to available sorted_series boundaries (~60 in
+        // this fixture: 3 metrics × 20 timeseries each), well above 2.
+        500,
+        |staged, replaced, storage| async move {
+            assert_three_input_three_metric_multi_output_correct(
+                &staged,
+                &replaced,
+                &storage,
+                &["split-px", "split-py", "split-pz"],
+            )
+            .await;
+        },
+    )
+    .await;
+}
+
+/// **Bonus** test, streaming engine: same fixture and contract as the
+/// in-memory variant. Additionally asserts
+/// `PEAK_BODY_COL_PAGE_CACHE_LEN > 0` to confirm the streaming engine
+/// actually ran. With prefix-aligned inputs the streaming engine reads
+/// each input's row groups (one per metric_name) through the prefix-aware
+/// `StreamingParquetReader` path rather than the `LegacyInputAdapter`
+/// route, since these inputs do not require promotion
+/// (`target_prefix_len_override` is `None` for regular merges).
+#[allow(
+    clippy::await_holding_lock,
+    reason = "see ms7_serial_lock rationale: std::sync::Mutex on a single-threaded tokio runtime"
+)]
+#[tokio::test]
+async fn test_prefix_aligned_multi_metric_three_input_multi_output_streaming_engine() {
+    use quickwit_parquet_engine::merge::streaming::{
+        PEAK_BODY_COL_PAGE_CACHE_LEN, ms7_serial_lock,
+    };
+
+    let _ms7_guard = ms7_serial_lock();
+    PEAK_BODY_COL_PAGE_CACHE_LEN.store(0, Ordering::Relaxed);
+
+    run_three_input_prefix_aligned_merge(
+        true, // use_streaming_engine
+        500,
+        |staged, replaced, storage| async move {
+            assert!(
+                PEAK_BODY_COL_PAGE_CACHE_LEN.load(Ordering::Relaxed) > 0,
+                "streaming engine did not write to PEAK_BODY_COL_PAGE_CACHE_LEN — routing may \
+                 have silently fallen back to the in-memory engine",
+            );
+            assert_three_input_three_metric_multi_output_correct(
+                &staged,
+                &replaced,
+                &storage,
+                &["split-px", "split-py", "split-pz"],
+            )
+            .await;
+        },
+    )
+    .await;
+}
diff --git a/quickwit/quickwit-indexing/src/actors/parquet_pipeline/parquet_merge_pipeline_sketch_test.rs b/quickwit/quickwit-indexing/src/actors/parquet_pipeline/parquet_merge_pipeline_sketch_test.rs
index 0ac1f5e69c2..f54f5ecd39e 100644
--- a/quickwit/quickwit-indexing/src/actors/parquet_pipeline/parquet_merge_pipeline_sketch_test.rs
+++ b/quickwit/quickwit-indexing/src/actors/parquet_pipeline/parquet_merge_pipeline_sketch_test.rs
@@ -172,6 +172,8 @@ async fn test_merge_pipeline_end_to_end_sketches() {
         max_concurrent_split_uploads: 4,
         event_broker: EventBroker::default(),
         writer_config: ParquetWriterConfig::default(),
+        use_streaming_engine: false,
+        target_split_size_bytes: 256 * 1024 * 1024,
     };
 
     let initial_splits = vec![meta_a, meta_b];
diff --git a/quickwit/quickwit-indexing/src/actors/parquet_pipeline/parquet_merge_pipeline_test.rs b/quickwit/quickwit-indexing/src/actors/parquet_pipeline/parquet_merge_pipeline_test.rs
index 547060668b0..000ee5cb355 100644
--- a/quickwit/quickwit-indexing/src/actors/parquet_pipeline/parquet_merge_pipeline_test.rs
+++ b/quickwit/quickwit-indexing/src/actors/parquet_pipeline/parquet_merge_pipeline_test.rs
@@ -405,6 +405,8 @@ async fn test_merge_pipeline_end_to_end() {
         max_concurrent_split_uploads: 4,
         event_broker: EventBroker::default(),
         writer_config: ParquetWriterConfig::default(),
+        use_streaming_engine: false,
+        target_split_size_bytes: 256 * 1024 * 1024,
     };
 
     let initial_splits = vec![meta_a, meta_b];
@@ -424,6 +426,43 @@ async fn test_merge_pipeline_end_to_end() {
     .await
     .expect("timed out waiting for merge publish");
 
+    assert_cpu_mem_merge_outputs_correct(&staged_metadata, &replaced_ids, &ram_storage).await;
+
+    universe.assert_quit().await;
+}
+
+/// Asserts the post-merge state captured from the metastore mock and the
+/// Parquet file in storage for the canonical two-input fixture used by
+/// the end-to-end tests:
+///
+/// - `split-a`: 50 rows of `cpu.usage`, ts 100..150, service `web`, host `host-1`
+/// - `split-b`: 50 rows of `mem.usage`, ts 200..250, service `api`, host `host-2`
+///
+/// Both engines (in-memory and streaming) must produce a merged split that
+/// passes every check below — that is the **contract** of a regular merge,
+/// independent of which engine ran. Driven by both
+/// `test_merge_pipeline_end_to_end` (in-memory engine, `use_streaming_engine: false`)
+/// and `test_merge_pipeline_end_to_end_with_streaming_engine_flag`
+/// (streaming engine, `use_streaming_engine: true`).
+///
+/// Verifies, in order:
+/// - **Step 5**: `replaced_split_ids = [split-a, split-b]`.
+/// - **Step 6**: staged `ParquetSplitMetadata`: `num_rows = 100`, `time_range = [100, 250)`,
+///   `metric_names = {cpu.usage, mem.usage}`, `num_merge_ops = 1`, `sort_fields` matches the table
+///   config, `row_keys_proto` present and non-empty, `zonemap_regexes` contains `metric_name`,
+///   `low_cardinality_tags["service"] = {web, api}`.
+/// - **Step 7**: merged Parquet file content: row count, all 100 timestamps match the expected
+///   values and the metadata `time_range`, both service and host values survive, `sorted_series`
+///   is monotonically non-decreasing, and `cpu.usage` rows precede `mem.usage` rows (the global
+///   sort order).
+/// - **Step 8**: Parquet KV metadata: `qh.sort_fields`, `qh.num_merge_ops`, `qh.row_keys`
+///   (non-empty), `qh.zonemap_regexes` parses as JSON and contains `metric_name`, and the KV
+///   values cross-validate against the staged metadata.
+async fn assert_cpu_mem_merge_outputs_correct(
+    staged_metadata: &Arc<std::sync::Mutex<Vec<ParquetSplitMetadata>>>,
+    replaced_ids: &Arc<std::sync::Mutex<Vec<String>>>,
+    ram_storage: &Arc<dyn Storage>,
+) {
     // --- Step 5: Verify replaced_split_ids ---
 
     let mut replaced_sorted: Vec<String> = replaced_ids.lock().unwrap().clone();
@@ -598,8 +637,8 @@ async fn test_merge_pipeline_end_to_end() {
             sorted_series[i] >= sorted_series[i - 1],
             "sorted_series must be monotonically non-decreasing at row {}: {:?} < {:?}",
             i,
-            &sorted_series[i],
-            &sorted_series[i - 1]
+            sorted_series[i],
+            sorted_series[i - 1]
         );
     }
 
@@ -663,6 +702,174 @@ async fn test_merge_pipeline_end_to_end() {
         merged_meta.zonemap_regexes, zonemaps_parsed,
         "metadata zonemap_regexes must match Parquet qh.zonemap_regexes"
     );
+}
+
+/// End-to-end production-path test for the YAML-flag-gated streaming
+/// engine wire-in. Runs the full actor chain — planner → downloader →
+/// executor → uploader → publisher — with
+/// `ParquetMergePipelineParams::use_streaming_engine = true` and asserts
+/// the merge completed correctly.
+///
+/// The executor's branch is `if is_promotion || use_streaming_engine
+/// { streaming } else { in-memory }`. With promotion off and
+/// `use_streaming_engine = true`, the streaming engine is the only
+/// reachable path — there is no silent fallback. The test confirms
+/// this by:
+///
+/// 1. Asserting the merge published with the right `replaced_split_ids` (the merge actually ran
+///    end-to-end through the executor).
+/// 2. Reading `PEAK_BODY_COL_PAGE_CACHE_LEN` and asserting it is non-zero (the streaming engine
+///    writes to this atomic on every body-col page assembly; if the in-memory engine had run
+///    instead the counter would stay at zero).
+/// 3. Asserting row count and metric names on the output match the inputs (the streaming engine
+///    produces correct results, not just "something").
+#[allow(
+    clippy::await_holding_lock,
+    reason = "the lock is `std::sync::Mutex` and the `#[tokio::test]` runtime is single-threaded, \
+              so holding the guard across await won't deadlock another thread — see \
+              `ms7_serial_lock` rationale"
+)]
+#[tokio::test]
+async fn test_merge_pipeline_end_to_end_with_streaming_engine_flag() {
+    use std::sync::atomic::Ordering;
+
+    use quickwit_parquet_engine::merge::streaming::{
+        PEAK_BODY_COL_PAGE_CACHE_LEN, ms7_serial_lock,
+    };
+
+    // Serialise against every other test in this binary that runs a
+    // streaming merge: they all touch the same process-global atomic,
+    // and a concurrent `store(0)` would race our load. The lock is
+    // re-exported from `quickwit_parquet_engine` under the `testsuite`
+    // feature for exactly this cross-crate case.
+    let _ms7_guard = ms7_serial_lock();
+    PEAK_BODY_COL_PAGE_CACHE_LEN.store(0, Ordering::Relaxed);
+
+    quickwit_common::setup_logging_for_tests();
+
+    let universe = Universe::with_accelerated_time();
+    let temp_dir = tempfile::tempdir().unwrap();
+    let ram_storage: Arc<dyn Storage> = Arc::new(RamStorage::default());
+
+    let batch_a = create_custom_test_batch("cpu.usage", 100, 50, "web", "host-1");
+    let meta_a = make_test_split_metadata("split-a", 50, 0, 100, "cpu.usage");
+    let size_a = write_test_parquet_file(temp_dir.path(), "split-a.parquet", &batch_a, &meta_a);
+    let meta_a = {
+        let mut m = meta_a;
+        m.size_bytes = size_a;
+        m.parquet_file = "split-a.parquet".to_string();
+        m
+    };
+    let batch_b = create_custom_test_batch("mem.usage", 200, 50, "api", "host-2");
+    let meta_b = make_test_split_metadata("split-b", 50, 0, 200, "mem.usage");
+    let size_b = write_test_parquet_file(temp_dir.path(), "split-b.parquet", &batch_b, &meta_b);
+    let meta_b = {
+        let mut m = meta_b;
+        m.size_bytes = size_b;
+        m.parquet_file = "split-b.parquet".to_string();
+        m
+    };
+
+    let content_a = std::fs::read(temp_dir.path().join("split-a.parquet")).unwrap();
+    ram_storage
+        .put(Path::new("split-a.parquet"), Box::new(content_a))
+        .await
+        .unwrap();
+    let content_b = std::fs::read(temp_dir.path().join("split-b.parquet")).unwrap();
+    ram_storage
+        .put(Path::new("split-b.parquet"), Box::new(content_b))
+        .await
+        .unwrap();
+
+    let mut mock_metastore = MockMetastoreService::new();
+    let staged_metadata: Arc<std::sync::Mutex<Vec<ParquetSplitMetadata>>> =
+        Arc::new(std::sync::Mutex::new(Vec::new()));
+    let staged_metadata_clone = staged_metadata.clone();
+    mock_metastore.expect_stage_metrics_splits().returning(
+        move |request: StageMetricsSplitsRequest| {
+            let splits = request
+                .deserialize_splits_metadata()
+                .expect("failed to deserialize staged metadata");
+            staged_metadata_clone.lock().unwrap().extend(splits);
+            Ok(EmptyResponse {})
+        },
+    );
+    let publish_called = Arc::new(std::sync::atomic::AtomicBool::new(false));
+    let publish_called_clone = publish_called.clone();
+    let replaced_ids = Arc::new(std::sync::Mutex::new(Vec::<String>::new()));
+    let replaced_ids_clone = replaced_ids.clone();
+    mock_metastore
+        .expect_publish_metrics_splits()
+        .returning(move |request| {
+            replaced_ids_clone
+                .lock()
+                .unwrap()
+                .extend(request.replaced_split_ids.clone());
+            publish_called_clone.store(true, Ordering::SeqCst);
+            Ok(EmptyResponse {})
+        });
+    let metastore = MetastoreServiceClient::from_mock(mock_metastore);
+
+    let merge_policy = Arc::new(ConstWriteAmplificationParquetMergePolicy::new(
+        ParquetMergePolicyConfig {
+            merge_factor: 2,
+            max_merge_factor: 2,
+            max_merge_ops: 5,
+            target_split_size_bytes: 256 * 1024 * 1024,
+            maturation_period: Duration::from_secs(3600),
+            max_finalize_merge_operations: 3,
+        },
+    ));
+
+    let params = ParquetMergePipelineParams {
+        index_uid: quickwit_proto::types::IndexUid::for_test("test-merge-index-streaming", 0),
+        indexing_directory: TempDirectory::for_test(),
+        metastore,
+        storage: ram_storage.clone(),
+        merge_policy,
+        merge_scheduler_service: universe.get_or_spawn_one(),
+        max_concurrent_split_uploads: 4,
+        event_broker: EventBroker::default(),
+        writer_config: ParquetWriterConfig::default(),
+        // This is the bit under test: regular merges must route through
+        // `execute_merge_operation`, not the in-memory fallback.
+        use_streaming_engine: true,
+        target_split_size_bytes: 256 * 1024 * 1024,
+    };
+
+    let initial_splits = vec![meta_a, meta_b];
+    let pipeline = ParquetMergePipeline::new(params, Some(initial_splits), universe.spawn_ctx());
+    let (_pipeline_mailbox, _pipeline_handle) = universe.spawn_builder().spawn(pipeline);
+
+    wait_until_predicate(
+        || {
+            let publish_called = publish_called.clone();
+            async move { publish_called.load(Ordering::SeqCst) }
+        },
+        Duration::from_secs(30),
+        Duration::from_millis(100),
+    )
+    .await
+    .expect("timed out waiting for streaming-engine merge publish");
+
+    // The streaming engine writes to `PEAK_BODY_COL_PAGE_CACHE_LEN` on
+    // every body-col page assembly; the in-memory engine never touches
+    // it. A non-zero post-merge value is direct evidence that the
+    // streaming engine ran the body-col path, which rules out a silent
+    // fallback to the in-memory engine.
+    let peak = PEAK_BODY_COL_PAGE_CACHE_LEN.load(Ordering::Relaxed);
+    assert!(
+        peak > 0,
+        "streaming engine did not write to PEAK_BODY_COL_PAGE_CACHE_LEN — routing may have \
+         silently fallen back to the in-memory engine",
+    );
+
+    // Same correctness contract as the in-memory variant: every check on
+    // the merged metadata, the Parquet file content, and the Parquet KV
+    // headers must hold regardless of which engine ran. This shared
+    // helper is the executable parity check between the two engines
+    // at the pipeline-integration level.
+    assert_cpu_mem_merge_outputs_correct(&staged_metadata, &replaced_ids, &ram_storage).await;
 
     universe.assert_quit().await;
 }
diff --git a/quickwit/quickwit-indexing/src/actors/parquet_pipeline/parquet_merge_pipeline_trace_conformance_test.rs b/quickwit/quickwit-indexing/src/actors/parquet_pipeline/parquet_merge_pipeline_trace_conformance_test.rs
index 20ab86cfc02..e5a25a9de7a 100644
--- a/quickwit/quickwit-indexing/src/actors/parquet_pipeline/parquet_merge_pipeline_trace_conformance_test.rs
+++ b/quickwit/quickwit-indexing/src/actors/parquet_pipeline/parquet_merge_pipeline_trace_conformance_test.rs
@@ -628,6 +628,8 @@ async fn test_trace_conformance_normal_path() {
         max_concurrent_split_uploads: 4,
         event_broker: EventBroker::default(),
         writer_config: ParquetWriterConfig::default(),
+        use_streaming_engine: false,
+        target_split_size_bytes: 256 * 1024 * 1024,
     };
 
     let pipeline =
@@ -747,6 +749,8 @@ async fn test_trace_conformance_crash_mid_cascade() {
         max_concurrent_split_uploads: 4,
         event_broker: EventBroker::default(),
         writer_config: ParquetWriterConfig::default(),
+        use_streaming_engine: false,
+        target_split_size_bytes: 256 * 1024 * 1024,
     };
 
     let pipeline =
diff --git a/quickwit/quickwit-parquet-engine/src/merge/metadata_aggregation.rs b/quickwit/quickwit-parquet-engine/src/merge/metadata_aggregation.rs
index ddf254f0865..0805f6ef35c 100644
--- a/quickwit/quickwit-parquet-engine/src/merge/metadata_aggregation.rs
+++ b/quickwit/quickwit-parquet-engine/src/merge/metadata_aggregation.rs
@@ -40,10 +40,17 @@ use crate::split::{ParquetSplitId, ParquetSplitMetadata};
 /// # Preconditions
 ///
 /// All input splits must share the same kind, index_uid, partition_id,
-/// sort_fields, and window.
+/// sort_fields, and window. In the default case (`mixed_prefix_ok = false`)
+/// they must also share `rg_partition_prefix_len`. In legacy-promotion
+/// mode (`mixed_prefix_ok = true`) the prefix-len equality check is
+/// skipped because inputs come from different prefix buckets — the
+/// output's prefix_len is taken from the writer's KV stamp via
+/// `output.output_rg_partition_prefix_len` (CS-1), so the input-side
+/// equality is no longer load-bearing for the metastore record.
 pub fn merge_parquet_split_metadata(
     inputs: &[ParquetSplitMetadata],
     output: &MergeOutputFile,
+    mixed_prefix_ok: bool,
 ) -> Result<ParquetSplitMetadata> {
     if inputs.is_empty() {
         bail!("merge_parquet_split_metadata requires at least one input split");
@@ -93,10 +100,11 @@ pub fn merge_parquet_split_metadata(
                 first.window
             );
         }
-        if input.rg_partition_prefix_len != first.rg_partition_prefix_len {
+        if !mixed_prefix_ok && input.rg_partition_prefix_len != first.rg_partition_prefix_len {
             bail!(
                 "input {} has rg_partition_prefix_len {}, expected {} — splits with different \
-                 prefix lengths must not appear in the same merge",
+                 prefix lengths must not appear in the same regular merge (legacy-promotion \
+                 operations bypass this check)",
                 i,
                 input.rg_partition_prefix_len,
                 first.rg_partition_prefix_len
@@ -119,22 +127,18 @@ pub fn merge_parquet_split_metadata(
     let split_id = ParquetSplitId::generate(first.kind);
     let parquet_file = format!("{split_id}.parquet");
 
-    // `rg_partition_prefix_len` propagation rule: a single-row-group
-    // output vacuously satisfies any prefix claim (no boundary to
-    // misalign), so we keep the inputs' prefix. Multi-RG output with
-    // arbitrary row-count-driven boundaries (the only kind the current
-    // merge writer can produce) cannot honor a non-zero claim and must
-    // reset to 0. PR-6 (streaming column-major merge engine) will
-    // produce sort-prefix-aligned multi-RG output and propagate the
-    // prefix unconditionally.
-    //
-    // This must agree with the value the writer embeds in the file's
-    // `qh.rg_partition_prefix_len` KV — see `write_merge_outputs`.
-    let output_prefix_len = if output.num_row_groups <= 1 {
-        first.rg_partition_prefix_len
-    } else {
-        0
-    };
+    // CS-1: the metastore-recorded `rg_partition_prefix_len` must equal
+    // the value the writer embedded in the file's
+    // `qh.rg_partition_prefix_len` KV. Each writer makes its own
+    // decision (the legacy `merge/writer.rs` demotes to 0 on multi-RG
+    // output because its boundaries are row-count-driven; the streaming
+    // writer propagates the inputs' prefix unchanged because it splits
+    // at prefix transitions and verifies via `assert_unique_rg_prefix_keys`)
+    // and reports it via `MergeOutputFile.output_rg_partition_prefix_len`.
+    // We propagate that one source of truth — re-deriving here from
+    // `num_row_groups` would silently diverge from the streaming
+    // engine's prefix-aligned multi-RG output.
+    let output_prefix_len = output.output_rg_partition_prefix_len;
 
     // Data-dependent fields come from the MergeOutputFile (extracted from
     // this output's actual rows during the merge write pass).
@@ -212,11 +216,30 @@ mod tests {
         num_row_groups: usize,
         time_range: (u64, u64),
         metric_names: &[&str],
+    ) -> MergeOutputFile {
+        make_output_full_with_prefix(
+            num_rows,
+            size_bytes,
+            num_row_groups,
+            0,
+            time_range,
+            metric_names,
+        )
+    }
+
+    fn make_output_full_with_prefix(
+        num_rows: usize,
+        size_bytes: u64,
+        num_row_groups: usize,
+        output_rg_partition_prefix_len: u32,
+        time_range: (u64, u64),
+        metric_names: &[&str],
     ) -> MergeOutputFile {
         MergeOutputFile {
             path: PathBuf::from("/tmp/merged.parquet"),
             num_rows,
             num_row_groups,
+            output_rg_partition_prefix_len,
             size_bytes,
             row_keys_proto: Some(vec![0x08, 0x01]),
             zonemap_regexes: HashMap::from([("metric_name".to_string(), "cpu\\..*".to_string())]),
@@ -233,7 +256,7 @@ mod tests {
             make_test_split("s1", (1200, 2000), 0),
         ];
         let output = make_output(200, 9000);
-        let result = merge_parquet_split_metadata(&inputs, &output).unwrap();
+        let result = merge_parquet_split_metadata(&inputs, &output, false).unwrap();
 
         // Invariant fields come from inputs.
         assert_eq!(result.kind, ParquetSplitKind::Metrics);
@@ -252,7 +275,7 @@ mod tests {
             make_test_split("s1", (1200, 2000), 0),
         ];
         let output = make_output_with_metadata(200, 9000, (1000, 2000), &["cpu.usage", "mem.used"]);
-        let result = merge_parquet_split_metadata(&inputs, &output).unwrap();
+        let result = merge_parquet_split_metadata(&inputs, &output, false).unwrap();
 
         // Data-dependent fields come from the output, not inputs.
         assert_eq!(result.time_range.start_secs, 1000);
@@ -287,7 +310,7 @@ mod tests {
             .or_default()
             .insert("api".to_string());
 
-        let result = merge_parquet_split_metadata(&inputs, &output).unwrap();
+        let result = merge_parquet_split_metadata(&inputs, &output, false).unwrap();
 
         let service_values = result.low_cardinality_tags.get("service").unwrap();
         assert_eq!(service_values.len(), 2);
@@ -308,7 +331,7 @@ mod tests {
                 .insert(format!("host-{i}"));
         }
 
-        let result = merge_parquet_split_metadata(&inputs, &output).unwrap();
+        let result = merge_parquet_split_metadata(&inputs, &output, false).unwrap();
 
         assert!(result.high_cardinality_tag_keys.contains("host"));
         assert!(!result.low_cardinality_tags.contains_key("host"));
@@ -322,7 +345,7 @@ mod tests {
             make_test_split("s2", (1000, 2000), 2),
         ];
         let output = make_output(300, 12000);
-        let result = merge_parquet_split_metadata(&inputs, &output).unwrap();
+        let result = merge_parquet_split_metadata(&inputs, &output, false).unwrap();
 
         assert_eq!(result.num_merge_ops, 3); // max(2,2,2) + 1
     }
@@ -330,7 +353,7 @@ mod tests {
     #[test]
     fn test_empty_inputs_error() {
         let output = make_output(0, 0);
-        let result = merge_parquet_split_metadata(&[], &output);
+        let result = merge_parquet_split_metadata(&[], &output, false);
         assert!(result.is_err());
         assert!(
             result
@@ -347,7 +370,7 @@ mod tests {
         s1.kind = ParquetSplitKind::Sketches;
 
         let output = make_output(200, 9000);
-        let result = merge_parquet_split_metadata(&[s0, s1], &output);
+        let result = merge_parquet_split_metadata(&[s0, s1], &output, false);
         assert!(result.is_err());
     }
 
@@ -358,7 +381,7 @@ mod tests {
         s1.index_uid = "other-index:00000000000000000000000002".to_string();
 
         let output = make_output(200, 9000);
-        let result = merge_parquet_split_metadata(&[s0, s1], &output);
+        let result = merge_parquet_split_metadata(&[s0, s1], &output, false);
         assert!(result.is_err());
     }
 
@@ -369,7 +392,7 @@ mod tests {
         s1.partition_id = 99;
 
         let output = make_output(200, 9000);
-        let result = merge_parquet_split_metadata(&[s0, s1], &output);
+        let result = merge_parquet_split_metadata(&[s0, s1], &output, false);
         assert!(result.is_err());
     }
 
@@ -380,7 +403,7 @@ mod tests {
         s1.sort_fields = "different|schema/V2".to_string();
 
         let output = make_output(200, 9000);
-        let result = merge_parquet_split_metadata(&[s0, s1], &output);
+        let result = merge_parquet_split_metadata(&[s0, s1], &output, false);
         assert!(result.is_err());
     }
 
@@ -391,7 +414,7 @@ mod tests {
         s1.window = Some(2000..5600);
 
         let output = make_output(200, 9000);
-        let result = merge_parquet_split_metadata(&[s0, s1], &output);
+        let result = merge_parquet_split_metadata(&[s0, s1], &output, false);
         assert!(result.is_err());
     }
 
@@ -402,7 +425,7 @@ mod tests {
         s1.rg_partition_prefix_len = 1;
 
         let output = make_output(200, 9000);
-        let result = merge_parquet_split_metadata(&[s0, s1], &output);
+        let result = merge_parquet_split_metadata(&[s0, s1], &output, false);
         let err = result.expect_err("merge must reject mismatched prefix lengths");
         let msg = err.to_string();
         assert!(
@@ -412,38 +435,98 @@ mod tests {
     }
 
     #[test]
-    fn test_output_prefix_len_demoted_when_multi_rg() {
-        // The current merge writer rolls over RGs at row count, not at
-        // sort-prefix transitions. When the output ends up with > 1 RG,
-        // the boundaries are at arbitrary places and the inputs' prefix
-        // claim cannot be honored — the output's prefix must be 0.
+    fn test_output_prefix_len_carries_writers_value_when_demoted() {
+        // CS-1: the metastore-recorded value must match the writer's
+        // KV stamp. Legacy `merge/writer.rs` demotes to 0 when its
+        // row-count-driven RG layout produces multi-RG output and
+        // reports that demoted value on the `MergeOutputFile`. The
+        // metastore aggregator must propagate it as-is (NOT re-derive
+        // from inputs) so the metastore agrees with the file's KV.
         let mut s0 = make_test_split("s0", (1000, 2000), 0);
         let mut s1 = make_test_split("s1", (1000, 2000), 0);
         s0.rg_partition_prefix_len = 3;
         s1.rg_partition_prefix_len = 3;
 
-        let output = make_output_full(200, 9000, 2, (1000, 2000), &["cpu.usage"]);
-        let result = merge_parquet_split_metadata(&[s0, s1], &output).unwrap();
+        // num_row_groups = 2, and the writer reports the demoted prefix_len = 0
+        // (the legacy writer's choice for a row-count-driven multi-RG output).
+        let output = make_output_full_with_prefix(200, 9000, 2, 0, (1000, 2000), &["cpu.usage"]);
+        let result = merge_parquet_split_metadata(&[s0, s1], &output, false).unwrap();
         assert_eq!(result.rg_partition_prefix_len, 0);
     }
 
     #[test]
     fn test_output_prefix_len_preserved_when_single_rg() {
         // A single-RG output vacuously satisfies any prefix alignment
-        // claim (one RG, no boundary to misalign). Propagate the inputs'
-        // prefix so the merge output stays in the same compaction bucket
-        // as the inputs, instead of leaking into the prefix=0 bucket on
-        // every merge.
+        // claim (one RG, no boundary to misalign). The writer reports
+        // the inputs' prefix and the aggregator propagates it.
         let mut s0 = make_test_split("s0", (1000, 2000), 0);
         let mut s1 = make_test_split("s1", (1000, 2000), 0);
         s0.rg_partition_prefix_len = 3;
         s1.rg_partition_prefix_len = 3;
 
-        let output = make_output_full(200, 9000, 1, (1000, 2000), &["cpu.usage"]);
-        let result = merge_parquet_split_metadata(&[s0, s1], &output).unwrap();
+        let output = make_output_full_with_prefix(200, 9000, 1, 3, (1000, 2000), &["cpu.usage"]);
+        let result = merge_parquet_split_metadata(&[s0, s1], &output, false).unwrap();
         assert_eq!(result.rg_partition_prefix_len, 3);
     }
 
+    #[test]
+    fn test_mixed_prefix_ok_skips_input_equality_check() {
+        // Promotion mode: inputs come from different prefix buckets
+        // (e.g. one prefix_len=0 legacy + one prefix_len=2 aligned).
+        // With `mixed_prefix_ok = true` the aggregator must accept this
+        // and take the output prefix from the writer's stamp.
+        let mut legacy = make_test_split("s0", (1000, 2000), 0);
+        legacy.rg_partition_prefix_len = 0;
+        let mut aligned = make_test_split("s1", (1000, 2000), 0);
+        aligned.rg_partition_prefix_len = 2;
+
+        // Writer stamps prefix_len = 2 on the multi-RG output (streaming
+        // engine output that successfully promoted the legacy input).
+        let output = make_output_full_with_prefix(300, 12000, 3, 2, (1000, 2000), &["cpu.usage"]);
+
+        let result =
+            merge_parquet_split_metadata(&[legacy.clone(), aligned.clone()], &output, true)
+                .expect("mixed-prefix inputs must be accepted in promotion mode");
+        assert_eq!(
+            result.rg_partition_prefix_len, 2,
+            "output prefix matches the writer's stamp (CS-1)",
+        );
+
+        // Same inputs without the mixed_prefix_ok flag must still fail.
+        let strict = merge_parquet_split_metadata(&[legacy, aligned], &output, false);
+        let err = strict.expect_err("strict mode must reject mixed-prefix inputs");
+        assert!(
+            err.to_string().contains("rg_partition_prefix_len"),
+            "error should mention the prefix-len mismatch, got: {err}",
+        );
+    }
+
+    #[test]
+    fn test_output_prefix_len_preserved_on_multi_rg_streaming_engine() {
+        // CS-1 regression for F1: the streaming engine produces
+        // sort-prefix-aligned multi-RG output and reports the inputs'
+        // prefix unchanged via `MergeOutputFile.output_rg_partition_prefix_len`.
+        // Before this fix, `merge_parquet_split_metadata` would
+        // unconditionally demote to 0 whenever `num_row_groups > 1`,
+        // breaking CS-1 (metastore disagreed with the file's KV) and
+        // leaking aligned outputs into the unaligned compaction bucket
+        // on every subsequent merge.
+        let mut s0 = make_test_split("s0", (1000, 2000), 0);
+        let mut s1 = make_test_split("s1", (1000, 2000), 0);
+        s0.rg_partition_prefix_len = 2;
+        s1.rg_partition_prefix_len = 2;
+
+        // num_row_groups = 3 (multi-RG) AND writer reports prefix_len = 2
+        // (the streaming engine's stamp because it verified alignment).
+        let output = make_output_full_with_prefix(300, 12000, 3, 2, (1000, 2000), &["cpu.usage"]);
+        let result = merge_parquet_split_metadata(&[s0, s1], &output, false).unwrap();
+        assert_eq!(
+            result.rg_partition_prefix_len, 2,
+            "metastore must mirror the writer's KV (CS-1); multi-RG aligned output keeps its \
+             prefix claim"
+        );
+    }
+
     #[test]
     fn test_fresh_split_id_generated() {
         let inputs = vec![
@@ -451,7 +534,7 @@ mod tests {
             make_test_split("s1", (1000, 2000), 0),
         ];
         let output = make_output(200, 9000);
-        let result = merge_parquet_split_metadata(&inputs, &output).unwrap();
+        let result = merge_parquet_split_metadata(&inputs, &output, false).unwrap();
 
         assert_ne!(result.split_id.as_str(), "s0");
         assert_ne!(result.split_id.as_str(), "s1");
@@ -467,7 +550,7 @@ mod tests {
         output.row_keys_proto = None;
         output.zonemap_regexes = HashMap::new();
 
-        let result = merge_parquet_split_metadata(&inputs, &output).unwrap();
+        let result = merge_parquet_split_metadata(&inputs, &output, false).unwrap();
 
         assert!(result.row_keys_proto.is_none());
         assert!(result.zonemap_regexes.is_empty());
diff --git a/quickwit/quickwit-parquet-engine/src/merge/mod.rs b/quickwit/quickwit-parquet-engine/src/merge/mod.rs
index 008c456208a..c32b130e170 100644
--- a/quickwit/quickwit-parquet-engine/src/merge/mod.rs
+++ b/quickwit/quickwit-parquet-engine/src/merge/mod.rs
@@ -24,12 +24,14 @@ mod merge_order;
 pub mod metadata_aggregation;
 pub mod policy;
 mod schema;
+pub mod streaming;
 mod writer;
 
 #[cfg(test)]
 mod tests;
 
 use std::path::{Path, PathBuf};
+use std::sync::Arc;
 
 use anyhow::{Context, Result, bail};
 use arrow::array::RecordBatch;
@@ -40,8 +42,9 @@ pub use self::merge_order::MergeRun;
 use crate::sort_fields::{equivalent_schemas_for_compaction, parse_sort_fields};
 use crate::sorted_series::SORTED_SERIES_COLUMN;
 use crate::storage::{
-    PARQUET_META_NUM_MERGE_OPS, PARQUET_META_RG_PARTITION_PREFIX_LEN, PARQUET_META_SORT_FIELDS,
-    PARQUET_META_WINDOW_DURATION, PARQUET_META_WINDOW_START, ParquetWriterConfig,
+    ColumnPageStream, LegacyInputAdapter, PARQUET_META_NUM_MERGE_OPS,
+    PARQUET_META_RG_PARTITION_PREFIX_LEN, PARQUET_META_SORT_FIELDS, PARQUET_META_WINDOW_DURATION,
+    PARQUET_META_WINDOW_START, ParquetWriterConfig, RemoteByteSource, StreamingParquetReader,
 };
 
 /// Configuration for a merge operation.
@@ -64,19 +67,19 @@ pub struct MergeConfig {
 /// Metadata extracted from input files' Parquet KV metadata.
 /// All inputs must agree on sort_fields, window_start, window_duration,
 /// and rg_partition_prefix_len.
-struct InputMetadata {
+#[derive(Clone)]
+pub(crate) struct InputMetadata {
     sort_fields: String,
     window_start_secs: Option<i64>,
     window_duration_secs: u32,
     num_merge_ops: u32,
     /// Number of leading sort columns whose transitions align with row
     /// group boundaries. All input files must agree on this value (it's
-    /// part of the compaction scope key). Splitting row groups at the
-    /// claimed prefix boundary is not implemented by the current merge
-    /// writer — it lands in PR-6 (streaming column-major merge engine).
-    /// Until then, the *output* file is written with prefix 0 regardless
-    /// of this value.
-    #[allow(dead_code)] // wired for PR-6 streaming engine; PR-1 only validates.
+    /// part of the compaction scope key). The streaming merge engine
+    /// (PR-6c.2) honours this on input AND produces prefix-aligned
+    /// output: when inputs have `prefix_len == 0`, the engine
+    /// synthesizes prefix-aligned regions from the merge order and
+    /// promotes the output's `rg_partition_prefix_len` accordingly.
     rg_partition_prefix_len: u32,
 }
 
@@ -94,15 +97,22 @@ pub struct MergeOutputFile {
     /// Number of rows in this output file.
     pub num_rows: usize,
 
-    /// Number of row groups the writer produced for this file. Used by
-    /// `merge_parquet_split_metadata` to decide whether the input prefix
-    /// alignment claim (`rg_partition_prefix_len`) can be propagated to
-    /// the output: a single-RG file vacuously satisfies any claim, so
-    /// we keep the inputs' prefix; a multi-RG file with arbitrary
-    /// boundaries (the only kind the current writer can produce) must
-    /// reset the claim to 0.
+    /// Number of row groups the writer produced for this file.
     pub num_row_groups: usize,
 
+    /// `qh.rg_partition_prefix_len` value the writer embedded in this
+    /// file's KV metadata. The legacy `merge/writer.rs` writer demotes
+    /// to 0 when it produces multi-RG output (its RG boundaries are
+    /// row-count-driven, not prefix-aligned). The streaming writer
+    /// (`merge/streaming/output.rs`) propagates the inputs' prefix
+    /// unchanged because it splits at prefix transitions AND
+    /// `assert_unique_rg_prefix_keys` verifies the file. Carrying the
+    /// value here lets `merge_parquet_split_metadata` (CS-1: metastore
+    /// == KV) propagate it directly to `ParquetSplitMetadata` instead
+    /// of re-deriving — preventing the metastore from disagreeing with
+    /// the on-disk KV when both engines coexist.
+    pub output_rg_partition_prefix_len: u32,
+
     /// File size in bytes.
     pub size_bytes: u64,
 
@@ -475,3 +485,78 @@ fn extract_and_validate_input_metadata(paths: &[PathBuf]) -> Result<InputMetadat
         rg_partition_prefix_len: consensus_prefix_len.unwrap_or(0),
     })
 }
+
+/// Execute a [`policy::ParquetMergeOperation`] by opening each input through
+/// the appropriate `ColumnPageStream` impl, then feeding the streams
+/// to the streaming merge engine.
+///
+/// Routing per input:
+/// - **Regular merge** (`op.target_prefix_len_override == None`): every split is opened directly
+///   via [`StreamingParquetReader`]. MP-3 already requires inputs to share
+///   `rg_partition_prefix_len`, so the streaming engine sees uniform metadata.
+/// - **Promotion merge** (`op.target_prefix_len_override == Some(target)`): splits with
+///   `rg_partition_prefix_len < target` are opened through [`LegacyInputAdapter`] with the same
+///   target — the adapter re-encodes the file in memory as prefix-aligned and rewrites the
+///   `qh.rg_partition_prefix_len` KV. Splits already at `target` are opened directly. The streaming
+///   engine then consumes a homogeneous stream advertising `prefix_len = target` on every input.
+///
+/// `sources` is parallel to `op.splits`: `sources[i]` provides byte-
+/// range reads against `op.splits[i].parquet_file`. The caller (e.g.
+/// the executor wrapper that lives outside this crate) is responsible
+/// for materializing one [`RemoteByteSource`] per split based on its
+/// storage backend (S3, local FS, etc.). A split whose file cannot
+/// be opened through its source surfaces as `LegacyAdapterError::Io`
+/// or `ParquetReadError`.
+///
+/// Returns the merge engine's [`MergeOutputFile`]s. Conversion to
+/// `ParquetSplitMetadata` for the metastore is the caller's
+/// responsibility — use [`metadata_aggregation::merge_parquet_split_metadata`]
+/// with `mixed_prefix_ok = op.target_prefix_len_override.is_some()`.
+pub async fn execute_merge_operation(
+    op: &policy::ParquetMergeOperation,
+    sources: Vec<Arc<dyn RemoteByteSource>>,
+    output_dir: &Path,
+    config: &MergeConfig,
+) -> Result<Vec<MergeOutputFile>> {
+    if sources.len() != op.splits.len() {
+        bail!(
+            "execute_merge_operation: sources.len() ({}) != op.splits.len() ({})",
+            sources.len(),
+            op.splits.len(),
+        );
+    }
+
+    let mut streams: Vec<Box<dyn ColumnPageStream>> = Vec::with_capacity(op.splits.len());
+    for (split, source) in op.splits.iter().zip(sources) {
+        let path = PathBuf::from(&split.parquet_file);
+        let stream: Box<dyn ColumnPageStream> = match op.target_prefix_len_override {
+            Some(target) if split.rg_partition_prefix_len < target => {
+                // Promote this legacy input. The adapter re-encodes in
+                // memory and presents itself as a prefix_len = target
+                // single-RG stream to the merge engine.
+                let adapter = LegacyInputAdapter::try_open(source, path, target)
+                    .await
+                    .with_context(|| {
+                        format!(
+                            "opening legacy adapter for split {} with target_prefix_len = {target}",
+                            split.split_id,
+                        )
+                    })?;
+                Box::new(adapter)
+            }
+            _ => {
+                // Direct streaming reader: regular merge, or promotion
+                // where this input already satisfies the target.
+                let reader = StreamingParquetReader::try_open(source, path)
+                    .await
+                    .with_context(|| {
+                        format!("opening streaming reader for split {}", split.split_id)
+                    })?;
+                Box::new(reader)
+            }
+        };
+        streams.push(stream);
+    }
+
+    streaming::streaming_merge_sorted_parquet_files(streams, output_dir, config).await
+}
diff --git a/quickwit/quickwit-parquet-engine/src/merge/policy/mod.rs b/quickwit/quickwit-parquet-engine/src/merge/policy/mod.rs
index f373f71bd0c..3a0a72d7da8 100644
--- a/quickwit/quickwit-parquet-engine/src/merge/policy/mod.rs
+++ b/quickwit/quickwit-parquet-engine/src/merge/policy/mod.rs
@@ -40,13 +40,25 @@ use crate::split::{ParquetSplitId, ParquetSplitKind, ParquetSplitMetadata};
 pub struct ParquetMergeOperation {
     /// New split ID for the merged output.
     pub merge_split_id: ParquetSplitId,
-    /// The input splits being merged (all from the same compaction scope and
-    /// `num_merge_ops` level).
+    /// The input splits being merged. All share the same `num_merge_ops`
+    /// level and the same windowed compaction scope (sort_fields +
+    /// window). In the default form they also share `rg_partition_prefix_len`;
+    /// in promotion mode (`target_prefix_len_override` is `Some`) they
+    /// may differ in prefix length up to the target — see
+    /// [`Self::promote_legacy`].
     pub splits: Vec<ParquetSplitMetadata>,
+    /// When `Some(N)`, this operation is a legacy-promotion merge:
+    /// inputs may carry `rg_partition_prefix_len < N` (they will be
+    /// opened through [`crate::storage::LegacyInputAdapter`] with
+    /// `target_prefix_len = N`) and the output will declare
+    /// `rg_partition_prefix_len = N`. When `None`, the operation is a
+    /// regular merge and all inputs must agree on prefix_len (default
+    /// behaviour).
+    pub target_prefix_len_override: Option<u32>,
 }
 
 impl ParquetMergeOperation {
-    /// Create a new merge operation consuming the given splits.
+    /// Create a regular merge operation consuming the given splits.
     ///
     /// Generates a fresh split ID for the merged output. The `kind` is inferred
     /// from the first split (all splits in a merge share the same kind).
@@ -56,7 +68,66 @@ impl ParquetMergeOperation {
     /// - **MP-1**: all splits share the same `num_merge_ops` level
     /// - **MP-2**: at least 2 input splits
     /// - **MP-3**: all splits share the same compaction scope (sort_fields + window)
+    ///
+    /// For legacy-promotion operations (inputs at different
+    /// `rg_partition_prefix_len`), use [`Self::promote_legacy`] instead.
     pub fn new(splits: Vec<ParquetSplitMetadata>) -> Self {
+        Self::check_mp1_mp2_mp3(&splits);
+        let kind = splits
+            .first()
+            .map(|s| s.kind)
+            .unwrap_or(ParquetSplitKind::Metrics);
+        Self {
+            merge_split_id: ParquetSplitId::generate(kind),
+            splits,
+            target_prefix_len_override: None,
+        }
+    }
+
+    /// Create a legacy-promotion merge operation.
+    ///
+    /// Inputs may have heterogeneous `rg_partition_prefix_len` as long
+    /// as every input's value is `<= target_prefix_len`. The executor
+    /// opens any input with `prefix_len < target_prefix_len` through
+    /// [`crate::storage::LegacyInputAdapter`] with `target` set to the
+    /// override; inputs already at the target are opened directly via
+    /// the streaming reader.
+    ///
+    /// All other MP-3 dimensions (sort_fields, window) still must
+    /// agree — only the prefix-len equality is relaxed.
+    ///
+    /// # Panics (debug builds) / metrics (all builds)
+    ///
+    /// - **MP-1**: all splits share the same `num_merge_ops` level
+    /// - **MP-2**: at least 2 input splits
+    /// - **MP-3 (relaxed)**: all splits share sort_fields + window
+    /// - All inputs' `rg_partition_prefix_len <= target_prefix_len`, enforced by `assert!` in all
+    ///   builds. Inputs above the target are a planner bug: the adapter cannot demote.
+    pub fn promote_legacy(splits: Vec<ParquetSplitMetadata>, target_prefix_len: u32) -> Self {
+        Self::check_mp1_mp2_mp3(&splits);
+        // Every input must be promotable: prefix_len <= target.
+        // Demoting (input > target) is not the adapter's contract.
+        for (i, split) in splits.iter().enumerate() {
+            assert!(
+                split.rg_partition_prefix_len <= target_prefix_len,
+                "promote_legacy: input {i} has rg_partition_prefix_len = {} > target_prefix_len = \
+                 {target_prefix_len}; the adapter cannot demote a higher prefix to a lower one. \
+                 Pick a target >= max(inputs' prefix_len) or exclude this input.",
+                split.rg_partition_prefix_len,
+            );
+        }
+        let kind = splits
+            .first()
+            .map(|s| s.kind)
+            .unwrap_or(ParquetSplitKind::Metrics);
+        Self {
+            merge_split_id: ParquetSplitId::generate(kind),
+            splits,
+            target_prefix_len_override: Some(target_prefix_len),
+        }
+    }
+
+    fn check_mp1_mp2_mp3(splits: &[ParquetSplitMetadata]) {
         use quickwit_dst::check_invariant;
         use quickwit_dst::invariants::{InvariantId, merge_policy};
 
@@ -90,15 +161,6 @@ impl ParquetMergeOperation {
             InvariantId::MP3,
             merge_policy::all_same_compaction_scope(&sort_fields_vec, &windows)
         );
-
-        let kind = splits
-            .first()
-            .map(|s| s.kind)
-            .unwrap_or(ParquetSplitKind::Metrics);
-        Self {
-            merge_split_id: ParquetSplitId::generate(kind),
-            splits,
-        }
     }
 
     /// Returns the input splits as a slice.
@@ -251,5 +313,69 @@ mod tests {
         ];
         let op = ParquetMergeOperation::new(splits);
         assert_eq!(op.splits.len(), 2);
+        assert!(
+            op.target_prefix_len_override.is_none(),
+            "regular merges don't set the override",
+        );
+    }
+
+    /// Legacy-promotion happy path: a prefix_len=0 split + a prefix_len=2
+    /// split with target=2. Both inputs share the windowed scope; the
+    /// operation records `target_prefix_len_override = Some(2)`.
+    #[test]
+    fn test_promote_legacy_pairs_legacy_with_aligned_peer() {
+        let mut legacy = make_split("legacy", 0, "metric_name|service|ts/V2", Some((0, 3600)));
+        legacy.rg_partition_prefix_len = 0;
+
+        let mut aligned = make_split("aligned", 0, "metric_name|service|ts/V2", Some((0, 3600)));
+        aligned.rg_partition_prefix_len = 2;
+
+        let op = ParquetMergeOperation::promote_legacy(vec![legacy, aligned], 2);
+        assert_eq!(op.splits.len(), 2);
+        assert_eq!(op.target_prefix_len_override, Some(2));
+    }
+
+    /// Promotion requires all inputs to have `prefix_len <= target`.
+    /// Passing an input whose prefix_len exceeds the target is a planner
+    /// bug — the adapter cannot DEMOTE alignment, only promote.
+    #[test]
+    #[should_panic(expected = "cannot demote a higher prefix")]
+    fn test_promote_legacy_rejects_higher_prefix_input() {
+        let mut legacy = make_split("legacy", 0, "a|b|ts/V2", Some((0, 3600)));
+        legacy.rg_partition_prefix_len = 0;
+
+        let mut too_high = make_split("too_high", 0, "a|b|ts/V2", Some((0, 3600)));
+        too_high.rg_partition_prefix_len = 3;
+
+        // target = 2, but too_high.rg_partition_prefix_len = 3.
+        ParquetMergeOperation::promote_legacy(vec![legacy, too_high], 2);
+    }
+
+    /// Promotion still requires MP-3 on the non-prefix scope
+    /// dimensions: sort_fields + window. Mixed sort_fields must still
+    /// panic.
+    #[test]
+    #[should_panic(expected = "MP-3 violated")]
+    fn test_promote_legacy_still_enforces_sort_fields() {
+        let mut a = make_split("a", 0, "metric_name|ts/V2", Some((0, 3600)));
+        a.rg_partition_prefix_len = 0;
+        let mut b = make_split("b", 0, "different|schema/V2", Some((0, 3600)));
+        b.rg_partition_prefix_len = 1;
+        ParquetMergeOperation::promote_legacy(vec![a, b], 1);
+    }
+
+    /// All inputs at the target prefix_len (no actual legacy promotion
+    /// happening) — the constructor still accepts it. The executor
+    /// will just open every input directly without the adapter.
+    /// Useful when a planner emits a promotion op whose inputs all
+    /// already sit at the target.
+    #[test]
+    fn test_promote_legacy_all_at_target_is_valid() {
+        let mut a = make_split("a", 0, "a|ts/V2", Some((0, 3600)));
+        a.rg_partition_prefix_len = 1;
+        let mut b = make_split("b", 0, "a|ts/V2", Some((0, 3600)));
+        b.rg_partition_prefix_len = 1;
+        let op = ParquetMergeOperation::promote_legacy(vec![a, b], 1);
+        assert_eq!(op.target_prefix_len_override, Some(1));
     }
 }
diff --git a/quickwit/quickwit-parquet-engine/src/merge/schema.rs b/quickwit/quickwit-parquet-engine/src/merge/schema.rs
index 78c554d2cb3..085a673b463 100644
--- a/quickwit/quickwit-parquet-engine/src/merge/schema.rs
+++ b/quickwit/quickwit-parquet-engine/src/merge/schema.rs
@@ -59,45 +59,74 @@ pub fn align_inputs_to_union_schema(
         bail!("no inputs to align");
     }
 
-    // Collect all fields across all inputs, checking for type conflicts.
-    // String-like types are normalized to Utf8 for internal alignment.
-    let mut field_map: BTreeMap<String, Arc<Field>> = BTreeMap::new();
+    // Track each field's normalized type, whether any input declared
+    // it nullable, and how many of the input batches contain it. The
+    // union field is nullable iff some input observed it as nullable
+    // OR some input is missing the field entirely (a row from a
+    // missing-the-field input will be null in the merged output).
+    // The previous version always defaulted new fields to nullable on
+    // first sight, which broke columns whose nullability must be
+    // preserved (e.g. `List<Float64>` — the writer's non-nullable-
+    // list contract requires the union field to stay non-nullable).
+    struct FieldInfo {
+        normalized_type: DataType,
+        any_nullable: bool,
+        appears_in: usize,
+    }
+    let mut field_map: BTreeMap<String, FieldInfo> = BTreeMap::new();
 
     for (input_idx, batch) in inputs.iter().enumerate() {
         for field in batch.schema().fields() {
             let normalized_type = normalize_type(field.data_type());
 
-            match field_map.get(field.name().as_str()) {
+            match field_map.get_mut(field.name().as_str()) {
                 Some(existing) => {
-                    if *existing.data_type() != normalized_type {
+                    if existing.normalized_type != normalized_type {
                         bail!(
                             "type conflict for column '{}': input 0 has {:?}, input {} has {:?} \
                              (normalized: {:?} vs {:?})",
                             field.name(),
-                            existing.data_type(),
+                            existing.normalized_type,
                             input_idx,
                             field.data_type(),
-                            existing.data_type(),
+                            existing.normalized_type,
                             normalized_type,
                         );
                     }
-                    // If either side is nullable, the union must be too.
-                    if field.is_nullable() && !existing.is_nullable() {
-                        let nullable_field =
-                            Arc::new(Field::new(field.name(), normalized_type, true));
-                        field_map.insert(field.name().clone(), nullable_field);
+                    if field.is_nullable() {
+                        existing.any_nullable = true;
                     }
+                    existing.appears_in += 1;
                 }
                 None => {
-                    // Columns that don't appear in every input must be nullable.
-                    let nullable_field = Arc::new(Field::new(field.name(), normalized_type, true));
-                    field_map.insert(field.name().clone(), nullable_field);
+                    field_map.insert(
+                        field.name().clone(),
+                        FieldInfo {
+                            normalized_type,
+                            any_nullable: field.is_nullable(),
+                            appears_in: 1,
+                        },
+                    );
                 }
             }
         }
     }
 
-    // Build the union schema in Husky column order.
+    // Materialise `Arc<Field>` per the rule above. Keep
+    // `BTreeMap<String, Arc<Field>>` so `build_husky_ordered_schema`
+    // is unchanged.
+    let total_inputs = inputs.len();
+    let field_map: BTreeMap<String, Arc<Field>> = field_map
+        .into_iter()
+        .map(|(name, info)| {
+            let nullable = info.any_nullable || info.appears_in < total_inputs;
+            let field = Arc::new(Field::new(&name, info.normalized_type, nullable));
+            (name, field)
+        })
+        .collect();
+
+    // Build the union schema in storage column order (sort cols first,
+    // then body cols lexicographic).
     let union_schema = build_husky_ordered_schema(&field_map, sort_fields_str)?;
     let union_schema_ref = Arc::new(union_schema);
 
@@ -271,14 +300,23 @@ fn align_batch_to_schema(batch: &RecordBatch, target_schema: &SchemaRef) -> Resu
 /// Normalize an Arrow data type for the internal union schema.
 ///
 /// All string-like types (Utf8, LargeUtf8, Dictionary(*, Utf8/LargeUtf8))
-/// are normalized to Utf8. This ensures `take` works uniformly across
-/// concatenated inputs regardless of their original encoding.
+/// are normalized to Utf8. All byte-array-like types (Binary,
+/// LargeBinary, Dictionary(*, Binary/LargeBinary)) are normalized to
+/// Binary. Parquet stores every string flavour as `BYTE_ARRAY`, and every
+/// binary flavour likewise, so two inputs whose schemas differ only in
+/// flavour within one family (e.g. Utf8 vs LargeUtf8) represent the same
+/// logical data; the union must accept them as one column.
 ///
-/// Non-string types are returned as-is.
+/// This ensures `take` works uniformly across concatenated inputs
+/// regardless of their original encoding. Non-string/non-binary types
+/// are returned as-is.
 fn normalize_type(dt: &DataType) -> DataType {
     if is_string_type(dt) {
         return DataType::Utf8;
     }
+    if is_byte_array_type(dt) {
+        return DataType::Binary;
+    }
     dt.clone()
 }
 
@@ -290,3 +328,17 @@ fn is_string_type(dt: &DataType) -> bool {
         _ => false,
     }
 }
+
+/// Returns true if the data type represents raw byte arrays.
+///
+/// Parquet has a single `BYTE_ARRAY` physical type; Binary and
+/// LargeBinary (and dict-encoded variants) all map to it on the
+/// wire. Schema evolution that changes one to the other across
+/// ingester versions must merge cleanly.
+fn is_byte_array_type(dt: &DataType) -> bool {
+    match dt {
+        DataType::Binary | DataType::LargeBinary => true,
+        DataType::Dictionary(_, value_type) => is_byte_array_type(value_type),
+        _ => false,
+    }
+}
diff --git a/quickwit/quickwit-parquet-engine/src/merge/streaming.rs b/quickwit/quickwit-parquet-engine/src/merge/streaming.rs
new file mode 100644
index 00000000000..2ce432725b3
--- /dev/null
+++ b/quickwit/quickwit-parquet-engine/src/merge/streaming.rs
@@ -0,0 +1,6149 @@
+// Copyright 2021-Present Datadog, Inc.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+//! Streaming column-major merge engine with page-bounded body cols.
+//!
+//! Architecture (multi-input → multi-output sorted merge with sort-cols-first
+//! column ordering):
+//!
+//! 1. **Phase 0 (async): drain sort cols** from each input. With the storage convention that sort
+//!    cols + `sorted_series` precede all body cols within each row group, we can stop the decoder
+//!    after those are fully decoded. The remaining body col pages stay un-read in the input stream,
+//!    ready for phase 3.
+//! 2. **Phase 1: compute merge order** via the existing k-way merge on `(sorted_series,
+//!    timestamp_secs)` from the per-input sort col [`RecordBatch`]es. Produces a run-length-encoded
+//!    merge plan over input row positions.
+//! 3. **Phase 2: compute output boundaries** with the caller's `num_outputs`, splitting at
+//!    `sorted_series` transitions so each output file's key range is non-overlapping with adjacent
+//!    files.
+//! 4. **Phase 3 (blocking + block_on bridges): streaming write**. All output writers are alive for
+//!    the duration. For each column in storage order, every output's col K is written in turn:
+//!    - Sort col / `sorted_series`: applied via `take` from the already-buffered phase 0 data.
+//!    - Body col: each output page is assembled via [`arrow::compute::interleave`] from input page
+//!      slices, with decoders advanced page-by-page via `Handle::block_on` from inside a sync
+//!      iterator. Pages flush to the writer's sink as [`SerializedColumnWriter`]'s page-size
+//!      threshold trips — memory stays bounded by the in-flight output page plus a small number of
+//!      in-flight input pages.
+//!
+//! After all M outputs' col K is done, every input decoder is at the
+//! start of col K+1 in its single row group. Move to col K+1.
+//!
+//! ## Single-RG inputs assumption
+//!
+//! PR-6b.2 only handles **single-row-group inputs**. With multi-RG
+//! inputs the body bytes interleave with successive RGs' sort cols
+//! (`sort_cols_RG0`, `body_cols_RG0`, `sort_cols_RG1`, ...), so
+//! draining sort cols from RG1 onwards requires either consuming +
+//! discarding body cols of RG0 from the stream or buffering them.
+//! Neither fits the page-bounded contract; multi-RG-input streaming
+//! lands in a follow-up. Today's real inputs are either (a) post-PR-3
+//! single-RG ingest splits, or (b) arbitrary multi-RG splits that
+//! PR-5's legacy adapter presents as one synthetic RG. Both
+//! satisfy the assumption.
+//!
+//! [`SerializedColumnWriter`]: parquet::file::writer::SerializedColumnWriter
+
+#![allow(dead_code)]
+
+use std::collections::{HashMap, HashSet};
+use std::ops::Range;
+use std::path::Path;
+use std::sync::Arc;
+
+use anyhow::{Context, Result, bail};
+use arrow::array::{Array, ArrayRef, RecordBatch, new_null_array};
+use arrow::datatypes::{Field, Schema as ArrowSchema, SchemaRef};
+use parquet::file::metadata::ParquetMetaData;
+use tokio::runtime::Handle;
+use tracing::info;
+
+use super::merge_order::{MergeRun, compute_merge_order};
+use super::schema::align_inputs_to_union_schema;
+use super::writer::{apply_merge_permutation, verify_sort_order};
+use super::{InputMetadata, MergeConfig, MergeOutputFile};
+
+mod body_assembler;
+mod output;
+pub(crate) mod region_grouping;
+
+use body_assembler::{BodyColOutputPageAssembler, StreamingBodyColIter};
+use output::{
+    OutputAccumulator, OutputWriterStorage, finalize_output, open_output_writer_for_streaming,
+};
+use region_grouping::{
+    Region, extract_regions_from_metadata, split_region_at_sorted_series,
+    validate_region_order_matches_physical_rg_order,
+};
+
+use crate::sort_fields::{
+    equivalent_schemas_for_compaction, is_timestamp_column_name, parse_sort_fields,
+};
+use crate::sorted_series::SORTED_SERIES_COLUMN;
+use crate::storage::page_decoder::{DecodedPage, StreamDecoder};
+use crate::storage::{
+    ColumnPageStream, PARQUET_META_NUM_MERGE_OPS, PARQUET_META_RG_PARTITION_PREFIX_LEN,
+    PARQUET_META_SORT_FIELDS, PARQUET_META_WINDOW_DURATION, PARQUET_META_WINDOW_START,
+};
+
+/// Output page size in rows for body-col assembly. Each call to the
+/// sync iterator passed to [`write_next_column_arrays`] yields one
+/// `ArrayRef` of up to this many rows; the parquet writer flushes
+/// physical pages independently as encoded bytes cross
+/// `data_page_size_limit`. 1024 keeps assembled arrays small enough
+/// to bound per-output memory but large enough to amortise per-page
+/// fixed costs.
+///
+/// [`write_next_column_arrays`]: crate::storage::streaming_writer::RowGroupBuilder::write_next_column_arrays
+pub(crate) const OUTPUT_PAGE_ROWS: usize = 1024;
+
+/// Peak observed length of any input's `body_col_page_cache` since the
+/// last reset, set unconditionally inside the merge so the invariant is
+/// observable in any test build. Used by the MS-7 page-bounded-memory
+/// test to assert that the cache stays bounded by a small constant
+/// regardless of input column size, and by cross-crate integration tests
+/// that need to confirm the streaming engine ran (any non-zero value
+/// proves a streaming merge executed).
+#[cfg(any(test, feature = "testsuite"))]
+pub static PEAK_BODY_COL_PAGE_CACHE_LEN: std::sync::atomic::AtomicUsize =
+    std::sync::atomic::AtomicUsize::new(0);
+
+#[cfg(any(test, feature = "testsuite"))]
+pub(crate) fn record_body_col_page_cache_len(len: usize) {
+    use std::sync::atomic::Ordering;
+    let mut prev = PEAK_BODY_COL_PAGE_CACHE_LEN.load(Ordering::Relaxed);
+    while len > prev {
+        match PEAK_BODY_COL_PAGE_CACHE_LEN.compare_exchange_weak(
+            prev,
+            len,
+            Ordering::Relaxed,
+            Ordering::Relaxed,
+        ) {
+            Ok(_) => break,
+            Err(observed) => prev = observed,
+        }
+    }
+}
+
+#[cfg(not(any(test, feature = "testsuite")))]
+pub(crate) fn record_body_col_page_cache_len(_len: usize) {}
+
+/// Streaming N-input → M-output column-major merge.
+///
+/// See module docs for the four phases. Returns one
+/// [`MergeOutputFile`] per output file produced (zero-row outputs are
+/// dropped). Caller's `config.num_outputs` is the upper bound on the
+/// number of files; fewer are returned when there are not enough
+/// `sorted_series` transitions to split at.
+pub async fn streaming_merge_sorted_parquet_files(
+    inputs: Vec<Box<dyn ColumnPageStream>>,
+    output_dir: &Path,
+    config: &MergeConfig,
+) -> Result<Vec<MergeOutputFile>> {
+    if inputs.is_empty() {
+        bail!("merge requires at least one input");
+    }
+    if config.num_outputs == 0 {
+        bail!("num_outputs must be at least 1");
+    }
+
+    let input_meta = extract_and_validate_input_metadata(&inputs)?;
+
+    // Reject legacy multi-RG inputs (`rg_partition_prefix_len == 0`
+    // AND any input has >1 row group). These have no alignment claim,
+    // so RG boundaries are arbitrary row counts that may split a
+    // single sort-key value across two RGs. The streaming engine
+    // cannot determine merge regions without column-chunk-bounded
+    // buffering; such inputs must go through `LegacyInputAdapter`
+    // (from PR-5, see `storage::legacy_adapter`), which presents
+    // them as one synthetic single-RG stream.
+    //
+    // This guard catches caller bugs — production code always routes
+    // legacy splits through the adapter (see `merge::execute_merge_operation`
+    // in `merge/mod.rs`), so a raw legacy `StreamingParquetReader`
+    // arriving here is a wiring mistake, not a supported input shape.
+    // Bail with a clear pointer rather than wading further into the
+    // streaming pipeline with mis-aligned RGs.
+    if input_meta.rg_partition_prefix_len == 0 {
+        for (idx, stream) in inputs.iter().enumerate() {
+            let num_rgs = stream.metadata().num_row_groups();
+            if num_rgs > 1 {
+                bail!(
+                    "legacy multi-RG inputs (rg_partition_prefix_len=0) must go through the PR-5 \
+                     adapter — input {idx} has {num_rgs} row groups with no alignment claim"
+                );
+            }
+        }
+    }
+
+    info!(
+        num_inputs = inputs.len(),
+        num_outputs = config.num_outputs,
+        sort_fields = %input_meta.sort_fields,
+        "starting streaming sorted parquet merge"
+    );
+
+    let output_dir = output_dir.to_path_buf();
+    let writer_config = config.writer_config.clone();
+    let num_outputs = config.num_outputs;
+
+    // Move everything onto a blocking task. Inside, the decoders need to
+    // make async I/O calls (page fetches over the network); we drive
+    // those via `handle.block_on(...)` from inside sync iterators that
+    // feed the parquet writer's column-write methods. The writer is
+    // sync; this single-task pattern avoids the lifetime complexity of
+    // moving borrowed `RowGroupBuilder`s across tokio tasks.
+    let result = tokio::task::spawn_blocking(move || -> Result<Vec<MergeOutputFile>> {
+        let handle = Handle::current();
+
+        let mut inputs = inputs;
+        let mut decoders_state = build_input_decoders_state(&mut inputs)?;
+
+        // Cross-input precondition for the body-col memory bound. See
+        // the function doc and the `body_col_page_caches` field doc for
+        // the load-bearing argument.
+        assert_inputs_in_husky_body_col_order(&decoders_state, &input_meta.sort_fields)?;
+
+        // Pre-compute regions from RG metadata. With prefix_len >= 1
+        // each region is one sort-prefix value across inputs (each
+        // contributing input has exactly one RG in that region). With
+        // prefix_len == 0 (validated single-RG above) there is one
+        // region covering all inputs — we'll subdivide it below by
+        // walking the merge order to find first-sort-col transitions.
+        let regions = extract_regions_from_metadata(&decoders_state, &input_meta)?;
+
+        if regions.is_empty() {
+            info!("all inputs empty, producing no output");
+            return Ok(Vec::new());
+        }
+
+        // MS-2: validate that the BTreeMap-driven region order
+        // agrees with each input's physical RG order. The streaming
+        // engine reads inputs sequentially — it cannot rewind. If
+        // region K's contributing RG for input I is physically AFTER
+        // a later region L's contributing RG for the same input, the
+        // engine would bail mid-merge with "page from rg X while
+        // draining rg Y" (Err returned up the spawn_blocking task,
+        // not a panic). This typically means the input was sorted
+        // in the opposite direction from what the sort schema
+        // declares (e.g., metric_name written DESC on disk but the
+        // sort schema says ASC). Reject upfront with a clearer
+        // error that names the offending input and region.
+        validate_region_order_matches_physical_rg_order(&regions, decoders_state.len())?;
+
+        let total_rows: usize = regions.iter().map(|r| r.total_rows()).sum();
+        let target_per_output = total_rows.div_ceil(num_outputs).max(1);
+
+        info!(
+            total_rows,
+            num_regions = regions.len(),
+            num_outputs,
+            "streaming merge regions computed"
+        );
+
+        // Build the union schema once across all inputs' arrow schemas.
+        let arrow_schemas: Vec<SchemaRef> = decoders_state
+            .iter()
+            .map(|s| Arc::clone(&s.arrow_schema))
+            .collect();
+        let union_schema =
+            build_full_union_schema_from_arrow_schemas(&arrow_schemas, &input_meta.sort_fields)?;
+
+        // Region processing loop. For each top-level region we may
+        // need to subdivide it across multiple output files so we
+        // honor `num_outputs` even when the input layout doesn't
+        // give us enough region boundaries (e.g., one giant
+        // `metric_name` with prefix_len=0). Splitting happens at
+        // `sorted_series` transitions inside the region's merge
+        // order — never inside a single sorted_series run.
+        //
+        // Memory: between top-level regions we reset every input's
+        // per-col page cache + cursor, because pages from one RG
+        // would have row_start values that collide with the next
+        // RG's row-index space. Sub-regions of one top-level region
+        // share an RG, so the cache survives across sub-region
+        // boundaries — that's what lets the col write loop re-read
+        // an earlier col's pages for a later sub-region.
+        let mut outputs: Vec<MergeOutputFile> = Vec::new();
+        let mut current_writer: Option<OutputWriterStorage> = None;
+        let mut current_accumulator: Option<OutputAccumulator> = None;
+        let mut current_output_idx: usize = 0;
+        let mut current_output_rows: usize = 0;
+
+        for (region_idx, region) in regions.iter().enumerate() {
+            if region_idx > 0 {
+                for state in decoders_state.iter_mut() {
+                    state.reset_all_body_col_state();
+                }
+            }
+
+            // Decide whether we need to split this region across
+            // multiple outputs. Two conditions must both hold: the
+            // region's rows would push the current output past the
+            // target AND there is an unused output to advance to.
+            // When splitting kicks in we have to pre-drain this
+            // region's sort cols so we can compute the merge order
+            // and find sorted_series transitions; if no split is
+            // needed we let `process_region` drain internally to
+            // preserve the existing per-region memory bound.
+            //
+            // If the current output is already at-or-past target,
+            // the sub-region loop below will roll over to a fresh
+            // output BEFORE writing this region's first sub-region.
+            // The split decision must be made against that fresh
+            // output's full budget — not the current (zero) remainder
+            // — otherwise `split_region_at_sorted_series` would cut
+            // after the very first sorted_series run, and the small
+            // leftover plus the large continuation (both inheriting
+            // the region's prefix key) would land in the same new
+            // output, tripping the PA-3 duplicate-prefix-RG check in
+            // `finalize_output`. Same for `outputs_remaining`: after
+            // the rollover one fewer output is left to fill.
+            let outputs_remaining_raw = num_outputs - current_output_idx;
+            let remaining_in_current_raw = target_per_output.saturating_sub(current_output_rows);
+            let will_roll_over = current_writer.is_some()
+                && current_output_rows >= target_per_output
+                && current_output_idx + 1 < num_outputs;
+            let effective_first_target = if will_roll_over {
+                target_per_output
+            } else {
+                remaining_in_current_raw
+            };
+            let effective_outputs_remaining = if will_roll_over {
+                outputs_remaining_raw.saturating_sub(1)
+            } else {
+                outputs_remaining_raw
+            };
+            let needs_split =
+                effective_outputs_remaining > 1 && region.total_rows() > effective_first_target;
+
+            let prefetched: Option<Vec<RecordBatch>> = if needs_split {
+                Some(drain_and_align_region(
+                    &handle,
+                    &mut decoders_state,
+                    region,
+                    &input_meta.sort_fields,
+                )?)
+            } else {
+                None
+            };
+
+            let sub_regions: Vec<Region> = if let Some(ref prefetched_batches) = prefetched {
+                let merge_order = compute_merge_order(prefetched_batches, &input_meta.sort_fields)?;
+                split_region_at_sorted_series(
+                    region,
+                    &merge_order,
+                    prefetched_batches,
+                    effective_first_target,
+                    target_per_output,
+                    effective_outputs_remaining,
+                )?
+            } else {
+                vec![region.clone()]
+            };
+
+            if needs_split {
+                // Multi-output sub-region path: hoist the col loop OUT
+                // of the sub-region loop via `process_split_region_col_outer`
+                // so each parquet col chunk is fully consumed across
+                // all sub-regions before the next col starts (F14).
+                //
+                // Group consecutive sub-regions that share an output
+                // writer into one "writer chunk" — each chunk writes
+                // one RG containing all its sub-regions' rows. The
+                // group boundaries are where the running row count
+                // crosses `target_per_output` and there's still
+                // budget for another output.
+                let prefetched_batches = prefetched
+                    .as_deref()
+                    .expect("needs_split path always drains prefetched sort batches");
+
+                // Build writer chunks: pre-decide writer transitions
+                // for this region's sub-regions and coalesce
+                // consecutive sub-regions assigned to the same
+                // writer into one Region. Each writer chunk gets its
+                // own OutputWriterStorage and writes exactly one RG.
+                // Keeping two sub-regions on the same writer as
+                // separate RGs would require writing those RGs
+                // sequentially, so we coalesce them into one RG to
+                // keep the col-outer loop simple.
+                let mut chunk_assignments: Vec<(usize, Vec<&Region>)> = Vec::new();
+                // Pre-rollover initialization. If `will_roll_over` is set
+                // (the previous region already filled `current_output_idx`
+                // and we still have outputs to fill), the first chunk must
+                // land in the NEXT output — not append to the already-full
+                // one. The inner `needs_new_writer` check below guards on
+                // `!chunk_assignments.is_empty()`, so it only fires from
+                // the second iteration on; the first chunk's destination
+                // has to be decided here.
+                //
+                // Companion fix to the `effective_first_target` /
+                // `effective_outputs_remaining` rollover handling above:
+                // the split *decision* uses the rolled-over output's
+                // budget, and the chunk *assignment* must too. Otherwise
+                // the first sub-region would be glued onto the already-
+                // full output, overfilling it by up to `target_per_output`
+                // rows and shrinking the final output count.
+                let (mut active_output_idx, mut active_rows) = if will_roll_over {
+                    (current_output_idx + 1, 0)
+                } else {
+                    (current_output_idx, current_output_rows)
+                };
+                for sub_region in &sub_regions {
+                    let sub_rows = sub_region.total_rows();
+                    let needs_new_writer = !chunk_assignments.is_empty()
+                        && active_rows >= target_per_output
+                        && active_output_idx + 1 < num_outputs;
+                    if needs_new_writer {
+                        active_output_idx += 1;
+                        active_rows = 0;
+                    }
+                    let same_writer_as_last = match chunk_assignments.last() {
+                        Some((o, _)) => *o == active_output_idx,
+                        None => false,
+                    };
+                    if same_writer_as_last {
+                        chunk_assignments
+                            .last_mut()
+                            .expect("non-empty checked above")
+                            .1
+                            .push(sub_region);
+                    } else {
+                        chunk_assignments.push((active_output_idx, vec![sub_region]));
+                    }
+                    active_rows += sub_rows;
+                }
+
+                // Open writers for any new output indices in this
+                // region's chunk assignments. The first chunk reuses
+                // the existing current_writer if it matches.
+                // Materialize concrete writer/accumulator storage in
+                // a Vec so we can take &mut to each.
+                struct ChunkStorage {
+                    writer: OutputWriterStorage,
+                    accumulator: OutputAccumulator,
+                    region: Region,
+                }
+                let mut chunk_storage: Vec<ChunkStorage> =
+                    Vec::with_capacity(chunk_assignments.len());
+                for (idx_in_region, (output_idx, sub_regions_for_chunk)) in
+                    chunk_assignments.iter().enumerate()
+                {
+                    // First chunk: try to reuse current_writer / current_accumulator
+                    // if they match the output_idx; otherwise finalize them and open new.
+                    let can_reuse_current = idx_in_region == 0
+                        && match current_writer.as_ref() {
+                            Some(w) => w.output_idx == *output_idx,
+                            None => false,
+                        };
+                    let (writer, accumulator) = if can_reuse_current {
+                        (
+                            current_writer.take().expect("matched above"),
+                            current_accumulator.take().expect("matched above"),
+                        )
+                    } else {
+                        // Need to advance: finalize current if it's
+                        // not for this output, then open a fresh
+                        // writer at output_idx.
+                        if let (Some(w), Some(acc)) =
+                            (current_writer.take(), current_accumulator.take())
+                        {
+                            outputs.push(finalize_output(w, acc, &input_meta)?);
+                        }
+                        let writer = open_output_writer_for_streaming(
+                            *output_idx,
+                            &output_dir,
+                            &union_schema,
+                            &input_meta,
+                            &writer_config,
+                        )?;
+                        (writer, OutputAccumulator::new(*output_idx))
+                    };
+
+                    // Coalesce sub-regions into one Region by summing
+                    // contributions per (input_idx, rg_idx).
+                    let mut by_input: std::collections::BTreeMap<(usize, usize), (usize, usize)> =
+                        std::collections::BTreeMap::new();
+                    for sr in sub_regions_for_chunk {
+                        for c in &sr.contributing {
+                            by_input
+                                .entry((c.input_idx, c.rg_idx))
+                                .and_modify(|(min_start, total)| {
+                                    *min_start = (*min_start).min(c.start_row);
+                                    *total += c.num_rows;
+                                })
+                                .or_insert((c.start_row, c.num_rows));
+                        }
+                    }
+                    let contributing: Vec<region_grouping::RegionContribution> = by_input
+                        .into_iter()
+                        .map(|((input_idx, rg_idx), (start_row, num_rows))| {
+                            region_grouping::RegionContribution {
+                                input_idx,
+                                rg_idx,
+                                start_row,
+                                num_rows,
+                            }
+                        })
+                        .collect();
+                    let combined = Region {
+                        prefix_key: sub_regions_for_chunk[0].prefix_key.clone(),
+                        contributing,
+                    };
+                    chunk_storage.push(ChunkStorage {
+                        writer,
+                        accumulator,
+                        region: combined,
+                    });
+                }
+
+                // Build &mut chunks for the col-outer processor.
+                let chunks: Vec<SplitRegionChunk<'_>> = chunk_storage
+                    .iter_mut()
+                    .map(|s| SplitRegionChunk {
+                        writer_state: &mut s.writer,
+                        accumulator: &mut s.accumulator,
+                        region: s.region.clone(),
+                    })
+                    .collect();
+                process_split_region_col_outer(
+                    &handle,
+                    &mut decoders_state,
+                    chunks,
+                    &union_schema,
+                    &input_meta,
+                    prefetched_batches,
+                )?;
+
+                // Hand chunk storage back to main-loop state: the
+                // last chunk becomes the new current writer; all
+                // earlier chunks finalize now.
+                let last_chunk_storage = chunk_storage
+                    .pop()
+                    .expect("chunk_assignments is non-empty so storage is non-empty");
+                for finished in chunk_storage {
+                    outputs.push(finalize_output(
+                        finished.writer,
+                        finished.accumulator,
+                        &input_meta,
+                    )?);
+                }
+                current_writer = Some(last_chunk_storage.writer);
+                current_accumulator = Some(last_chunk_storage.accumulator);
+                current_output_idx = active_output_idx;
+                current_output_rows = active_rows;
+            } else {
+                // No split needed: ensure a writer is open at the
+                // current output index, then process the region via
+                // the per-region path (`prefetched` is `None` here).
+                if current_writer.is_some()
+                    && current_output_rows >= target_per_output
+                    && current_output_idx + 1 < num_outputs
+                {
+                    if let (Some(w), Some(acc)) =
+                        (current_writer.take(), current_accumulator.take())
+                    {
+                        outputs.push(finalize_output(w, acc, &input_meta)?);
+                    }
+                    current_output_idx += 1;
+                    current_output_rows = 0;
+                }
+
+                if current_writer.is_none() {
+                    let writer = open_output_writer_for_streaming(
+                        current_output_idx,
+                        &output_dir,
+                        &union_schema,
+                        &input_meta,
+                        &writer_config,
+                    )?;
+                    current_writer = Some(writer);
+                    current_accumulator = Some(OutputAccumulator::new(current_output_idx));
+                }
+
+                process_region(
+                    &handle,
+                    &mut decoders_state,
+                    current_writer
+                        .as_mut()
+                        .expect("writer opened above for this region"),
+                    current_accumulator
+                        .as_mut()
+                        .expect("accumulator opened above for this region"),
+                    region,
+                    &union_schema,
+                    &input_meta,
+                    prefetched.as_deref(),
+                )?;
+                current_output_rows += region.total_rows();
+            }
+        }
+
+        // Close the last writer.
+        if let (Some(w), Some(acc)) = (current_writer.take(), current_accumulator.take()) {
+            outputs.push(finalize_output(w, acc, &input_meta)?);
+        }
+
+        // MC-1: total row count preserved.
+        let output_total: usize = outputs.iter().map(|o| o.num_rows).sum();
+        quickwit_dst::check_invariant!(
+            quickwit_dst::invariants::InvariantId::MC1,
+            output_total == total_rows,
+            ": streaming merge input rows={}, output rows={}",
+            total_rows,
+            output_total,
+        );
+
+        Ok(outputs)
+    })
+    .await
+    .context("streaming merge blocking task panicked")??;
+
+    Ok(result)
+}
+
+/// Per-input state held across phase 0 → phase 3 inside the blocking
+/// task. The decoder owns its stream so it persists across phases and
+/// across all output writes for a given input — critical for body
+/// columns whose pages may need to be visited multiple times (one page
+/// can supply rows for more than one output, or one output may need
+/// rows from more than one page). Reconstructing the decoder mid-pass
+/// would reset the per-column `rows_decoded` counter (so `row_start`
+/// becomes wrong) and discard cached dictionary / queued pages.
+pub(crate) struct InputDecoderState {
+    pub(crate) decoder: StreamDecoder<'static>,
+    pub(crate) metadata: Arc<ParquetMetaData>,
+    /// Arrow schema of this input (from parquet → arrow conversion).
+    pub(crate) arrow_schema: SchemaRef,
+    /// Per-input, per-parquet-col page cache. Pages emitted by the
+    /// decoder are stored under their actual `col_idx`, not the col
+    /// the caller was asking about — this lets `fill_page_cache_to_row`
+    /// pull pages of col B during a col A advance (storage order:
+    /// col A pages all stream before col B pages within an RG, so to
+    /// get any col B pages we may have to consume leftover col A
+    /// pages first). The synthesized-prefix path relies on this:
+    /// adjacent regions sharing one RG re-read col A from this cache
+    /// when region 2 starts, and the cache has already been filled in
+    /// by region 1's reads.
+    ///
+    /// Pages are evicted (per col) when the col's cursor advances
+    /// past their last row.
+    ///
+    /// **Memory bound (horizontal, not vertical).** This cache is
+    /// per-input: total memory across inputs is
+    /// `N_inputs × per_input_peak`. The per-input peak, in pages, is
+    /// at most `ceil(OUTPUT_PAGE_ROWS / input_page_rows) + small_slack`.
+    /// Two mechanisms enforce this. `fill_page_cache_to_row` (in
+    /// `body_assembler`) fetches only enough pages to cover the rows
+    /// from *this* input that contribute to the *current output page*
+    /// (a 1024-row slice). The eviction loop in
+    /// `assemble_one_output_page` drops any page whose last row falls
+    /// below the cursor after each output page is emitted. The cache
+    /// never holds a whole input column chunk's worth of pages; a
+    /// regression that did would break the MS-7 invariant asserted by
+    /// `test_ms7_body_col_page_cache_bounded_regardless_of_input_size`.
+    pub(crate) body_col_page_caches: HashMap<usize, Vec<DecodedPage>>,
+    /// Per-parquet-col cursor: the next unconsumed input row for that
+    /// col. Advances monotonically as outputs are written; survives
+    /// across regions of the same input so the body col assembler can
+    /// resume mid-RG for synthesized regions.
+    pub(crate) body_col_cursors: HashMap<usize, usize>,
+}
+
+impl InputDecoderState {
+    /// Position the cursor for `col_idx` to `start_row` and drop any
+    /// cached pages strictly below it. Used at the start of each
+    /// (region, body col) write: for whole-RG regions this is a no-op
+    /// (cursor was already 0 and cache empty); for synthesized
+    /// regions that share an RG with earlier regions, this skips the
+    /// rows the previous region consumed without clearing the rest of
+    /// the cache.
+    pub(crate) fn set_body_col_cursor(&mut self, col_idx: usize, start_row: usize) {
+        self.body_col_cursors.insert(col_idx, start_row);
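+        if let Some(pages) = self.body_col_page_caches.get_mut(&col_idx) {
+            // Evict the leading run of pages that end at or before
+            // `start_row`. A single `drain` avoids the repeated tail
+            // shifts that `remove(0)` per page would cause.
+            let keep_from = pages
+                .iter()
+                .position(|page| page.row_start + page.array.len() > start_row)
+                .unwrap_or(pages.len());
+            pages.drain(..keep_from);
+        }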
+    }
+
+    /// Look up the cached pages for the given col, returning an empty
+    /// slice if none have been decoded yet.
+    pub(crate) fn body_col_cache(&self, col_idx: usize) -> &[DecodedPage] {
+        match self.body_col_page_caches.get(&col_idx) {
+            Some(v) => v.as_slice(),
+            None => &[],
+        }
+    }
+
+    /// Mutable handle to the cache for `col_idx`, creating an empty
+    /// entry if none exists.
+    pub(crate) fn body_col_cache_mut(&mut self, col_idx: usize) -> &mut Vec<DecodedPage> {
+        self.body_col_page_caches.entry(col_idx).or_default()
+    }
+
+    /// Current cursor for `col_idx`; defaults to 0 if untouched.
+    pub(crate) fn body_col_cursor(&self, col_idx: usize) -> usize {
+        self.body_col_cursors.get(&col_idx).copied().unwrap_or(0)
+    }
+
+    /// Total number of cached pages across all cols. Used by the
+    /// MS-7 peak-length probe in tests.
+    pub(crate) fn body_col_caches_total_len(&self) -> usize {
+        self.body_col_page_caches.values().map(|v| v.len()).sum()
+    }
+
+    /// Clear every per-col page cache and every per-col cursor.
+    /// Called between top-level regions: each region typically uses a
+    /// different input RG, and the per-col page `row_start` values
+    /// reported by the decoder are RG-local, so pages cached for RG K
+    /// would conflict with RG K+1's row-index space. Sub-regions of a
+    /// single top-level region share an RG and MUST NOT trigger this
+    /// reset: they rely on the cache surviving from one sub-region to
+    /// the next, so that pages of an earlier col that were read past
+    /// the current sub-region stay available for later sub-regions.
+    pub(crate) fn reset_all_body_col_state(&mut self) {
+        self.body_col_page_caches.clear();
+        self.body_col_cursors.clear();
+    }
+}
+
+/// Build per-input state. The streams are moved in from the caller and
+/// installed in long-lived [`StreamDecoder`]s so per-column state
+/// survives every phase of the merge.
+fn build_input_decoders_state(
+    inputs: &mut Vec<Box<dyn ColumnPageStream>>,
+) -> Result<Vec<InputDecoderState>> {
+    let mut states = Vec::with_capacity(inputs.len());
+    for stream in inputs.drain(..) {
+        let metadata = Arc::clone(stream.metadata());
+        let parquet_schema = metadata.file_metadata().schema_descr();
+        let arrow_schema = parquet::arrow::parquet_to_arrow_schema(parquet_schema, None)
+            .context("converting parquet schema → arrow")?;
+        let decoder = StreamDecoder::from_owned(stream);
+        states.push(InputDecoderState {
+            decoder,
+            metadata,
+            arrow_schema: Arc::new(arrow_schema),
+            body_col_page_caches: HashMap::new(),
+            body_col_cursors: HashMap::new(),
+        });
+    }
+    Ok(states)
+}
+
+/// Extract sort schema, window, and merge-ops metadata from each
+/// input stream and validate consistency across inputs. Reads
+/// `qh.*` KVs from [`ColumnPageStream::metadata`].
+fn extract_and_validate_input_metadata(
+    inputs: &[Box<dyn ColumnPageStream>],
+) -> Result<InputMetadata> {
+    let mut consensus_sort_fields: Option<String> = None;
+    let mut consensus_window_start: Option<Option<i64>> = None;
+    let mut consensus_window_duration: Option<u32> = None;
+    let mut consensus_prefix_len: Option<u32> = None;
+    let mut max_merge_ops: u32 = 0;
+
+    for (idx, stream) in inputs.iter().enumerate() {
+        let metadata = stream.metadata();
+        let kv_metadata = metadata.file_metadata().key_value_metadata();
+
+        let find_kv = |key: &str| -> Option<String> {
+            kv_metadata.and_then(|kvs| {
+                kvs.iter()
+                    .find(|kv| kv.key == key)
+                    .and_then(|kv| kv.value.clone())
+            })
+        };
+
+        let file_sort_fields = match find_kv(PARQUET_META_SORT_FIELDS) {
+            Some(s) => s,
+            None => bail!(
+                "input {idx} is missing {} metadata",
+                PARQUET_META_SORT_FIELDS,
+            ),
+        };
+
+        match &consensus_sort_fields {
+            Some(expected) => {
+                let expected_schema = parse_sort_fields(expected)?;
+                let file_schema = parse_sort_fields(&file_sort_fields).with_context(|| {
+                    format!("parsing sort schema from input {idx}: '{file_sort_fields}'")
+                })?;
+                if !equivalent_schemas_for_compaction(&expected_schema, &file_schema) {
+                    bail!(
+                        "sort schema mismatch in input {idx}: expected '{expected}', found \
+                         '{file_sort_fields}'",
+                    );
+                }
+            }
+            None => {
+                parse_sort_fields(&file_sort_fields).with_context(|| {
+                    format!("parsing sort schema from input {idx}: '{file_sort_fields}'")
+                })?;
+                consensus_sort_fields = Some(file_sort_fields.clone());
+            }
+        }
+
+        let file_window_start = find_kv(PARQUET_META_WINDOW_START)
+            .map(|s| s.parse::<i64>())
+            .transpose()
+            .with_context(|| format!("parsing window_start from input {idx}"))?;
+        match &consensus_window_start {
+            Some(expected) if file_window_start != *expected => {
+                bail!(
+                    "window_start mismatch in input {idx}: expected {:?}, found {:?}",
+                    expected,
+                    file_window_start,
+                );
+            }
+            Some(_) => {}
+            None => consensus_window_start = Some(file_window_start),
+        }
+
+        let file_window_duration = find_kv(PARQUET_META_WINDOW_DURATION)
+            .map(|s| s.parse::<u32>())
+            .transpose()
+            .with_context(|| format!("parsing window_duration from input {idx}"))?
+            .unwrap_or(0);
+        match &consensus_window_duration {
+            Some(expected) if file_window_duration != *expected => {
+                bail!(
+                    "window_duration_secs mismatch in input {idx}: expected {expected}, found \
+                     {file_window_duration}",
+                );
+            }
+            Some(_) => {}
+            None => consensus_window_duration = Some(file_window_duration),
+        }
+
+        let file_merge_ops = find_kv(PARQUET_META_NUM_MERGE_OPS)
+            .map(|s| s.parse::<u32>())
+            .transpose()
+            .with_context(|| format!("parsing num_merge_ops from input {idx}"))?
+            .unwrap_or(0);
+        if file_merge_ops > max_merge_ops {
+            max_merge_ops = file_merge_ops;
+        }
+
+        let file_prefix_len = find_kv(PARQUET_META_RG_PARTITION_PREFIX_LEN)
+            .map(|s| s.parse::<u32>())
+            .transpose()
+            .with_context(|| format!("parsing rg_partition_prefix_len from input {idx}"))?
+            .unwrap_or(0);
+        match &consensus_prefix_len {
+            Some(expected) if file_prefix_len != *expected => {
+                bail!(
+                    "rg_partition_prefix_len mismatch in input {idx}: expected {expected}, found \
+                     {file_prefix_len}",
+                );
+            }
+            Some(_) => {}
+            None => consensus_prefix_len = Some(file_prefix_len),
+        }
+    }
+
+    let sort_fields = match consensus_sort_fields {
+        Some(s) => s,
+        None => bail!("at least one input is required"),
+    };
+
+    // `rg_partition_prefix_len` is intentionally optional in the data
+    // model. Splits written before the prefix-aligned-RG era (and any
+    // split not written by the streaming engine) simply lack the KV.
+    // Inputs that *do* declare a value all had to agree above (else we
+    // already bailed), so falling through to 0 here means "none of the
+    // inputs claimed a prefix" rather than "we lost the value." The
+    // legacy-promotion path in PR-6423 handles mixing 0 with non-zero
+    // prefixes via the `mixed_prefix_ok` escape hatch.
+    Ok(InputMetadata {
+        sort_fields,
+        window_start_secs: consensus_window_start.flatten(),
+        window_duration_secs: consensus_window_duration.unwrap_or(0),
+        num_merge_ops: max_merge_ops + 1,
+        rg_partition_prefix_len: consensus_prefix_len.unwrap_or(0),
+    })
+}
+
+/// Drain a region's contributing inputs' sort cols and align them to
+/// the union sort schema. The result has one entry per global input
+/// index (zero-row placeholders for non-contributing inputs), which
+/// is what `split_region_at_sorted_series` and `process_region` both
+/// expect.
+///
+/// Used by the main loop when a region needs to be sub-divided
+/// across multiple output files — splitting needs the merge order,
+/// which needs the drained sort cols. For regions that fit in a
+/// single output we skip this and let `process_region` drain
+/// internally so the per-region memory cost stays bounded by what
+/// the writer actively consumes.
+fn drain_and_align_region(
+    handle: &Handle,
+    decoders_state: &mut [InputDecoderState],
+    region: &Region,
+    sort_fields_str: &str,
+) -> Result<Vec<RecordBatch>> {
+    let num_inputs = decoders_state.len();
+    let mut sort_col_batches: Vec<Option<RecordBatch>> = vec![None; num_inputs];
+    for c in &region.contributing {
+        let batch = drain_sort_cols_one_input(
+            handle,
+            &mut decoders_state[c.input_idx],
+            sort_fields_str,
+            c.input_idx,
+            c.rg_idx,
+        )?;
+        if batch.num_columns() > 0 && batch.schema().index_of(SORTED_SERIES_COLUMN).is_err() {
+            bail!(
+                "input {} rg {} is missing the '{}' column required for merge",
+                c.input_idx,
+                c.rg_idx,
+                SORTED_SERIES_COLUMN,
+            );
+        }
+        sort_col_batches[c.input_idx] = Some(batch);
+    }
+
+    let mut sort_batch_vec: Vec<RecordBatch> = Vec::with_capacity(num_inputs);
+    for (idx, slot) in sort_col_batches.into_iter().enumerate() {
+        let batch = match slot {
+            Some(b) => b,
+            None => empty_sort_col_record_batch(&decoders_state[idx], sort_fields_str)?,
+        };
+        sort_batch_vec.push(batch);
+    }
+
+    let (_sort_union_schema, aligned_sort_batches) =
+        align_inputs_to_union_schema(&sort_batch_vec, sort_fields_str)?;
+    Ok(aligned_sort_batches)
+}
+
+// ============================================================================
+// Phase 0: drain sort cols from each input
+// ============================================================================
+
+/// Drive each input's decoder via `block_on` until its sort cols +
+/// `sorted_series` are fully decoded for the (single) row group.
+/// Returns one [`RecordBatch`] per input with just those columns; the
+/// rest of each input's body bytes stay un-read in the stream, ready
+/// for phase 3 to consume page-by-page.
+fn drain_sort_cols_all_inputs(
+    handle: &Handle,
+    decoders_state: &mut [InputDecoderState],
+    sort_fields_str: &str,
+) -> Result<Vec<RecordBatch>> {
+    // Single-region streaming path: drain RG 0 of each input. One
+    // region covers all inputs; this applies when every input is
+    // single-RG, or when `rg_partition_prefix_len == 0` and the adapter
+    // presents one synthetic RG per input.
+    let mut batches = Vec::with_capacity(decoders_state.len());
+    for (idx, state) in decoders_state.iter_mut().enumerate() {
+        let batch = drain_sort_cols_one_input(handle, state, sort_fields_str, idx, 0)?;
+        if batch.num_columns() > 0 && batch.schema().index_of(SORTED_SERIES_COLUMN).is_err() {
+            bail!(
+                "input {idx} is missing the '{}' column required for merge",
+                SORTED_SERIES_COLUMN,
+            );
+        }
+        batches.push(batch);
+    }
+    Ok(batches)
+}
+
+/// Process one merge region:
+///
+/// - Obtain the contributing inputs' sort col batches: drain them fresh
+///   from the streaming decoder when `prefetched_sort_batches` is
+///   `None`; otherwise slice the pre-drained batches by each
+///   contribution's row range (used by the synthesized prefix-region
+///   path).
+/// - Compute the region's merge order.
+/// - Open a new output RG in the writer and write all cols (sort cols
+///   via interleave, body cols via the page-bounded assembler).
+/// - Close the RG and accumulate per-output static metadata.
+///
+/// When `prefetched_sort_batches` is provided, each contribution's
+/// `start_row` determines where the body col assembler starts inside
+/// the contributing input's RG. The body col stream is shared with
+/// adjacent synthesized regions of the same input, so its cache and
+/// cursor must advance monotonically; see the `set_body_col_cursor`
+/// calls in the body col loop below.
+#[allow(clippy::too_many_arguments)]
+fn process_region(
+    handle: &Handle,
+    decoders_state: &mut [InputDecoderState],
+    writer_state: &mut OutputWriterStorage,
+    accumulator: &mut OutputAccumulator,
+    region: &Region,
+    union_schema: &SchemaRef,
+    input_meta: &InputMetadata,
+    prefetched_sort_batches: Option<&[RecordBatch]>,
+) -> Result<()> {
+    // 1. Obtain sort col batches for this region's contributing inputs. When prefetched batches are
+    //    supplied (synthesized path), slice them by the contribution's row range so a single RG
+    //    drained once can feed multiple adjacent regions. Otherwise drain a fresh whole-RG batch
+    //    from the streaming decoder. The result is indexed BY GLOBAL INPUT INDEX, with zero-row
+    //    placeholders for non-contributing inputs so the BodyColOutputPageAssembler sees a uniform
+    //    input-count.
+    let num_inputs = decoders_state.len();
+    let mut sort_col_batches: Vec<Option<RecordBatch>> = (0..num_inputs).map(|_| None).collect();
+    for c in &region.contributing {
+        let input_idx = c.input_idx;
+        let rg_idx = c.rg_idx;
+        let batch = match prefetched_sort_batches {
+            Some(prefetched) => prefetched[input_idx].slice(c.start_row, c.num_rows),
+            None => drain_sort_cols_one_input(
+                handle,
+                &mut decoders_state[input_idx],
+                &input_meta.sort_fields,
+                input_idx,
+                rg_idx,
+            )?,
+        };
+        if batch.num_columns() > 0 && batch.schema().index_of(SORTED_SERIES_COLUMN).is_err() {
+            bail!(
+                "input {input_idx} rg {rg_idx} is missing the '{}' column required for merge",
+                SORTED_SERIES_COLUMN,
+            );
+        }
+        sort_col_batches[input_idx] = Some(batch);
+    }
+
+    // Materialise into a `Vec<RecordBatch>` per input. Non-contributing
+    // inputs get zero-row placeholders with the input's sort col schema
+    // so `compute_merge_order` and the body col assembler see uniform
+    // shapes.
+    let mut sort_batch_vec: Vec<RecordBatch> = Vec::with_capacity(num_inputs);
+    for (idx, slot) in sort_col_batches.into_iter().enumerate() {
+        let batch = match slot {
+            Some(b) => b,
+            None => empty_sort_col_record_batch(&decoders_state[idx], &input_meta.sort_fields)?,
+        };
+        sort_batch_vec.push(batch);
+    }
+
+    // Per-input cursor offsets to feed to `set_body_col_cursor` in the
+    // body col loop below. For whole-RG regions this is 0 everywhere;
+    // for synthesized regions it is `c.start_row` of the contributing
+    // input so the body col assembler walks rows starting at the
+    // region's first input row instead of restarting at 0.
+    let mut input_start_rows: Vec<usize> = vec![0; num_inputs];
+    for c in &region.contributing {
+        input_start_rows[c.input_idx] = c.start_row;
+    }
+
+    // 2. Align to union sort schema for the merge-order comparator.
+    let (sort_union_schema, aligned_sort_batches) =
+        align_inputs_to_union_schema(&sort_batch_vec, &input_meta.sort_fields)?;
+
+    // 3. Compute merge order for this region.
+    let merge_order = compute_merge_order(&aligned_sort_batches, &input_meta.sort_fields)?;
+    let region_rows: usize = merge_order.iter().map(|r| r.row_count).sum();
+    if region_rows == 0 {
+        return Ok(());
+    }
+
+    // 4. Apply the merge permutation to the sort col batches to get the region's sorted sort-col
+    //    batch. This will be appended to the output accumulator; also used to compute take indices
+    //    for the body col assembler.
+    let region_sort_batch =
+        apply_merge_permutation(&aligned_sort_batches, &sort_union_schema, &merge_order)
+            .context("applying merge permutation for region sort cols")?;
+
+    // MC-3: verify the region's output is sorted.
+    verify_sort_order(&region_sort_batch, &input_meta.sort_fields);
+
+    // 5. Build per-region destinations: maps (input_idx, input_row) → (output_idx=0,
+    //    position_in_region). The body col assembler walks this to find which (input, row)
+    //    contributes each output position.
+    //
+    //    The destinations array is indexed by row *within the sort
+    //    batch* — which for whole-RG regions equals "row within the
+    //    RG" and for synthesized regions equals "row within the
+    //    region's slice". In both cases that index lines up with what
+    //    the body col decoder's `row_start` reports for the current
+    //    RG, modulo the per-input `start_row` offset added below.
+    let mut destinations: Vec<Vec<Option<(usize, usize)>>> = aligned_sort_batches
+        .iter()
+        .enumerate()
+        .map(|(idx, b)| {
+            // For the synthesized path the body col assembler walks
+            // absolute input rows; pad the destinations array so the
+            // index space matches what the page decoder reports.
+            vec![None; input_start_rows[idx] + b.num_rows()]
+        })
+        .collect();
+    let mut pos = 0usize;
+    for run in &merge_order {
+        for r in 0..run.row_count {
+            let absolute_row = input_start_rows[run.input_index] + run.start_row + r;
+            destinations[run.input_index][absolute_row] = Some((0, pos));
+            pos += 1;
+        }
+    }
+    let region_destinations = InputRowDestinations {
+        per_input: destinations,
+        rows_per_output: vec![region_rows],
+    };
+
+    // 6. Open a new output RG and write all cols in union schema order.
+    let mut row_group = writer_state.writer.start_row_group().with_context(|| {
+        format!(
+            "opening row group for output {} region",
+            writer_state.output_idx,
+        )
+    })?;
+    writer_state.num_row_groups += 1;
+
+    for (col_idx, field) in union_schema.fields().iter().enumerate() {
+        let col_name = field.name();
+        if sort_union_schema.index_of(col_name).is_ok() {
+            // Sort col: take from the already-built region_sort_batch.
+            let arrays = build_sort_col_pages_from_sorted_batch(&region_sort_batch, col_name)?;
+            row_group
+                .write_next_column_arrays(arrays)
+                .with_context(|| {
+                    format!(
+                        "writing sort col '{col_name}' (col_idx {col_idx}) to output {}",
+                        writer_state.output_idx,
+                    )
+                })?;
+        } else {
+            // Body col: stream via the page-bounded assembler. Resolve
+            // each input's parquet col_idx for this union-schema col
+            // first, then position the per-col cursor at the region's
+            // `start_row` for that input. For whole-RG regions
+            // `start_row == 0` (no-op for first region); for synthesized
+            // regions sharing an RG with earlier regions the cursor
+            // jumps past already-emitted rows so we don't re-emit them.
+            // The decoder itself is never reset — its per-(rg, col)
+            // `rows_decoded` counters and queued pages must survive so
+            // subsequent decode calls return correct row offsets.
+            let mut input_col_indices: Vec<Option<usize>> = Vec::with_capacity(num_inputs);
+            for state in decoders_state.iter() {
+                input_col_indices.push(state.arrow_schema.index_of(col_name).ok());
+            }
+            for (idx, state) in decoders_state.iter_mut().enumerate() {
+                if let Some(col_parquet_idx) = input_col_indices[idx] {
+                    state.set_body_col_cursor(col_parquet_idx, input_start_rows[idx]);
+                }
+            }
+
+            let track_service = col_name == "service";
+
+            let assembler = BodyColOutputPageAssembler::new(
+                handle,
+                decoders_state,
+                &input_col_indices,
+                &region_destinations,
+                0, // out_idx is always 0 within a single-region call
+                col_name,
+                field.as_ref(),
+            );
+
+            // Feed pages one at a time into `write_next_column_arrays`
+            // via the streaming iterator, so memory stays bounded by
+            // output-page size instead of column-chunk size. Assembly
+            // errors are surfaced through `error_slot` and checked
+            // after the write returns.
+            let mut error_slot: Option<anyhow::Error> = None;
+            let service_collector: Option<&mut HashSet<String>> = if track_service {
+                Some(&mut accumulator.service_names)
+            } else {
+                None
+            };
+            let stream_iter = StreamingBodyColIter {
+                inner: assembler.into_iter(),
+                error_slot: &mut error_slot,
+                service_collector,
+            };
+            let write_result = row_group.write_next_column_arrays(stream_iter);
+
+            if let Some(err) = error_slot {
+                return Err(err).with_context(|| {
+                    format!(
+                        "assembling body col '{col_name}' for output {} region",
+                        writer_state.output_idx,
+                    )
+                });
+            }
+            write_result.with_context(|| {
+                format!(
+                    "writing body col '{col_name}' to output {} region",
+                    writer_state.output_idx,
+                )
+            })?;
+        }
+    }
+
+    row_group.finish().with_context(|| {
+        format!(
+            "finishing region row group for output {}",
+            writer_state.output_idx
+        )
+    })?;
+
+    // 7. Accumulate this region's contribution to the output.
+    accumulator.append_sort_batch(region_sort_batch)?;
+    accumulator.num_rows += region_rows;
+
+    Ok(())
+}
+
+/// One writer-chunk handed to [`process_split_region_col_outer`].
+///
+/// `region` is the combined region for this chunk: one or more
+/// adjacent sub-regions of a single top-level region that share an
+/// output writer. Coalescing adjacent sub-regions into one Region
+/// lets each chunk's RG cover all the rows that this writer-chunk
+/// owns in one shot, so we open exactly one RG per chunk and the
+/// col-outer body-col loop can interleave writes across N chunks
+/// holding N concurrent RGs (one per writer).
+pub(crate) struct SplitRegionChunk<'a> {
+    pub(crate) writer_state: &'a mut OutputWriterStorage,
+    pub(crate) accumulator: &'a mut OutputAccumulator,
+    pub(crate) region: Region,
+}
+
+/// Per-chunk pre-computed state built up in phase 1 of
+/// [`process_split_region_col_outer`] before any body cols are
+/// written. Persisted across the col-outer loop so each iteration of
+/// the inner loop has the destinations / cursors / row counts it
+/// needs without re-computing.
+struct ChunkPrep {
+    sort_union_schema: SchemaRef,
+    region_sort_batch: RecordBatch,
+    region_destinations: InputRowDestinations,
+    region_rows: usize,
+    input_start_rows: Vec<usize>,
+}
+
+/// Col-outer processor for the prefix_len=0 multi-output sub-region
+/// case. Same per-region work as [`process_region`] (merge order,
+/// permutation, sort cols, body cols), but with the col loop hoisted
+/// OUT of the sub-region loop so each parquet col chunk is fully
+/// consumed from the page stream before the next col starts.
+///
+/// Why: the parquet stream emits pages in column-major order. The
+/// previous sub-region-outer / col-inner ordering meant that while
+/// processing sub-region 0's col K, the stream emitted cols 0..K-1's
+/// remaining pages first (they had to be skipped past to reach col
+/// K). Those skipped pages were cached under their own col_idx for
+/// later sub-regions to consume — and the cache scaled with input
+/// row count (more rows = more leftover pages per col chunk).
+/// Col-outer fully drains col K's chunk across all sub-regions
+/// before reading any col K+1 pages, so cache[col K] empties before
+/// col K+1's pages start arriving. Peak total cache returns to
+/// O(P × num_body_cols × num_chunks), where P is the per-col
+/// output-page bound (distinct from the column index K above), so it
+/// is bounded independently of row count. See the MS-7 sub-region
+/// test for the regression guard.
+///
+/// Each chunk holds one RG concurrently; chunks correspond to
+/// distinct output writers, so the parquet writers' single-active-RG
+/// constraint is respected. Sub-regions belonging to the same writer
+/// are pre-coalesced into one Region by the caller.
+fn process_split_region_col_outer(
+    handle: &Handle,
+    decoders_state: &mut [InputDecoderState],
+    chunks: Vec<SplitRegionChunk<'_>>,
+    union_schema: &SchemaRef,
+    input_meta: &InputMetadata,
+    prefetched_sort_batches: &[RecordBatch],
+) -> Result<()> {
+    if chunks.is_empty() {
+        return Ok(());
+    }
+    let num_inputs = decoders_state.len();
+
+    // Phase 1: per-chunk prep — sort batches, merge order, permutation,
+    // destinations. No I/O against the body-col stream yet.
+    let mut preps: Vec<ChunkPrep> = Vec::with_capacity(chunks.len());
+    for chunk in &chunks {
+        let mut sort_col_batches: Vec<Option<RecordBatch>> =
+            (0..num_inputs).map(|_| None).collect();
+        for c in &chunk.region.contributing {
+            // Slice the prefetched (already-drained for the whole top-
+            // level region) sort batch by this chunk's row range for
+            // each contributing input.
+            sort_col_batches[c.input_idx] =
+                Some(prefetched_sort_batches[c.input_idx].slice(c.start_row, c.num_rows));
+        }
+        let mut sort_batch_vec: Vec<RecordBatch> = Vec::with_capacity(num_inputs);
+        for (idx, slot) in sort_col_batches.into_iter().enumerate() {
+            let batch = match slot {
+                Some(b) => b,
+                None => empty_sort_col_record_batch(&decoders_state[idx], &input_meta.sort_fields)?,
+            };
+            sort_batch_vec.push(batch);
+        }
+
+        let mut input_start_rows: Vec<usize> = vec![0; num_inputs];
+        for c in &chunk.region.contributing {
+            input_start_rows[c.input_idx] = c.start_row;
+        }
+
+        let (sort_union_schema, aligned_sort_batches) =
+            align_inputs_to_union_schema(&sort_batch_vec, &input_meta.sort_fields)?;
+        let merge_order = compute_merge_order(&aligned_sort_batches, &input_meta.sort_fields)?;
+        let region_rows: usize = merge_order.iter().map(|r| r.row_count).sum();
+        let region_sort_batch =
+            apply_merge_permutation(&aligned_sort_batches, &sort_union_schema, &merge_order)
+                .context("applying merge permutation for split-region chunk")?;
+        verify_sort_order(&region_sort_batch, &input_meta.sort_fields);
+
+        let mut destinations: Vec<Vec<Option<(usize, usize)>>> = aligned_sort_batches
+            .iter()
+            .enumerate()
+            .map(|(idx, b)| vec![None; input_start_rows[idx] + b.num_rows()])
+            .collect();
+        let mut pos = 0usize;
+        for run in &merge_order {
+            for r in 0..run.row_count {
+                let absolute_row = input_start_rows[run.input_index] + run.start_row + r;
+                destinations[run.input_index][absolute_row] = Some((0, pos));
+                pos += 1;
+            }
+        }
+        let region_destinations = InputRowDestinations {
+            per_input: destinations,
+            rows_per_output: vec![region_rows],
+        };
+
+        preps.push(ChunkPrep {
+            sort_union_schema,
+            region_sort_batch,
+            region_destinations,
+            region_rows,
+            input_start_rows,
+        });
+    }
+
+    // Phase 2: open one RG per chunk and bundle the per-chunk
+    // resources (rg + accumulator + num_row_groups ref + prep) into a
+    // single Vec. Consuming `chunks` in one pass avoids the
+    // borrow-checker conflict of holding rg-borrows of chunks while
+    // re-borrowing chunks for the body col loop.
+    struct ActiveChunk<'a> {
+        rg: crate::storage::streaming_writer::RowGroupBuilder<'a, std::fs::File>,
+        accumulator: &'a mut OutputAccumulator,
+        num_row_groups: &'a mut usize,
+        output_idx: usize,
+        prep: ChunkPrep,
+    }
+
+    let mut actives: Vec<ActiveChunk<'_>> = Vec::with_capacity(chunks.len());
+    for (chunk, prep) in chunks.into_iter().zip(preps) {
+        let SplitRegionChunk {
+            writer_state,
+            accumulator,
+            region: _,
+        } = chunk;
+        let output_idx = writer_state.output_idx;
+        // Split writer_state into the writer field (consumed by rg)
+        // and the num_row_groups field (held separately): disjoint
+        // field borrows through a &mut OutputWriterStorage are
+        // allowed.
+        let OutputWriterStorage {
+            writer,
+            num_row_groups,
+            ..
+        } = writer_state;
+        let rg = writer.start_row_group().with_context(|| {
+            format!("opening row group for split-region chunk output {output_idx}")
+        })?;
+        actives.push(ActiveChunk {
+            rg,
+            accumulator,
+            num_row_groups,
+            output_idx,
+            prep,
+        });
+    }
+
+    // Phase 3: write all columns in union-schema order. For each col,
+    // every chunk writes that col to its RG before any chunk moves
+    // on to the next col. Sort cols write from each chunk's
+    // already-sorted region_sort_batch (no stream involvement).
+    // Body cols stream through the per-input page cache; col-outer
+    // ordering ensures col K's chunk pages are fully consumed before
+    // col K+1's pages enter the stream.
+    let sort_field_names: Vec<String> = actives[0]
+        .prep
+        .sort_union_schema
+        .fields()
+        .iter()
+        .map(|f| f.name().clone())
+        .collect();
+    for (col_idx, field) in union_schema.fields().iter().enumerate() {
+        let col_name = field.name();
+        let is_sort_col = sort_field_names.iter().any(|n| n == col_name);
+
+        if is_sort_col {
+            for (chunk_idx, active) in actives.iter_mut().enumerate() {
+                let arrays = build_sort_col_pages_from_sorted_batch(
+                    &active.prep.region_sort_batch,
+                    col_name,
+                )?;
+                active
+                    .rg
+                    .write_next_column_arrays(arrays)
+                    .with_context(|| {
+                        format!(
+                            "writing sort col '{col_name}' (col_idx {col_idx}) to split-region \
+                             chunk {chunk_idx}",
+                        )
+                    })?;
+            }
+        } else {
+            let mut input_col_indices: Vec<Option<usize>> = Vec::with_capacity(num_inputs);
+            for state in decoders_state.iter() {
+                input_col_indices.push(state.arrow_schema.index_of(col_name).ok());
+            }
+            let track_service = col_name == "service";
+
+            for (chunk_idx, active) in actives.iter_mut().enumerate() {
+                for (idx, state) in decoders_state.iter_mut().enumerate() {
+                    if let Some(col_parquet_idx) = input_col_indices[idx] {
+                        state.set_body_col_cursor(
+                            col_parquet_idx,
+                            active.prep.input_start_rows[idx],
+                        );
+                    }
+                }
+
+                let assembler = BodyColOutputPageAssembler::new(
+                    handle,
+                    decoders_state,
+                    &input_col_indices,
+                    &active.prep.region_destinations,
+                    0,
+                    col_name,
+                    field.as_ref(),
+                );
+
+                let mut error_slot: Option<anyhow::Error> = None;
+                let service_collector: Option<&mut HashSet<String>> = if track_service {
+                    Some(&mut active.accumulator.service_names)
+                } else {
+                    None
+                };
+                let stream_iter = StreamingBodyColIter {
+                    inner: assembler.into_iter(),
+                    error_slot: &mut error_slot,
+                    service_collector,
+                };
+                let write_result = active.rg.write_next_column_arrays(stream_iter);
+
+                if let Some(err) = error_slot {
+                    return Err(err).with_context(|| {
+                        format!(
+                            "assembling body col '{col_name}' for split-region chunk {chunk_idx}",
+                        )
+                    });
+                }
+                write_result.with_context(|| {
+                    format!("writing body col '{col_name}' to split-region chunk {chunk_idx}",)
+                })?;
+            }
+        }
+    }
+
+    // Phase 4: finalize each chunk's RG and accumulate per-output
+    // metadata. RG finish must precede num_row_groups bump (MS-3:
+    // they have to agree).
+    for active in actives.into_iter() {
+        let ActiveChunk {
+            rg,
+            accumulator,
+            num_row_groups,
+            output_idx,
+            prep,
+        } = active;
+        rg.finish().with_context(|| {
+            format!("finishing row group for split-region chunk output {output_idx}",)
+        })?;
+        *num_row_groups += 1;
+        accumulator.append_sort_batch(prep.region_sort_batch)?;
+        accumulator.num_rows += prep.region_rows;
+    }
+
+    Ok(())
+}
+
+/// Helper for sort col writes within a region: split the region's
+/// already-sorted sort col into page-sized chunks for
+/// `write_next_column_arrays`.
+fn build_sort_col_pages_from_sorted_batch(
+    sorted_batch: &RecordBatch,
+    col_name: &str,
+) -> Result<Vec<ArrayRef>> {
+    let col_idx = sorted_batch
+        .schema()
+        .index_of(col_name)
+        .with_context(|| format!("missing sort col '{col_name}' in region sorted batch"))?;
+    let col = sorted_batch.column(col_idx);
+    let total_rows = col.len();
+    let mut pages = Vec::with_capacity(total_rows.div_ceil(OUTPUT_PAGE_ROWS));
+    let mut start = 0;
+    while start < total_rows {
+        let len = (total_rows - start).min(OUTPUT_PAGE_ROWS);
+        pages.push(col.slice(start, len));
+        start += len;
+    }
+    Ok(pages)
+}
+
+/// Drain the sort cols (plus `sorted_series`) of row group
+/// `expected_rg_idx` from one input's streaming decoder and return them
+/// as a single [`RecordBatch`]. Returns a zero-row batch when the input
+/// has no such row group, and bails if a body-col page or a page from
+/// another row group arrives before the sort cols are fully drained.
+fn drain_sort_cols_one_input(
+    handle: &Handle,
+    state: &mut InputDecoderState,
+    sort_fields_str: &str,
+    input_idx: usize,
+    expected_rg_idx: usize,
+) -> Result<RecordBatch> {
+    if expected_rg_idx >= state.metadata.num_row_groups() {
+        // No such RG in this input (this also covers inputs with zero
+        // row groups). Return a zero-row batch with the sort cols' fields
+        // preserved so downstream merge order code sees a uniform schema
+        // across inputs.
+        return empty_sort_col_record_batch(state, sort_fields_str);
+    }
+    let sort_field_schema = parse_sort_fields(sort_fields_str)?;
+
+    // The set of column names we treat as "sort columns" for drain
+    // purposes: every sort-schema column name that is present in this
+    // input's arrow schema, plus `sorted_series` (always required).
+    let sort_col_names: HashSet<String> =
+        sort_col_names_for_input(&sort_field_schema, state.arrow_schema.as_ref());
+
+    // Map each sort col's parquet leaf column index → its name. The
+    // page decoder reports pages by parquet column index (matches arrow
+    // top-level field index when there are no nested types).
+    let parquet_schema = state.metadata.file_metadata().schema_descr();
+    let mut sort_col_parquet_indices: HashMap<usize, String> = HashMap::new();
+    for (col_idx, col) in parquet_schema.columns().iter().enumerate() {
+        // For flat schemas (one leaf per top-level field), the parquet
+        // column index equals the arrow top-level field index. We
+        // match by name: parquet `column_path` root → arrow field name.
+        let name = col.path().parts()[0].to_string();
+        if sort_col_names.contains(&name) {
+            sort_col_parquet_indices.insert(col_idx, name);
+        }
+    }
+
+    if sort_col_parquet_indices.is_empty() {
+        // No sort cols present in this input — return an empty batch
+        // with the input's arrow schema. Downstream merge order check
+        // will catch the missing `sorted_series`.
+        return Ok(RecordBatch::new_empty(Arc::clone(&state.arrow_schema)));
+    }
+
+    // Target row count per sort col (from row group's column chunk metadata).
+    let rg_meta = state.metadata.row_group(expected_rg_idx);
+    let mut target_rows_per_col: HashMap<usize, usize> = HashMap::new();
+    for &col_idx in sort_col_parquet_indices.keys() {
+        target_rows_per_col.insert(col_idx, rg_meta.column(col_idx).num_values() as usize);
+    }
+
+    // Drain pages into per-col buffers until all sort cols are fully
+    // decoded for this RG. The streaming engine relies on a hard
+    // storage-ordering contract: within each row group parquet emits
+    // column chunks in schema order (sort cols declared first, body
+    // cols after), so every sort-col page appears before any body-col
+    // page. Cross-file we do NOT require identical body-col ordering —
+    // the body-col loop drives from the union schema and looks each
+    // column up by name; the contract we do require cross-file is
+    // "sort cols come first." A page from a body col arriving here
+    // means a producer violated that contract; bail rather than guess.
+    let mut per_col_pages: HashMap<usize, Vec<ArrayRef>> = HashMap::new();
+    let mut rows_done_per_col: HashMap<usize, usize> =
+        sort_col_parquet_indices.keys().map(|&i| (i, 0)).collect();
+    let mut sort_cols_finished = 0usize;
+    let sort_col_target = sort_col_parquet_indices.len();
+
+    while sort_cols_finished < sort_col_target {
+        let decoded = handle
+            .block_on(state.decoder.decode_next_page())
+            .with_context(|| {
+                format!("decoding sort col page (input {input_idx}, rg {expected_rg_idx})")
+            })?;
+        let page = match decoded {
+            Some(p) => p,
+            None => bail!(
+                "stream ended before sort cols fully drained for input {input_idx} rg \
+                 {expected_rg_idx}: {sort_cols_finished}/{sort_col_target} cols complete",
+            ),
+        };
+
+        if !sort_col_parquet_indices.contains_key(&page.col_idx) {
+            bail!(
+                "input {input_idx} returned a non-sort page (col {}) before all sort cols were \
+                 drained for rg {expected_rg_idx} — this violates Husky storage ordering",
+                page.col_idx,
+            );
+        }
+        if page.rg_idx != expected_rg_idx {
+            bail!(
+                "input {input_idx} returned a page from rg {} while draining sort cols of rg \
+                 {expected_rg_idx}",
+                page.rg_idx,
+            );
+        }
+
+        let array_len = page.array.len();
+        let rows_done = rows_done_per_col
+            .get_mut(&page.col_idx)
+            .expect("sort_col_parquet_indices.contains_key check above guarantees presence");
+        *rows_done += array_len;
+        per_col_pages
+            .entry(page.col_idx)
+            .or_default()
+            .push(page.array);
+
+        if *rows_done == target_rows_per_col[&page.col_idx] {
+            sort_cols_finished += 1;
+        } else if *rows_done > target_rows_per_col[&page.col_idx] {
+            bail!(
+                "input {input_idx} col {} decoded more rows ({}) than expected ({})",
+                page.col_idx,
+                rows_done,
+                target_rows_per_col[&page.col_idx],
+            );
+        }
+    }
+
+    // Build a RecordBatch holding just the sort cols. Field order
+    // matches the arrow schema's order (so downstream consumers see
+    // the same field order whether or not body cols are present).
+    let mut fields: Vec<Arc<Field>> = Vec::new();
+    let mut columns: Vec<ArrayRef> = Vec::new();
+    for (field_idx, field) in state.arrow_schema.fields().iter().enumerate() {
+        if !sort_col_parquet_indices.contains_key(&field_idx) {
+            continue;
+        }
+        let pages = per_col_pages.remove(&field_idx).expect("col drained");
+        let concatenated = concat_arrays(&pages).with_context(|| {
+            format!(
+                "concatenating sort col '{}' pages for input {input_idx}",
+                field.name(),
+            )
+        })?;
+        fields.push(Arc::clone(field));
+        columns.push(concatenated);
+    }
+
+    let schema = Arc::new(ArrowSchema::new(fields));
+    RecordBatch::try_new(schema, columns)
+        .with_context(|| format!("building sort col record batch for input {input_idx}"))
+}
+
+/// Verify that every input's body cols (everything after the sort
+/// cols and `sorted_series`) are in strictly ascending lexicographic
+/// order by name, the same "Husky order" the union schema uses.
+///
+/// **Why this is a hard contract, not a nicety.** Phase 3 iterates
+/// the union schema's body cols alphabetically and asks each input's
+/// decoder to advance to that col via [`fill_page_cache_to_row`].
+/// Parquet emits column chunks in declared schema order, so the
+/// decoder reads pages in *the input's* storage order. If an input's
+/// body cols aren't in the same order as the union iteration, the
+/// decoder has to drain whatever cols precede the requested one on
+/// the wire — those pages get cached under their own `col_idx`,
+/// growing the cache to a full column-chunk's worth for the
+/// misaligned input. Streaming becomes vertical-per-column instead
+/// of horizontal-per-input-page, defeating the O(N × constant)
+/// memory bound documented on
+/// [`InputDecoderState::body_col_page_cache`].
+///
+/// Bails on the first violation with the offending input index and
+/// the offending pair of column names. Returning `Ok(())` here is
+/// the precondition that lets the body-col memory-bound argument
+/// hold across the merge.
+fn assert_inputs_in_husky_body_col_order(
+    decoders_state: &[InputDecoderState],
+    sort_fields_str: &str,
+) -> Result<()> {
+    let sort_field_schema = parse_sort_fields(sort_fields_str)?;
+    for (input_idx, state) in decoders_state.iter().enumerate() {
+        let sort_or_structural =
+            sort_col_names_for_input(&sort_field_schema, state.arrow_schema.as_ref());
+        let body_cols: Vec<&str> = state
+            .arrow_schema
+            .fields()
+            .iter()
+            .map(|f| f.name().as_str())
+            .filter(|n| !sort_or_structural.contains(*n))
+            .collect();
+        for window in body_cols.windows(2) {
+            if window[0] >= window[1] {
+                bail!(
+                    "input {input_idx} body cols not in Husky (alphabetical) order: '{}' precedes \
+                     '{}' in storage order. Streaming-merge memory bound requires body cols to be \
+                     emitted lexicographically by name within each input so the union schema's \
+                     iteration order matches storage order on the wire. Full body-col storage \
+                     order observed: {body_cols:?}",
+                    window[0],
+                    window[1],
+                );
+            }
+        }
+    }
+    Ok(())
+}
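The ordering contract above boils down to an adjacent-pair scan over each input's body-column names. Below is a minimal standalone sketch of that scan, not code from this PR: `first_order_violation` is a hypothetical name, and the Arrow schema is replaced by a plain slice of names. Because the comparison is `>=`, a duplicated name is reported as a violation too.

```rust
/// Returns the first adjacent pair that breaks strictly ascending
/// lexicographic order, or `None` when the names are in order.
/// Hypothetical helper for illustration only.
fn first_order_violation<'a>(body_cols: &[&'a str]) -> Option<(&'a str, &'a str)> {
    body_cols
        .windows(2)
        // `>=` rejects equal neighbours as well as out-of-order ones.
        .find(|w| w[0] >= w[1])
        .map(|w| (w[0], w[1]))
}

#[test]
fn reports_first_misordered_pair() {
    // Body cols of the test fixture, already in storage order.
    let ok = ["metric_type", "service", "timeseries_id", "value"];
    assert_eq!(first_order_violation(&ok), None);

    // `service` stored ahead of `metric_type` breaks the contract.
    let bad = ["service", "metric_type", "value"];
    assert_eq!(first_order_violation(&bad), Some(("service", "metric_type")));
}
```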
+
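+/// Names of the columns treated as "sort cols" for the phase 0 drain:
+/// every sort-schema column present in `arrow_schema`, the legacy
+/// `timestamp_secs` alias when a timestamp sort column is declared and
+/// that column exists, and `sorted_series` when present.
+fn sort_col_names_for_input(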
+    sort_field_schema: &quickwit_proto::sortschema::SortSchema,
+    arrow_schema: &ArrowSchema,
+) -> HashSet<String> {
+    let mut names: HashSet<String> = HashSet::new();
+    for sf in &sort_field_schema.column {
+        if arrow_schema.field_with_name(&sf.name).is_ok() {
+            names.insert(sf.name.clone());
+        }
+        // Legacy schemas may declare `timestamp` but the column is named
+        // `timestamp_secs`. The merge order code already handles this
+        // alias; we want both candidates drained whichever matches.
+        if is_timestamp_column_name(&sf.name)
+            && arrow_schema.field_with_name("timestamp_secs").is_ok()
+        {
+            names.insert("timestamp_secs".to_string());
+        }
+    }
+    if arrow_schema.field_with_name(SORTED_SERIES_COLUMN).is_ok() {
+        names.insert(SORTED_SERIES_COLUMN.to_string());
+    }
+    names
+}
+
+/// Build a zero-row RecordBatch with the input's sort cols + `sorted_series`.
+/// Used when an input file has zero rows (no row groups) so that downstream
+/// k-way merge sees a consistent schema shape across inputs.
+fn empty_sort_col_record_batch(
+    state: &InputDecoderState,
+    sort_fields_str: &str,
+) -> Result<RecordBatch> {
+    let sort_field_schema = parse_sort_fields(sort_fields_str)?;
+    let sort_col_names = sort_col_names_for_input(&sort_field_schema, state.arrow_schema.as_ref());
+    let mut fields: Vec<Arc<Field>> = Vec::new();
+    let mut columns: Vec<ArrayRef> = Vec::new();
+    for field in state.arrow_schema.fields() {
+        if !sort_col_names.contains(field.name()) {
+            continue;
+        }
+        fields.push(Arc::clone(field));
+        columns.push(new_null_array(field.data_type(), 0));
+    }
+    let schema = Arc::new(ArrowSchema::new(fields));
+    RecordBatch::try_new(schema, columns).context("building empty sort col record batch")
+}
+
+fn concat_arrays(arrays: &[ArrayRef]) -> Result<ArrayRef> {
+    if arrays.len() == 1 {
+        return Ok(Arc::clone(&arrays[0]));
+    }
+    let refs: Vec<&dyn Array> = arrays.iter().map(|a| a.as_ref()).collect();
+    Ok(arrow::compute::concat(&refs)?)
+}
+
+// ============================================================================
+// Pre-compute input row → output destination map
+// ============================================================================
+
+/// `per_input[input_idx][input_row] = Some((output_idx, output_pos))`
+/// if that input row contributes to output `output_idx` at position
+/// `output_pos` within that output's row range. `None` means the row
+/// is not in any output (only possible for rows beyond the merge
+/// plan's coverage; shouldn't happen with our merge order).
+#[derive(Debug)]
+pub(crate) struct InputRowDestinations {
+    /// One Vec per input. Length = input's sort-col row count.
+    pub(crate) per_input: Vec<Vec<Option<(usize, usize)>>>,
+    /// Total rows per output index (cumulative writer "expected" rows).
+    pub(crate) rows_per_output: Vec<usize>,
+}
+
+fn build_input_row_destinations(
+    aligned_sort_batches: &[RecordBatch],
+    merge_order: &[MergeRun],
+    boundaries: &[Range<usize>],
+) -> InputRowDestinations {
+    let mut per_input: Vec<Vec<Option<(usize, usize)>>> = aligned_sort_batches
+        .iter()
+        .map(|b| vec![None; b.num_rows()])
+        .collect();
+    let mut rows_per_output: Vec<usize> = vec![0; boundaries.len()];
+
+    for (out_idx, boundary) in boundaries.iter().enumerate() {
+        let runs = &merge_order[boundary.clone()];
+        let mut rows_for_current_output = 0usize;
+        for run in runs {
+            for r in 0..run.row_count {
+                let input_row = run.start_row + r;
+                per_input[run.input_index][input_row] = Some((out_idx, rows_for_current_output));
+                rows_for_current_output += 1;
+            }
+        }
+        rows_per_output[out_idx] = rows_for_current_output;
+    }
+
+    InputRowDestinations {
+        per_input,
+        rows_per_output,
+    }
+}
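To make the destination map concrete, here is a minimal standalone sketch of the same bookkeeping on plain integers. It is an illustration under stated assumptions, not the engine's code: `Run` is a hypothetical stand-in for `MergeRun` that models only the three fields read above, and input row counts are passed directly instead of being taken from record batches.

```rust
use std::ops::Range;

/// Hypothetical stand-in for `MergeRun`; only the fields that
/// `build_input_row_destinations` reads are modelled.
struct Run {
    input_index: usize,
    start_row: usize,
    row_count: usize,
}

/// For each input row, record `(output_idx, position_in_output)`.
/// Also returns the total row count assigned to each output.
fn destinations(
    rows_per_input: &[usize],
    merge_order: &[Run],
    boundaries: &[Range<usize>],
) -> (Vec<Vec<Option<(usize, usize)>>>, Vec<usize>) {
    let mut per_input: Vec<Vec<Option<(usize, usize)>>> =
        rows_per_input.iter().map(|&n| vec![None; n]).collect();
    let mut rows_per_output = vec![0usize; boundaries.len()];
    for (out_idx, boundary) in boundaries.iter().enumerate() {
        // Position restarts at 0 for every output.
        let mut pos = 0usize;
        for run in &merge_order[boundary.clone()] {
            for row in run.start_row..run.start_row + run.row_count {
                per_input[run.input_index][row] = Some((out_idx, pos));
                pos += 1;
            }
        }
        rows_per_output[out_idx] = pos;
    }
    (per_input, rows_per_output)
}

#[test]
fn two_inputs_two_outputs() {
    let order = vec![
        Run { input_index: 0, start_row: 0, row_count: 2 },
        Run { input_index: 1, start_row: 0, row_count: 1 },
        Run { input_index: 0, start_row: 2, row_count: 1 },
        Run { input_index: 1, start_row: 1, row_count: 1 },
    ];
    // Runs 0..2 feed output 0, runs 2..4 feed output 1.
    let (per_input, rows_per_output) = destinations(&[3, 2], &order, &[0..2, 2..4]);
    assert_eq!(per_input[0], vec![Some((0, 0)), Some((0, 1)), Some((1, 0))]);
    assert_eq!(per_input[1], vec![Some((0, 2)), Some((1, 1))]);
    assert_eq!(rows_per_output, vec![3, 2]);
}
```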
+
+// ============================================================================
+// Obsolete PR-6b.2 multi-output-parallel helpers (deleted in PR-6c.2's
+// per-region restructure). The functions below are no longer used —
+// per-region processing in `process_region` is the new path.
+// ============================================================================
+
+/// Build the full union schema across all inputs' arrow schemas
+/// (NOT just sort cols). Reuses the same algorithm as
+/// [`align_inputs_to_union_schema`] but takes pre-extracted arrow
+/// schemas — phase 3 doesn't have full input batches.
+fn build_full_union_schema_from_arrow_schemas(
+    arrow_schemas: &[SchemaRef],
+    sort_fields_str: &str,
+) -> Result<SchemaRef> {
+    // Build zero-row batches with the right schemas; that lets us
+    // reuse `align_inputs_to_union_schema`'s field-merge / storage-
+    // ordering logic unchanged.
+    let empty_batches: Vec<RecordBatch> = arrow_schemas
+        .iter()
+        .map(|s| RecordBatch::new_empty(Arc::clone(s)))
+        .collect();
+    let (schema, _) = align_inputs_to_union_schema(&empty_batches, sort_fields_str)?;
+    Ok(schema)
+}
+
+// ============================================================================
+// Test support
+// ============================================================================
+
+/// Process-wide serial lock for tests that run streaming merges. Every
+/// streaming merge writes to the global `PEAK_BODY_COL_PAGE_CACHE_LEN`
+/// atomic; MS-7 tests reset-then-read it and would race any
+/// concurrent merge in the same test binary. Both the MS-7 tests in
+/// this module and any other test that invokes
+/// `streaming_merge_sorted_parquet_files` (engine-crate parity tests
+/// in `merge::tests::parity`, indexing-crate pipeline tests that
+/// reset the atomic for the same reason) must acquire this lock for
+/// the duration of the merge.
+///
+/// Exposed `pub` under the `testsuite` feature so cross-crate tests
+/// can share the same lock — the streaming engine's atomic is
+/// process-global and the lock has to be too.
+///
+/// Held across `.await` points in MS-7 tests — that's why each
+/// MS-7 test allows `clippy::await_holding_lock`. The lock is
+/// `std::sync::Mutex` and the `#[tokio::test]` runtime is
+/// single-threaded, so holding the guard across await won't deadlock
+/// another thread. `tokio::sync::Mutex` is forbidden by GAP-002.
+#[cfg(any(test, feature = "testsuite"))]
+pub fn ms7_serial_lock() -> std::sync::MutexGuard<'static, ()> {
+    static LOCK: std::sync::Mutex<()> = std::sync::Mutex::new(());
+    // Poisoning is fine — a previous test panicking shouldn't prevent
+    // the next one from acquiring; just unwrap the inner.
+    LOCK.lock().unwrap_or_else(|poisoned| poisoned.into_inner())
+}
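The poison-recovery idiom used by `ms7_serial_lock` can be exercised in isolation. The sketch below is a standalone illustration using only the standard library; `serial_lock` and the module-level `LOCK` are hypothetical names chosen for the example. It shows that a panic while holding the guard poisons the mutex, yet the next caller still acquires it.

```rust
use std::sync::{Mutex, MutexGuard};

// Process-wide lock; `Mutex::new` is usable in a `static` initializer.
static LOCK: Mutex<()> = Mutex::new(());

/// Acquire the lock even if an earlier holder panicked.
/// `PoisonError::into_inner` hands back the guard that `lock()` refused.
fn serial_lock() -> MutexGuard<'static, ()> {
    LOCK.lock().unwrap_or_else(|poisoned| poisoned.into_inner())
}

#[test]
fn lock_survives_a_panicking_holder() {
    // Simulate a test that panics while holding the guard.
    let result = std::thread::spawn(|| {
        let _guard = serial_lock();
        panic!("simulated test failure");
    })
    .join();
    assert!(result.is_err());
    assert!(LOCK.is_poisoned());

    // A later caller is not blocked by the poison flag.
    let _guard = serial_lock();
}
```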
+
+// ============================================================================
+// Tests
+// ============================================================================
+
+#[cfg(test)]
+mod tests {
+    use std::path::PathBuf;
+    use std::sync::Arc;
+
+    use arrow::array::{
+        ArrayRef, BinaryArray, DictionaryArray, Float64Array, Int64Array, StringArray, UInt8Array,
+        UInt64Array,
+    };
+    use arrow::datatypes::{DataType, Field, Int32Type, Schema as ArrowSchema};
+    use base64::Engine;
+    use base64::engine::general_purpose::STANDARD as BASE64;
+    use bytes::Bytes;
+    use parquet::arrow::ArrowWriter;
+    use parquet::file::metadata::KeyValue;
+    use parquet::file::properties::WriterProperties;
+    use parquet::file::reader::{FileReader, SerializedFileReader};
+    use tempfile::TempDir;
+    use tokio::io::AsyncRead;
+
+    use super::region_grouping::{
+        assert_unique_rg_prefix_keys, extract_rg_composite_prefix_key,
+        find_prefix_parquet_col_indices,
+    };
+    use super::*;
+    use crate::split::TAG_SERVICE;
+    use crate::storage::page_decoder::StreamDecoder;
+    use crate::storage::streaming_reader::{RemoteByteSource, StreamingParquetReader};
+    use crate::storage::{
+        Compression, PARQUET_META_ROW_KEYS, PARQUET_META_ZONEMAP_REGEXES, ParquetWriterConfig,
+    };
+
+    // -------- Fixtures --------
+
+    /// Build a sorted metrics RecordBatch with `num_rows` rows in
+    /// the **storage column order**: sort cols (metric_name, timestamp_secs)
+    /// → sorted_series → remaining body cols lexicographic
+    /// (metric_type, service, timeseries_id, value). All rows share
+    /// the single metric_name "cpu.usage". `sorted_series` is monotonic
+    /// from `start_series_idx`. `service` carries nulls every 5th row.
+    fn make_sorted_batch(num_rows: usize, start_series_idx: u64) -> RecordBatch {
+        let dict_type = DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8));
+        let schema = Arc::new(ArrowSchema::new(vec![
+            // sort cols (in sort schema order)
+            Field::new("metric_name", dict_type.clone(), false),
+            Field::new("timestamp_secs", DataType::UInt64, false),
+            // sorted_series marker
+            Field::new("sorted_series", DataType::Binary, false),
+            // body cols lexicographic
+            Field::new("metric_type", DataType::UInt8, false),
+            Field::new("service", dict_type, true),
+            Field::new("timeseries_id", DataType::Int64, false),
+            Field::new("value", DataType::Float64, false),
+        ]));
+
+        let metric_keys: Vec<i32> = vec![0; num_rows];
+        let metric_values = StringArray::from(vec!["cpu.usage", "memory.used"]);
+        let metric_name: ArrayRef = Arc::new(
+            DictionaryArray::<Int32Type>::try_new(
+                arrow::array::Int32Array::from(metric_keys),
+                Arc::new(metric_values),
+            )
+            .expect("test dict array"),
+        );
+        let metric_type: ArrayRef = Arc::new(UInt8Array::from(vec![0u8; num_rows]));
+        let timestamps: Vec<u64> = (0..num_rows as u64)
+            .map(|i| 1_700_000_000 + (num_rows as u64 - i))
+            .collect();
+        let timestamp_secs: ArrayRef = Arc::new(UInt64Array::from(timestamps));
+        let values: Vec<f64> = (0..num_rows).map(|i| i as f64).collect();
+        let value: ArrayRef = Arc::new(Float64Array::from(values));
+        let tsids: Vec<i64> = (0..num_rows as i64).map(|i| 1000 + i).collect();
+        let timeseries_id: ArrayRef = Arc::new(Int64Array::from(tsids));
+        let svc_keys: Vec<Option<i32>> = (0..num_rows as i32)
+            .map(|i| if i % 5 == 0 { None } else { Some(i % 3) })
+            .collect();
+        let svc_values = StringArray::from(vec!["api", "db", "cache"]);
+        let service: ArrayRef = Arc::new(
+            DictionaryArray::<Int32Type>::try_new(
+                arrow::array::Int32Array::from(svc_keys),
+                Arc::new(svc_values),
+            )
+            .expect("test dict array"),
+        );
+        let mut series_bytes: Vec<Vec<u8>> = Vec::with_capacity(num_rows);
+        for i in 0..num_rows as u64 {
+            let id = start_series_idx + i;
+            series_bytes.push(id.to_be_bytes().to_vec());
+        }
+        let series_refs: Vec<&[u8]> = series_bytes.iter().map(|v| v.as_slice()).collect();
+        let sorted_series: ArrayRef = Arc::new(BinaryArray::from(series_refs));
+
+        RecordBatch::try_new(
+            schema,
+            vec![
+                metric_name,
+                timestamp_secs,
+                sorted_series,
+                metric_type,
+                service,
+                timeseries_id,
+                value,
+            ],
+        )
+        .expect("test batch")
+    }
+
+    /// Write a fixture parquet file with the standard `qh.*` KVs that the
+    /// streaming merge engine validates.
+    fn write_input_parquet(batches: &[RecordBatch], extra_kvs: &[(&str, &str)]) -> Bytes {
+        let schema = batches[0].schema();
+        let cfg = ParquetWriterConfig {
+            compression: Compression::Snappy,
+            ..ParquetWriterConfig::default()
+        };
+        let sort_fields = "metric_name|-timestamp_secs/V2";
+        let sort_field_names = vec!["metric_name".to_string(), "timestamp_secs".to_string()];
+        let mut kvs = vec![
+            KeyValue::new(
+                PARQUET_META_SORT_FIELDS.to_string(),
+                sort_fields.to_string(),
+            ),
+            KeyValue::new(
+                PARQUET_META_WINDOW_START.to_string(),
+                "1700000000".to_string(),
+            ),
+            KeyValue::new(PARQUET_META_WINDOW_DURATION.to_string(), "60".to_string()),
+            KeyValue::new(PARQUET_META_NUM_MERGE_OPS.to_string(), "0".to_string()),
+        ];
+        for (k, v) in extra_kvs {
+            kvs.push(KeyValue::new(k.to_string(), v.to_string()));
+        }
+        let sorting_cols = vec![
+            parquet::file::metadata::SortingColumn {
+                column_idx: schema.index_of("metric_name").expect("test schema") as i32,
+                descending: false,
+                nulls_first: false,
+            },
+            parquet::file::metadata::SortingColumn {
+                column_idx: schema.index_of("timestamp_secs").expect("test schema") as i32,
+                descending: true,
+                nulls_first: false,
+            },
+        ];
+        let props: WriterProperties = cfg.to_writer_properties_with_metadata(
+            &schema,
+            sorting_cols,
+            Some(kvs),
+            &sort_field_names,
+        );
+        let mut buf: Vec<u8> = Vec::new();
+        let mut writer = ArrowWriter::try_new(&mut buf, schema, Some(props)).expect("arrow writer");
+        for b in batches {
+            writer.write(b).expect("test write");
+        }
+        writer.close().expect("test close");
+        Bytes::from(buf)
+    }
+
+    // -------- In-memory byte source --------
+
+    struct InMemorySource {
+        bytes: Bytes,
+    }
+
+    #[async_trait::async_trait]
+    impl RemoteByteSource for InMemorySource {
+        async fn file_size(&self, _path: &std::path::Path) -> std::io::Result<u64> {
+            Ok(self.bytes.len() as u64)
+        }
+        async fn get_slice(
+            &self,
+            _path: &std::path::Path,
+            range: std::ops::Range<u64>,
+        ) -> std::io::Result<Bytes> {
+            Ok(self.bytes.slice(range.start as usize..range.end as usize))
+        }
+        async fn get_slice_stream(
+            &self,
+            _path: &std::path::Path,
+            range: std::ops::Range<u64>,
+        ) -> std::io::Result<Box<dyn AsyncRead + Send + Unpin>> {
+            let slice = self.bytes.slice(range.start as usize..range.end as usize);
+            Ok(Box::new(std::io::Cursor::new(slice.to_vec())))
+        }
+    }
+
+    async fn open_stream(bytes: Bytes) -> Box<dyn ColumnPageStream> {
+        let source = Arc::new(InMemorySource { bytes });
+        let reader = StreamingParquetReader::try_open(source, PathBuf::from("test.parquet"))
+            .await
+            .expect("open reader");
+        Box::new(reader)
+    }
+
+    /// Read an output parquet file back into a single concatenated RecordBatch.
+    fn read_output_to_record_batch(path: &Path) -> RecordBatch {
+        let bytes = std::fs::read(path).expect("read output");
+        let builder = parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder::try_new(
+            Bytes::from(bytes),
+        )
+        .expect("read output builder");
+        let schema = builder.schema().clone();
+        let reader = builder.build().expect("read output build");
+        let batches: Vec<RecordBatch> = reader.collect::<Result<Vec<_>, _>>().expect("read output");
+        if batches.is_empty() {
+            RecordBatch::new_empty(schema)
+        } else {
+            arrow::compute::concat_batches(&schema, &batches).expect("concat")
+        }
+    }
+
+    fn merge_config(num_outputs: usize) -> MergeConfig {
+        MergeConfig {
+            num_outputs,
+            writer_config: ParquetWriterConfig {
+                compression: Compression::Snappy,
+                ..ParquetWriterConfig::default()
+            },
+        }
+    }
+
+    // -------- Tests --------
+
+    /// Two inputs → one output: row count and sort order preserved.
+    #[tokio::test]
+    async fn test_two_inputs_simple_merge() {
+        let batch_a = make_sorted_batch(50, 0);
+        let batch_b = make_sorted_batch(50, 50);
+        let bytes_a = write_input_parquet(std::slice::from_ref(&batch_a), &[]);
+        let bytes_b = write_input_parquet(std::slice::from_ref(&batch_b), &[]);
+
+        let inputs: Vec<Box<dyn ColumnPageStream>> =
+            vec![open_stream(bytes_a).await, open_stream(bytes_b).await];
+
+        let tmp = TempDir::new().expect("tmpdir");
+        let outputs = streaming_merge_sorted_parquet_files(inputs, tmp.path(), &merge_config(1))
+            .await
+            .expect("merge");
+        assert_eq!(outputs.len(), 1);
+        assert_eq!(outputs[0].num_rows, 100);
+
+        let merged = read_output_to_record_batch(&outputs[0].path);
+        assert_eq!(merged.num_rows(), 100);
+        let ss_array = merged.column(merged.schema().index_of("sorted_series").expect("col"));
+        let ss = ss_array
+            .as_any()
+            .downcast_ref::<BinaryArray>()
+            .expect("binary");
+        for i in 0..ss_array.len().saturating_sub(1) {
+            assert!(
+                ss.value(i) <= ss.value(i + 1),
+                "row {i}: sorted_series not ascending",
+            );
+        }
+    }
+
+    /// Single-metric_name input + num_outputs=1 → output is single row group.
+    #[tokio::test]
+    async fn test_output_is_single_row_group() {
+        let batch_a = make_sorted_batch(200, 0);
+        let bytes_a = write_input_parquet(std::slice::from_ref(&batch_a), &[]);
+        let inputs: Vec<Box<dyn ColumnPageStream>> = vec![open_stream(bytes_a).await];
+
+        let tmp = TempDir::new().expect("tmpdir");
+        let outputs = streaming_merge_sorted_parquet_files(inputs, tmp.path(), &merge_config(1))
+            .await
+            .expect("merge");
+        assert_eq!(outputs.len(), 1);
+
+        let bytes = std::fs::read(&outputs[0].path).expect("read");
+        let reader = SerializedFileReader::new(Bytes::from(bytes)).expect("ser reader");
+        assert_eq!(
+            reader.metadata().num_row_groups(),
+            1,
+            "single-metric_name single-output merge must produce single row group",
+        );
+    }
+
+    /// N inputs → M outputs: total row count preserved (MC-1).
+    #[tokio::test]
+    async fn test_total_rows_preserved() {
+        let batch_a = make_sorted_batch(75, 0);
+        let batch_b = make_sorted_batch(50, 100);
+        let batch_c = make_sorted_batch(25, 200);
+        let bytes_a = write_input_parquet(std::slice::from_ref(&batch_a), &[]);
+        let bytes_b = write_input_parquet(std::slice::from_ref(&batch_b), &[]);
+        let bytes_c = write_input_parquet(std::slice::from_ref(&batch_c), &[]);
+
+        let inputs: Vec<Box<dyn ColumnPageStream>> = vec![
+            open_stream(bytes_a).await,
+            open_stream(bytes_b).await,
+            open_stream(bytes_c).await,
+        ];
+        let tmp = TempDir::new().expect("tmpdir");
+        let outputs = streaming_merge_sorted_parquet_files(inputs, tmp.path(), &merge_config(2))
+            .await
+            .expect("merge");
+
+        let total: usize = outputs.iter().map(|o| o.num_rows).sum();
+        assert_eq!(total, 150);
+    }
+
+    /// Sort schema mismatch across inputs is rejected.
+    #[tokio::test]
+    async fn test_sort_schema_mismatch_rejected() {
+        let batch_a = make_sorted_batch(20, 0);
+        let bytes_a = write_input_parquet(std::slice::from_ref(&batch_a), &[]);
+
+        let cfg = ParquetWriterConfig {
+            compression: Compression::Snappy,
+            ..ParquetWriterConfig::default()
+        };
+        let kvs = vec![
+            KeyValue::new(
+                PARQUET_META_SORT_FIELDS.to_string(),
+                "service|-timestamp_secs/V2".to_string(),
+            ),
+            KeyValue::new(
+                PARQUET_META_WINDOW_START.to_string(),
+                "1700000000".to_string(),
+            ),
+            KeyValue::new(PARQUET_META_WINDOW_DURATION.to_string(), "60".to_string()),
+            KeyValue::new(PARQUET_META_NUM_MERGE_OPS.to_string(), "0".to_string()),
+        ];
+        let props: WriterProperties = cfg.to_writer_properties_with_metadata(
+            &batch_a.schema(),
+            Vec::new(),
+            Some(kvs),
+            &["service".to_string(), "timestamp_secs".to_string()],
+        );
+        let mut buf: Vec<u8> = Vec::new();
+        let mut writer =
+            ArrowWriter::try_new(&mut buf, batch_a.schema(), Some(props)).expect("arrow writer");
+        writer.write(&batch_a).expect("write");
+        writer.close().expect("close");
+        let bytes_b = Bytes::from(buf);
+
+        let inputs: Vec<Box<dyn ColumnPageStream>> =
+            vec![open_stream(bytes_a).await, open_stream(bytes_b).await];
+        let tmp = TempDir::new().expect("tmpdir");
+        let err = streaming_merge_sorted_parquet_files(inputs, tmp.path(), &merge_config(1))
+            .await
+            .expect_err("must reject mismatched sort schema");
+        let s = err.to_string();
+        assert!(
+            s.contains("sort schema mismatch"),
+            "expected 'sort schema mismatch', got: {s}",
+        );
+    }
+
+    /// qh.* KV metadata is propagated to the output; num_merge_ops increments.
+    #[tokio::test]
+    async fn test_kv_metadata_propagated_to_output() {
+        let batch_a = make_sorted_batch(40, 0);
+        let bytes_a = write_input_parquet(std::slice::from_ref(&batch_a), &[]);
+        let inputs: Vec<Box<dyn ColumnPageStream>> = vec![open_stream(bytes_a).await];
+
+        let tmp = TempDir::new().expect("tmpdir");
+        let outputs = streaming_merge_sorted_parquet_files(inputs, tmp.path(), &merge_config(1))
+            .await
+            .expect("merge");
+        let bytes = std::fs::read(&outputs[0].path).expect("read");
+        let reader = SerializedFileReader::new(Bytes::from(bytes)).expect("ser reader");
+        let kvs = reader
+            .metadata()
+            .file_metadata()
+            .key_value_metadata()
+            .cloned()
+            .unwrap_or_default();
+        let find = |k: &str| -> Option<String> {
+            kvs.iter()
+                .find(|kv| kv.key == k)
+                .and_then(|kv| kv.value.clone())
+        };
+        assert_eq!(
+            find(PARQUET_META_SORT_FIELDS).as_deref(),
+            Some("metric_name|-timestamp_secs/V2"),
+        );
+        assert_eq!(
+            find(PARQUET_META_WINDOW_START).as_deref(),
+            Some("1700000000")
+        );
+        assert_eq!(find(PARQUET_META_WINDOW_DURATION).as_deref(), Some("60"));
+        assert_eq!(
+            find(PARQUET_META_NUM_MERGE_OPS).as_deref(),
+            Some("1"),
+            "num_merge_ops must increment by 1 over input's max",
+        );
+    }
+
+    /// All-empty inputs produce no output.
+    #[tokio::test]
+    async fn test_all_empty_inputs_no_output() {
+        let empty = make_sorted_batch(0, 0);
+        let bytes = write_input_parquet(std::slice::from_ref(&empty), &[]);
+        let inputs: Vec<Box<dyn ColumnPageStream>> = vec![open_stream(bytes).await];
+
+        let tmp = TempDir::new().expect("tmpdir");
+        let outputs = streaming_merge_sorted_parquet_files(inputs, tmp.path(), &merge_config(1))
+            .await
+            .expect("merge");
+        assert!(outputs.is_empty());
+    }
+
+    /// The streaming engine's output can be drained back via the new
+    /// page-bounded decoder. End-to-end sanity check.
+    #[tokio::test]
+    async fn test_output_drainable_by_stream_decoder() {
+        let batch_a = make_sorted_batch(40, 0);
+        let batch_b = make_sorted_batch(40, 40);
+        let bytes_a = write_input_parquet(std::slice::from_ref(&batch_a), &[]);
+        let bytes_b = write_input_parquet(std::slice::from_ref(&batch_b), &[]);
+
+        let inputs: Vec<Box<dyn ColumnPageStream>> =
+            vec![open_stream(bytes_a).await, open_stream(bytes_b).await];
+        let tmp = TempDir::new().expect("tmpdir");
+        let outputs = streaming_merge_sorted_parquet_files(inputs, tmp.path(), &merge_config(1))
+            .await
+            .expect("merge");
+
+        let bytes = std::fs::read(&outputs[0].path).expect("read");
+        let mut output_stream = open_stream(Bytes::from(bytes)).await;
+        let mut decoder = StreamDecoder::new(&mut *output_stream);
+        let mut total_decoded = 0usize;
+        while let Some(page) = decoder.decode_next_page().await.expect("decode") {
+            // Count only sort col 0 (col_idx 0) pages to get a row count.
+            if page.col_idx == 0 {
+                total_decoded += page.array.len();
+            }
+        }
+        assert_eq!(total_decoded, 80);
+    }
+
+    /// Page-bounded contract sanity: with a row group large enough to
+    /// require many parquet pages per col, body col writes go through
+    /// the page-by-page assembler instead of materialising column
+    /// chunks. We can't directly observe peak memory from a test, but
+    /// we *can* assert that the merge completes correctly with input
+    /// data whose body cols span many pages, and that the output is
+    /// itself multi-page (no whole-column buffering happened on the
+    /// output side either).
+    #[tokio::test]
+    async fn test_body_col_streams_many_pages_per_column_chunk() {
+        // Force multiple pages per column chunk by setting a small
+        // data_page_row_count_limit. With 8000 rows and a 1000-row
+        // page limit, the output value col chunk should span about 8
+        // pages; the assertion below only requires at least 2.
+        let batch = make_sorted_batch(8000, 0);
+        let bytes = write_input_parquet(std::slice::from_ref(&batch), &[]);
+        let inputs: Vec<Box<dyn ColumnPageStream>> = vec![open_stream(bytes).await];
+
+        let writer_config = ParquetWriterConfig {
+            compression: Compression::Snappy,
+            data_page_row_count_limit: 1000,
+            ..ParquetWriterConfig::default()
+        };
+        let config = MergeConfig {
+            num_outputs: 1,
+            writer_config,
+        };
+
+        let tmp = TempDir::new().expect("tmpdir");
+        let outputs = streaming_merge_sorted_parquet_files(inputs, tmp.path(), &config)
+            .await
+            .expect("merge");
+        assert_eq!(outputs.len(), 1);
+        assert_eq!(outputs[0].num_rows, 8000);
+
+        // Verify the output is itself multi-page-per-column (which is
+        // what page-bounded writing should produce, given the
+        // data_page_row_count_limit set above). Read via the
+        // page-bounded decoder and count pages for the value column.
+        let out_bytes = std::fs::read(&outputs[0].path).expect("read");
+        let mut output_stream = open_stream(Bytes::from(out_bytes)).await;
+        // Find the "value" col index in the output's arrow schema BEFORE
+        // borrowing output_stream mutably for the decoder.
+        let arrow_schema = parquet::arrow::parquet_to_arrow_schema(
+            output_stream.metadata().file_metadata().schema_descr(),
+            None,
+        )
+        .expect("arrow schema");
+        let value_col_idx = arrow_schema.index_of("value").expect("value col");
+        let mut decoder = StreamDecoder::new(&mut *output_stream);
+
+        let mut value_pages = 0;
+        while let Some(page) = decoder.decode_next_page().await.expect("decode") {
+            if page.col_idx == value_col_idx {
+                value_pages += 1;
+            }
+        }
+        assert!(
+            value_pages >= 2,
+            "expected output 'value' col to span multiple pages (got {value_pages}); body col \
+             writes should respect data_page_row_count_limit",
+        );
+    }
+
+    /// Multi-RG metric-aligned input (`prefix_len >= 1`) is accepted
+    /// and produces multi-RG output: one output RG per input metric_name
+    /// region.
+    #[tokio::test]
+    async fn test_multi_rg_metric_aligned_input_produces_multi_rg_output() {
+        // Build a fixture with 2 metric_names → 2 RGs each holding one
+        // metric_name. Use `prefix_len = 1` to declare metric_name
+        // alignment.
+        let bytes = make_two_metric_aligned_input();
+        let inputs: Vec<Box<dyn ColumnPageStream>> = vec![open_stream(bytes).await];
+
+        let tmp = TempDir::new().expect("tmpdir");
+        let outputs = streaming_merge_sorted_parquet_files(inputs, tmp.path(), &merge_config(1))
+            .await
+            .expect("merge multi-RG metric-aligned input");
+        assert_eq!(outputs.len(), 1, "expected one output file");
+        assert_eq!(outputs[0].num_rows, 60, "30 + 30 rows");
+
+        let out_bytes = std::fs::read(&outputs[0].path).expect("read");
+        let reader = SerializedFileReader::new(Bytes::from(out_bytes)).expect("ser");
+        assert_eq!(
+            reader.metadata().num_row_groups(),
+            2,
+            "multi-RG metric-aligned input must produce multi-RG output (one RG per metric_name \
+             region)",
+        );
+
+        // `MergeOutputFile.num_row_groups` must agree with the file
+        // on disk. Before the fix it was hard-coded to 1; this
+        // assertion catches that regression on a multi-region output.
+        assert_eq!(
+            outputs[0].num_row_groups, 2,
+            "MergeOutputFile.num_row_groups should match physical row group count",
+        );
+
+        // F2 chunk-level verification: confirm each output RG actually
+        // carries a single distinct metric_name (PA-1 + PA-3 read
+        // straight off the column-chunk statistics).
+        assert_unique_rg_prefix_keys(
+            reader.metadata(),
+            "metric_name|-timestamp_secs/V2",
+            1,
+            "test_multi_rg_metric_aligned_input_produces_multi_rg_output output",
+        )
+        .expect("streaming engine output must satisfy PA-1 + PA-3 on metric_name");
+    }
+
+    /// Regression for Codex P2 on PR-6410: a streaming merge output
+    /// whose multi-region content lands in a single file must report
+    /// `MergeOutputFile.num_row_groups` consistent with the parquet
+    /// footer. Two regions assigned to one output should yield
+    /// `num_row_groups = 2`.
+    #[tokio::test]
+    async fn test_streaming_output_num_row_groups_matches_footer() {
+        let bytes = make_two_metric_aligned_input();
+        let inputs: Vec<Box<dyn ColumnPageStream>> = vec![open_stream(bytes).await];
+
+        let tmp = TempDir::new().expect("tmpdir");
+        let outputs = streaming_merge_sorted_parquet_files(inputs, tmp.path(), &merge_config(1))
+            .await
+            .expect("merge");
+
+        let bytes_out = std::fs::read(&outputs[0].path).expect("read");
+        let reader = SerializedFileReader::new(Bytes::from(bytes_out)).expect("ser");
+        assert_eq!(
+            outputs[0].num_row_groups,
+            reader.metadata().num_row_groups(),
+            "MergeOutputFile.num_row_groups must match the physical RG count",
+        );
+    }
+
+    /// Regression for Codex P2 on PR-6410: `qh.row_keys` and
+    /// `qh.zonemap_regexes` must be written into the on-disk Parquet
+    /// footer for every streaming output. Per-output values come from
+    /// the rows that landed in that output specifically — merging
+    /// eliminates key overlap between outputs, so this metadata can't
+    /// be carried over from inputs.
+    #[tokio::test]
+    async fn test_streaming_output_kv_footer_contains_row_keys_and_zonemap() {
+        let bytes = make_two_metric_aligned_input();
+        let inputs: Vec<Box<dyn ColumnPageStream>> = vec![open_stream(bytes).await];
+
+        let tmp = TempDir::new().expect("tmpdir");
+        let outputs = streaming_merge_sorted_parquet_files(inputs, tmp.path(), &merge_config(1))
+            .await
+            .expect("merge");
+
+        let bytes_out = std::fs::read(&outputs[0].path).expect("read");
+        let reader = SerializedFileReader::new(Bytes::from(bytes_out)).expect("ser");
+        let kvs: Vec<(String, Option<String>)> = reader
+            .metadata()
+            .file_metadata()
+            .key_value_metadata()
+            .map(|v| {
+                v.iter()
+                    .map(|kv| (kv.key.clone(), kv.value.clone()))
+                    .collect()
+            })
+            .unwrap_or_default();
+        let find = |k: &str| kvs.iter().find(|(key, _)| key == k).map(|(_, v)| v.clone());
+
+        // row_keys: streaming merge derives them from THIS output's
+        // sort cols, and the in-memory `MergeOutputFile` records the
+        // proto bytes. The base64-encoded KV in the footer must
+        // round-trip to the same bytes.
+        assert!(
+            outputs[0].row_keys_proto.is_some(),
+            "expected per-output row_keys_proto in MergeOutputFile",
+        );
+        let row_keys_kv = find(PARQUET_META_ROW_KEYS).flatten().expect(
+            "qh.row_keys missing from streaming output footer — appended-after-write KV metadata \
+             did not survive close",
+        );
+        let decoded = BASE64.decode(row_keys_kv).expect("base64 decode");
+        assert_eq!(
+            &decoded,
+            outputs[0].row_keys_proto.as_ref().unwrap(),
+            "footer row_keys bytes must equal MergeOutputFile.row_keys_proto",
+        );
+
+        // zonemap_regexes: footer carries a JSON-encoded map.
+        // metric_name alignment + multiple metric_names → non-empty.
+        assert!(
+            !outputs[0].zonemap_regexes.is_empty(),
+            "expected non-empty zonemap_regexes for multi-metric output",
+        );
+        let zonemap_kv = find(PARQUET_META_ZONEMAP_REGEXES)
+            .flatten()
+            .expect("qh.zonemap_regexes missing from streaming output footer");
+        let parsed: HashMap<String, String> =
+            serde_json::from_str(&zonemap_kv).expect("zonemap JSON parse");
+        assert_eq!(
+            parsed, outputs[0].zonemap_regexes,
+            "footer zonemap must equal MergeOutputFile.zonemap_regexes",
+        );
+    }
+
+    /// Composite-key extraction with two byte-array prefix columns:
+    /// `(metric_name, service)` ASC/ASC. Two RGs with the same
+    /// metric_name but different services must produce distinct
+    /// composite keys, with byte-lex order matching the natural
+    /// `(metric_name, service)` sort order.
+    #[test]
+    fn test_extract_rg_composite_prefix_key_two_byte_array_cols() {
+        let bytes =
+            make_prefix_len_two_input(&[("cpu.usage", "host-a"), ("cpu.usage", "host-b")], 10);
+        let reader = SerializedFileReader::new(bytes).expect("ser");
+        let metadata = reader.metadata();
+        assert_eq!(metadata.num_row_groups(), 2);
+
+        let prefix_cols = find_prefix_parquet_col_indices(
+            metadata,
+            "metric_name|service|-timestamp_secs/V2",
+            2,
+            0,
+        )
+        .expect("resolve");
+        let key_rg0 = extract_rg_composite_prefix_key(metadata, 0, &prefix_cols, 0).expect("rg0");
+        let key_rg1 = extract_rg_composite_prefix_key(metadata, 1, &prefix_cols, 0).expect("rg1");
+        assert_ne!(
+            key_rg0, key_rg1,
+            "different services → different composite keys"
+        );
+        // metric_name is equal across RGs; ordering must come from
+        // service ('host-a' < 'host-b' lex), so key_rg0 < key_rg1.
+        assert!(
+            key_rg0 < key_rg1,
+            "composite key for ('cpu.usage', 'host-a') must lex-sort before ('cpu.usage', \
+             'host-b')",
+        );
+
+        // Encoded representation under the storekey-based encoding
+        // (shared with `sorted_series` via `append_prefix_col_to_key`):
+        // each prefix column contributes `storekey(u8 ord) ||
+        // storekey(str value)`, then the whole key ends with a
+        // `u8(prefix_len)` sentinel ordinal (so an all-null RG's
+        // empty body still sorts after any non-null key — see
+        // `extract_rg_composite_prefix_key` for the argument).
+        //
+        //   [0x00]              ord=0 (metric_name)
+        //   b"cpu.usage" + 0x00 storekey("cpu.usage") — 10 bytes
+        //   [0x01]              ord=1 (service)
+        //   b"host-a"    + 0x00 storekey("host-a")    —  7 bytes
+        //   [0x02]              sentinel u8(prefix_len)
+        //
+        // Total = 1 + 10 + 1 + 7 + 1 = 20 bytes.
+        assert_eq!(key_rg0[0], 0x00);
+        assert_eq!(&key_rg0[1..10], b"cpu.usage");
+        assert_eq!(key_rg0[10], 0x00);
+        assert_eq!(key_rg0[11], 0x01);
+        assert_eq!(&key_rg0[12..18], b"host-a");
+        assert_eq!(key_rg0[18], 0x00);
+        assert_eq!(key_rg0[19], 0x02);
+        assert_eq!(key_rg0.len(), 20);
+    }
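+
+    /// Illustrative sketch, not the production encoder: rebuilds by
+    /// hand the byte layout documented in
+    /// `test_extract_rg_composite_prefix_key_two_byte_array_cols`.
+    /// `sketch_composite_key` is a hypothetical helper. It assumes
+    /// each present column contributes `u8(ord) || bytes || 0x00`,
+    /// that values contain no embedded NUL bytes, that a null column
+    /// is skipped, and that the key ends with `u8(prefix_len)`.
+    #[test]
+    fn sketch_composite_prefix_key_layout_and_ordering() {
+        fn sketch_composite_key(cols: &[Option<&str>]) -> Vec<u8> {
+            let mut key = Vec::new();
+            for (ord, col) in cols.iter().enumerate() {
+                // A null column contributes nothing (column-skip).
+                if let Some(value) = col {
+                    key.push(ord as u8);
+                    key.extend_from_slice(value.as_bytes());
+                    key.push(0x00);
+                }
+            }
+            // Trailing sentinel: u8(prefix_len).
+            key.push(cols.len() as u8);
+            key
+        }
+
+        let host_a = sketch_composite_key(&[Some("cpu.usage"), Some("host-a")]);
+        let host_b = sketch_composite_key(&[Some("cpu.usage"), Some("host-b")]);
+        let all_null = sketch_composite_key(&[None, None]);
+
+        // 1 + 10 + 1 + 7 + 1 = 20 bytes, matching the documented layout.
+        assert_eq!(host_a.len(), 20);
+        // `Vec<u8>` compares lexicographically, so the service bytes
+        // decide the order once metric_name ties.
+        assert!(host_a < host_b);
+        // An all-null key is just the sentinel (0x02 here), which is
+        // larger than the leading ordinal 0x00 of any non-null key.
+        assert_eq!(all_null, vec![0x02]);
+        assert!(host_b < all_null);
+    }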
+
+    /// Regression for Codex P1 on PR-6410 (positive coverage of the
+    /// fix): `rg_partition_prefix_len = 2` groups RGs by the
+    /// composite (metric_name, service) value, producing one output
+    /// row group per (metric_name, service) pair. Two RGs that share
+    /// metric_name but differ in service must NOT be folded into one
+    /// region.
+    #[tokio::test]
+    async fn test_streaming_merge_with_prefix_len_two() {
+        let bytes = make_prefix_len_two_input(
+            &[
+                ("cpu.usage", "host-a"),
+                ("cpu.usage", "host-b"),
+                ("memory.used", "host-a"),
+            ],
+            20,
+        );
+        let inputs: Vec<Box<dyn ColumnPageStream>> = vec![open_stream(bytes).await];
+
+        let tmp = TempDir::new().expect("tmpdir");
+        let outputs = streaming_merge_sorted_parquet_files(inputs, tmp.path(), &merge_config(1))
+            .await
+            .expect("merge with prefix_len = 2");
+        assert_eq!(outputs.len(), 1, "expected one output file");
+        assert_eq!(outputs[0].num_rows, 60, "20 × 3 input RGs = 60 rows");
+
+        let bytes_out = std::fs::read(&outputs[0].path).expect("read");
+        let reader = SerializedFileReader::new(Bytes::from(bytes_out)).expect("ser");
+        // Three distinct (metric_name, service) tuples → three output
+        // row groups. With prefix_len truncated to 1 (the pre-fix
+        // bug) the two cpu.usage RGs would have folded into one
+        // region and only two output RGs would be written.
+        assert_eq!(
+            reader.metadata().num_row_groups(),
+            3,
+            "three distinct (metric_name, service) pairs must produce three output RGs",
+        );
+        assert_eq!(outputs[0].num_row_groups, 3);
+
+        // F2 chunk-level verification: counting RGs and stamping a KV
+        // is not enough — the OUTPUT's row groups must actually be
+        // aligned on the composite (metric_name, service) prefix.
+        // `assert_unique_rg_prefix_keys` enforces PA-1 (intra-RG
+        // constancy) + PA-3 (inter-RG uniqueness) by reading the
+        // chunk-level statistics.
+        assert_unique_rg_prefix_keys(
+            reader.metadata(),
+            "metric_name|service|-timestamp_secs/V2",
+            2,
+            "test_streaming_merge_with_prefix_len_two output",
+        )
+        .expect("streaming engine output must satisfy PA-1 + PA-3 on the prefix columns");
+    }
+
+    /// Regression for Codex finding #1 on PR-6410: when one input
+    /// file has two RGs sharing the same composite prefix key, the
+    /// streaming engine must reject up-front. Without the check,
+    /// `process_region` keys `sort_col_batches` by input_idx, so the
+    /// second RG silently overwrites the first while
+    /// `Region::total_rows` keeps counting both — rows would be
+    /// dropped and the body-col / sort-col mapping would be off by a
+    /// full RG.
+    ///
+    /// The fixture passes `("cpu.usage", "host-a")` twice, producing
+    /// an input with two RGs that have identical (metric_name,
+    /// service) statistics. The merge must bail with a clear error
+    /// pointing at the duplicate.
+    #[tokio::test]
+    async fn test_streaming_merge_rejects_duplicate_prefix_rgs_in_one_input() {
+        let bytes = make_prefix_len_two_input(
+            &[
+                ("cpu.usage", "host-a"),
+                ("cpu.usage", "host-a"), // duplicate prefix key
+                ("memory.used", "host-a"),
+            ],
+            20,
+        );
+        let inputs: Vec<Box<dyn ColumnPageStream>> = vec![open_stream(bytes).await];
+
+        let tmp = TempDir::new().expect("tmpdir");
+        let err = streaming_merge_sorted_parquet_files(inputs, tmp.path(), &merge_config(1))
+            .await
+            .expect_err("must reject input with duplicate prefix RGs");
+        let s = err.to_string();
+        assert!(
+            s.contains("sharing a prefix key with an earlier RG"),
+            "expected duplicate-prefix error, got: {s}",
+        );
+        assert!(
+            s.contains("input 0"),
+            "error should identify which input is bad, got: {s}",
+        );
+    }
+
+    /// Build a single-RG parquet file with a nullable `metric_name`
+    /// prefix column. `values` is one entry per row (`None` for a
+    /// null cell). `prefix_len` is stamped in the file metadata so
+    /// the streaming merge treats the file as prefix-aligned.
+    fn make_nullable_prefix_input_single_rg(values: &[Option<&str>], prefix_len: u32) -> Bytes {
+        let schema = Arc::new(ArrowSchema::new(vec![
+            Field::new("metric_name", DataType::Utf8, true),
+            Field::new("timestamp_secs", DataType::UInt64, false),
+            Field::new("sorted_series", DataType::Binary, false),
+            Field::new("value", DataType::Float64, false),
+        ]));
+        let n = values.len();
+        let metric_name: ArrayRef = Arc::new(StringArray::from(
+            values
+                .iter()
+                .map(|v| v.map(str::to_string))
+                .collect::<Vec<_>>(),
+        ));
+        let timestamps: Vec<u64> = (0..n as u64).map(|i| 1_700_000_000 + i).collect();
+        let timestamp_secs: ArrayRef = Arc::new(UInt64Array::from(timestamps));
+        let series_bytes: Vec<Vec<u8>> = (0..n as u64).map(|i| i.to_be_bytes().to_vec()).collect();
+        let series_refs: Vec<&[u8]> = series_bytes.iter().map(|v| v.as_slice()).collect();
+        let sorted_series: ArrayRef = Arc::new(BinaryArray::from(series_refs));
+        let value: ArrayRef = Arc::new(Float64Array::from(
+            (0..n).map(|i| i as f64).collect::<Vec<_>>(),
+        ));
+        let batch = RecordBatch::try_new(
+            schema.clone(),
+            vec![metric_name, timestamp_secs, sorted_series, value],
+        )
+        .expect("test batch");
+
+        let cfg = ParquetWriterConfig {
+            compression: Compression::Snappy,
+            row_group_size: n.max(1),
+            ..ParquetWriterConfig::default()
+        };
+        let kvs = vec![
+            KeyValue::new(
+                PARQUET_META_SORT_FIELDS.to_string(),
+                "metric_name|-timestamp_secs/V2".to_string(),
+            ),
+            KeyValue::new(
+                PARQUET_META_WINDOW_START.to_string(),
+                "1700000000".to_string(),
+            ),
+            KeyValue::new(PARQUET_META_WINDOW_DURATION.to_string(), "60".to_string()),
+            KeyValue::new(PARQUET_META_NUM_MERGE_OPS.to_string(), "0".to_string()),
+            KeyValue::new(
+                PARQUET_META_RG_PARTITION_PREFIX_LEN.to_string(),
+                prefix_len.to_string(),
+            ),
+        ];
+        let sorting_cols = vec![parquet::file::metadata::SortingColumn {
+            column_idx: 0,
+            descending: false,
+            nulls_first: false,
+        }];
+        let props: WriterProperties = cfg.to_writer_properties_with_metadata(
+            &schema,
+            sorting_cols,
+            Some(kvs),
+            &["metric_name".to_string()],
+        );
+        let mut buf: Vec<u8> = Vec::new();
+        let mut writer = ArrowWriter::try_new(&mut buf, schema, Some(props)).expect("arrow writer");
+        writer.write(&batch).expect("write");
+        writer.close().expect("close");
+        Bytes::from(buf)
+    }
+
+    /// Regression for Codex P1 on PR #6424 (null prefix handling): an
+    /// RG with mixed null + non-null cells on a nullable prefix
+    /// column is NOT prefix-aligned (PA-1). The legacy `min == max`
+    /// check would silently accept it because Parquet stats hide
+    /// nulls from min/max, so a single non-null cell + N nulls
+    /// reports `min == max == non_null_value`. The fix also reads
+    /// `null_count` from stats and bails when nulls coexist with
+    /// non-null values (an all-null RG is still accepted).
+    #[tokio::test]
+    async fn test_mixed_null_and_value_prefix_rg_rejected() {
+        // 1 RG, 4 rows: 3× "cpu.usage" + 1× null. Stats record
+        // `min == max == "cpu.usage"`, `null_count == 1`.
+        let bytes = make_nullable_prefix_input_single_rg(
+            &[
+                Some("cpu.usage"),
+                Some("cpu.usage"),
+                None,
+                Some("cpu.usage"),
+            ],
+            1,
+        );
+        let inputs: Vec<Box<dyn ColumnPageStream>> = vec![open_stream(bytes).await];
+
+        let tmp = TempDir::new().expect("tmpdir");
+        let err = streaming_merge_sorted_parquet_files(inputs, tmp.path(), &merge_config(1))
+            .await
+            .expect_err("mixed null + non-null must be rejected");
+        let s = format!("{err:#}");
+        assert!(
+            s.contains("NOT prefix-aligned") && s.contains("nulls plus"),
+            "expected PA-1 mixed-null error, got: {s}",
+        );
+    }
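+
+    /// Illustrative sketch of why `min == max` alone cannot prove
+    /// PA-1: Parquet min/max statistics cover non-null values only.
+    /// `sketch_stats` is a hypothetical stand-in for column-chunk
+    /// statistics, not the `parquet` crate API.
+    #[test]
+    fn sketch_min_max_stats_hide_nulls() {
+        fn sketch_stats<'a>(
+            cells: &[Option<&'a str>],
+        ) -> (Option<&'a str>, Option<&'a str>, usize) {
+            // `flatten` drops the `None` cells, mirroring how nulls
+            // are excluded from min/max.
+            let min = cells.iter().flatten().copied().min();
+            let max = cells.iter().flatten().copied().max();
+            let null_count = cells.iter().filter(|c| c.is_none()).count();
+            (min, max, null_count)
+        }
+
+        let mixed = [Some("cpu.usage"), Some("cpu.usage"), None, Some("cpu.usage")];
+        let (min, max, null_count) = sketch_stats(&mixed);
+        // The min == max check passes even though one row is null.
+        assert_eq!(min, max);
+        assert_eq!(min, Some("cpu.usage"));
+        // Only null_count reveals that the RG is mixed.
+        assert_eq!(null_count, 1);
+    }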
+
+    /// Regression for Codex P1 on PR #6424 (null prefix handling):
+    /// an all-null prefix RG is logically aligned (its single prefix
+    /// "value" is null) and must successfully merge. The
+    /// composite-prefix encoding skips the column entirely for the
+    /// all-null RG; in a multi-input setup that mixes non-null and
+    /// all-null contributions on the same prefix column, BTreeMap
+    /// iteration order puts the all-null region AFTER any
+    /// non-null-prefix region (nulls-last), matching `sorted_series`'s
+    /// row-level null-skip convention.
+    #[tokio::test]
+    async fn test_all_null_prefix_rg_groups_into_separate_region_sorted_last() {
+        // Input A: one RG, metric_name = "cpu.usage" for all rows.
+        // Input B: one RG, metric_name = NULL for all rows.
+        // prefix_len = 1, merge into a single output. Each input
+        // contributes its own region; the all-null region should
+        // sort after the non-null region, so the merged output's
+        // RG 0 carries the non-null metric and RG 1 carries the
+        // all-null one.
+        let bytes_a = make_nullable_prefix_input_single_rg(
+            &[Some("cpu.usage"), Some("cpu.usage"), Some("cpu.usage")],
+            1,
+        );
+        let bytes_b = make_nullable_prefix_input_single_rg(&[None, None, None], 1);
+        let inputs: Vec<Box<dyn ColumnPageStream>> =
+            vec![open_stream(bytes_a).await, open_stream(bytes_b).await];
+
+        let tmp = TempDir::new().expect("tmpdir");
+        let outputs = streaming_merge_sorted_parquet_files(inputs, tmp.path(), &merge_config(1))
+            .await
+            .expect("all-null prefix RG must merge cleanly via column-skip encoding");
+        assert_eq!(outputs.len(), 1);
+        assert_eq!(outputs[0].num_rows, 6, "3 non-null + 3 null = 6 rows total");
+
+        let bytes_out = std::fs::read(&outputs[0].path).expect("read");
+        let reader = SerializedFileReader::new(Bytes::from(bytes_out)).expect("ser");
+        assert_eq!(
+            reader.metadata().num_row_groups(),
+            2,
+            "two regions (non-null + all-null) → two output RGs",
+        );
+
+        // Read the output back as Arrow and confirm RG 0 is the
+        // non-null region (rows have `metric_name = "cpu.usage"`)
+        // and RG 1 is the all-null region (rows have
+        // `metric_name = NULL`). This pins the nulls-last ordering
+        // produced by the composite-prefix's column-skip encoding.
+        let combined = read_output_to_record_batch(&outputs[0].path);
+        let mn_idx = combined.schema().index_of("metric_name").expect("mn col");
+        let arr = combined
+            .column(mn_idx)
+            .as_any()
+            .downcast_ref::<StringArray>();
+        let arr = arr.expect("metric_name should decode as StringArray");
+        // First 3 rows are the non-null region, last 3 are all-null.
+        for i in 0..3 {
+            assert!(arr.is_valid(i), "row {i} should be non-null");
+            assert_eq!(arr.value(i), "cpu.usage");
+        }
+        for i in 3..6 {
+            assert!(
+                arr.is_null(i),
+                "row {i} should be null (all-null region sorts last)"
+            );
+        }
+    }
+
+    /// Regression for Codex P1 on PR #6424: when a top-level region
+    /// exactly fills the current output (so `remaining_in_current == 0`)
+    /// AND the following prefix-aligned region needs splitting, the
+    /// split's first-sub-region budget must be computed against the
+    /// rolled-over fresh output's full target — NOT the stale zero
+    /// remainder. Before the fix, `split_region_at_sorted_series` cut
+    /// after the first sorted_series run, producing a tiny leftover
+    /// plus a large continuation that both shared the parent region's
+    /// prefix key and landed in the same new output, tripping the
+    /// PA-3 duplicate-prefix-RG check in `finalize_output`.
+    ///
+    /// Setup: three RGs of 50 rows each, distinct (metric, service)
+    /// prefixes, num_outputs = 3 → target_per_output = 50. Each region
+    /// should land cleanly in its own output with a single RG.
+    #[tokio::test]
+    async fn test_region_exactly_fills_output_does_not_split_next_aligned_region() {
+        let bytes = make_prefix_len_two_input(
+            &[
+                ("cpu.usage", "host-a"),
+                ("cpu.usage", "host-b"),
+                ("memory.used", "host-a"),
+            ],
+            50,
+        );
+        let inputs: Vec<Box<dyn ColumnPageStream>> = vec![open_stream(bytes).await];
+
+        let tmp = TempDir::new().expect("tmpdir");
+        let outputs = streaming_merge_sorted_parquet_files(inputs, tmp.path(), &merge_config(3))
+            .await
+            .expect("merge must succeed; pre-fix this tripped PA-3 on output 1");
+
+        assert_eq!(outputs.len(), 3, "three regions → three outputs");
+        for (i, out) in outputs.iter().enumerate() {
+            assert_eq!(out.num_rows, 50, "output {i} should have exactly 50 rows");
+            assert_eq!(
+                out.num_row_groups, 1,
+                "output {i} should have a single RG (no spurious split)",
+            );
+        }
+
+        // PA-3 verification: each output's single RG carries a
+        // unique composite prefix key.
+        for (i, out) in outputs.iter().enumerate() {
+            let bytes_out = std::fs::read(&out.path).expect("read");
+            let reader = SerializedFileReader::new(Bytes::from(bytes_out)).expect("ser");
+            assert_unique_rg_prefix_keys(
+                reader.metadata(),
+                "metric_name|service|-timestamp_secs/V2",
+                2,
+                &format!("output {i}"),
+            )
+            .expect("each output RG must satisfy PA-1 + PA-3 on prefix");
+        }
+    }
+
+    /// Regression for Codex P1 on PR #6428: companion to
+    /// `test_region_exactly_fills_output_does_not_split_next_aligned_region`.
+    /// The earlier fix handled the split *decision* (use the
+    /// rolled-over output's full budget) but missed the split
+    /// *assignment* loop in `process_split_region_col_outer`'s setup: it
+    /// initialized `active_output_idx` / `active_rows` from the stale
+    /// `current_output_idx` / `current_output_rows`. The inner
+    /// `needs_new_writer` check guards on `!chunk_assignments.is_empty()`,
+    /// so the first chunk never rolled over — it was appended to the
+    /// already-full output, doubling its rows and shrinking the total
+    /// output count.
+    ///
+    /// Setup: prefix_len = 1, two metrics with very different sizes
+    /// (200 rows + 400 rows = 600 total). `num_outputs = 3` →
+    /// `target_per_output = 200`. Region A (metric_name = "aaa.alpha")
+    /// fills output 0 exactly. Region B (metric_name = "bbb.beta") needs
+    /// splitting: 400 rows > the rolled-over output's 200 budget.
+    /// Pre-fix the merge produced 2 outputs of 400 + 200 (output 0
+    /// overfilled, output 2 empty). Post-fix the merge produces 3
+    /// outputs of 200 + 200 + 200.
+    #[tokio::test]
+    async fn test_split_chunk_assignment_rolls_over_before_first_chunk() {
+        let bytes = make_prefix_len_one_input(&[("aaa.alpha", 200), ("bbb.beta", 400)]);
+        let inputs: Vec<Box<dyn ColumnPageStream>> = vec![open_stream(bytes).await];
+
+        let tmp = TempDir::new().expect("tmpdir");
+        let outputs = streaming_merge_sorted_parquet_files(inputs, tmp.path(), &merge_config(3))
+            .await
+            .expect("merge must succeed");
+
+        // Pre-fix: outputs.len() == 2 (output 0 = 400 rows, output 1 =
+        // 200 rows). Post-fix: 3 outputs of ~200 rows each.
+        assert_eq!(
+            outputs.len(),
+            3,
+            "expected 3 outputs (one per target), got {} — chunk assignment didn't roll over \
+             before the first chunk",
+            outputs.len(),
+        );
+        let total_rows: usize = outputs.iter().map(|o| o.num_rows).sum();
+        assert_eq!(total_rows, 600, "row conservation");
+        for (i, out) in outputs.iter().enumerate() {
+            // All-non-empty: pre-fix output 2 was empty (the loop
+            // assigned both region-B sub-chunks within outputs 0+1).
+            assert!(
+                out.num_rows > 0,
+                "output {i} should be non-empty post-fix; got num_rows = {}",
+                out.num_rows,
+            );
+            // No output exceeds the target by more than one full
+            // sub-region. Pre-fix output 0 was 400 rows (2× target).
+            assert!(
+                out.num_rows <= 250,
+                "output {i} should not exceed target by more than one sub-region; got {}",
+                out.num_rows,
+            );
+        }
+    }
+
+    /// Build a single-RG fixture with multiple metric_names sorted
+    /// by metric_name then timestamp. The file has
+    /// `rg_partition_prefix_len = 0` (legacy) so the streaming merge
+    /// synthesizes prefix-aligned regions during the merge.
+    ///
+    /// `metrics` = list of (metric_name, num_rows). Rows are emitted
+    /// in order so the resulting batch is already sorted by
+    /// metric_name; each row gets a unique sorted_series identifier
+    /// derived from its position so the k-way merge has a
+    /// well-defined order even when other tag dimensions are degenerate.
+    fn make_multi_metric_single_rg_input(metrics: &[(&str, usize)]) -> Bytes {
+        let dict_type = DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8));
+        let schema = Arc::new(ArrowSchema::new(vec![
+            Field::new("metric_name", dict_type.clone(), false),
+            Field::new("timestamp_secs", DataType::UInt64, false),
+            Field::new("sorted_series", DataType::Binary, false),
+            Field::new("metric_type", DataType::UInt8, false),
+            Field::new("service", dict_type, true),
+            Field::new("timeseries_id", DataType::Int64, false),
+            Field::new("value", DataType::Float64, false),
+        ]));
+
+        let total_rows: usize = metrics.iter().map(|(_, n)| *n).sum();
+        let mut metric_keys: Vec<i32> = Vec::with_capacity(total_rows);
+        let mut metric_values_vec: Vec<&str> = Vec::with_capacity(metrics.len());
+        let mut timestamps: Vec<u64> = Vec::with_capacity(total_rows);
+        let mut series_bytes: Vec<Vec<u8>> = Vec::with_capacity(total_rows);
+        let mut tsids: Vec<i64> = Vec::with_capacity(total_rows);
+        let mut values: Vec<f64> = Vec::with_capacity(total_rows);
+        let mut row_idx: u64 = 0;
+        for (metric_idx, (name, num)) in metrics.iter().enumerate() {
+            metric_values_vec.push(name);
+            // -timestamp_secs/V2 in the sort schema means timestamps
+            // DESC within a metric.
+            for r in 0..*num {
+                metric_keys.push(metric_idx as i32);
+                timestamps.push(1_700_000_000 + ((*num - r) as u64));
+                series_bytes.push(row_idx.to_be_bytes().to_vec());
+                tsids.push(1000 + row_idx as i64);
+                values.push(row_idx as f64);
+                row_idx += 1;
+            }
+        }
+        let metric_names_arr = StringArray::from(metric_values_vec);
+        let metric_name: ArrayRef = Arc::new(
+            DictionaryArray::<Int32Type>::try_new(
+                arrow::array::Int32Array::from(metric_keys),
+                Arc::new(metric_names_arr),
+            )
+            .expect("dict array"),
+        );
+        let timestamp_secs: ArrayRef = Arc::new(UInt64Array::from(timestamps));
+        let series_refs: Vec<&[u8]> = series_bytes.iter().map(|v| v.as_slice()).collect();
+        let sorted_series: ArrayRef = Arc::new(BinaryArray::from(series_refs));
+        let metric_type: ArrayRef = Arc::new(UInt8Array::from(vec![0u8; total_rows]));
+        let service: ArrayRef = {
+            let svc_keys: Vec<Option<i32>> = (0..total_rows as i32).map(|i| Some(i % 3)).collect();
+            let svc_values = StringArray::from(vec!["api", "db", "cache"]);
+            Arc::new(
+                DictionaryArray::<Int32Type>::try_new(
+                    arrow::array::Int32Array::from(svc_keys),
+                    Arc::new(svc_values),
+                )
+                .expect("svc dict"),
+            )
+        };
+        let timeseries_id: ArrayRef = Arc::new(Int64Array::from(tsids));
+        let value: ArrayRef = Arc::new(Float64Array::from(values));
+
+        let batch = RecordBatch::try_new(
+            schema.clone(),
+            vec![
+                metric_name,
+                timestamp_secs,
+                sorted_series,
+                metric_type,
+                service,
+                timeseries_id,
+                value,
+            ],
+        )
+        .expect("batch");
+
+        // Write as a single-RG legacy file (prefix_len absent → 0).
+        let cfg = ParquetWriterConfig {
+            compression: Compression::Snappy,
+            ..ParquetWriterConfig::default()
+        };
+        let sort_fields = "metric_name|-timestamp_secs/V2";
+        let kvs = vec![
+            KeyValue::new(
+                PARQUET_META_SORT_FIELDS.to_string(),
+                sort_fields.to_string(),
+            ),
+            KeyValue::new(
+                PARQUET_META_WINDOW_START.to_string(),
+                "1700000000".to_string(),
+            ),
+            KeyValue::new(PARQUET_META_WINDOW_DURATION.to_string(), "60".to_string()),
+            KeyValue::new(PARQUET_META_NUM_MERGE_OPS.to_string(), "0".to_string()),
+        ];
+        let sort_field_names = vec!["metric_name".to_string(), "timestamp_secs".to_string()];
+        let props: WriterProperties = cfg.to_writer_properties_with_metadata(
+            &schema,
+            Vec::new(),
+            Some(kvs),
+            &sort_field_names,
+        );
+        let mut buf: Vec<u8> = Vec::new();
+        let mut writer = ArrowWriter::try_new(&mut buf, schema, Some(props)).expect("arrow writer");
+        writer.write(&batch).expect("write");
+        writer.close().expect("close");
+        Bytes::from(buf)
+    }
+
+    /// Regression for Codex P2 on PR-6410: prefix_len=0 inputs +
+    /// `num_outputs > 1` previously folded into a single oversized
+    /// output because the region-to-output assigner only split at
+    /// region boundaries and prefix_len=0 produces exactly one region.
+    /// The engine now subdivides the single region at sorted_series
+    /// transitions so it can honor `num_outputs`. The output inherits
+    /// the input's `rg_partition_prefix_len` (= 0 here) — the engine
+    /// does not declare a prefix it can't unconditionally guarantee.
+    #[tokio::test]
+    async fn test_prefix_len_zero_multi_output_splits_at_sorted_series() {
+        // 6 distinct metric_names × 50 rows = 300 rows total.
+        // num_outputs = 3 → target 100 rows/output. Splits land at
+        // sorted_series transitions (each row has a unique
+        // sorted_series in this fixture).
+        let metrics = [
+            ("aaa.alpha", 50usize),
+            ("bbb.beta", 50),
+            ("ccc.gamma", 50),
+            ("ddd.delta", 50),
+            ("eee.epsilon", 50),
+            ("fff.zeta", 50),
+        ];
+        let bytes = make_multi_metric_single_rg_input(&metrics);
+        let inputs: Vec<Box<dyn ColumnPageStream>> = vec![open_stream(bytes).await];
+
+        let tmp = TempDir::new().expect("tmpdir");
+        let outputs = streaming_merge_sorted_parquet_files(inputs, tmp.path(), &merge_config(3))
+            .await
+            .expect("merge");
+
+        assert_eq!(
+            outputs.len(),
+            3,
+            "prefix_len=0 input + num_outputs=3 must produce 3 output files; got {} ({} rows \
+             total)",
+            outputs.len(),
+            outputs.iter().map(|o| o.num_rows).sum::<usize>(),
+        );
+        let total: usize = outputs.iter().map(|o| o.num_rows).sum();
+        assert_eq!(total, 300, "rows preserved (MC-1)");
+
+        for out in &outputs {
+            let bytes_out = std::fs::read(&out.path).expect("read");
+            let reader = SerializedFileReader::new(Bytes::from(bytes_out)).expect("ser");
+            // Inherits input's prefix_len = 0 → KV absent.
+            let prefix_kv = reader
+                .metadata()
+                .file_metadata()
+                .key_value_metadata()
+                .and_then(|kvs| {
+                    kvs.iter()
+                        .find(|k| k.key == PARQUET_META_RG_PARTITION_PREFIX_LEN)
+                        .and_then(|k| k.value.clone())
+                });
+            assert!(
+                prefix_kv.is_none(),
+                "prefix_len=0 input must produce prefix_len=0 output (KV absent); got \
+                 {prefix_kv:?}",
+            );
+            // Each sub-region produces one output RG.
+            assert_eq!(
+                reader.metadata().num_row_groups(),
+                1,
+                "expected one output RG per sub-region; got {}",
+                reader.metadata().num_row_groups(),
+            );
+        }
+    }
+
+    /// Giant single-metric input: prefix_len=0, only one metric_name,
+    /// so there are NO metric_name transitions in the merge order.
+    /// Splitting must still honor `num_outputs` by breaking at
+    /// sorted_series transitions inside the single metric. Confirms
+    /// the engine does not require prefix-value transitions to
+    /// honor the requested output count.
+    #[tokio::test]
+    async fn test_prefix_len_zero_giant_single_metric_splits_into_multiple_outputs() {
+        let metrics = [("cpu.usage", 200usize)];
+        let bytes = make_multi_metric_single_rg_input(&metrics);
+        let inputs: Vec<Box<dyn ColumnPageStream>> = vec![open_stream(bytes).await];
+
+        let tmp = TempDir::new().expect("tmpdir");
+        let outputs = streaming_merge_sorted_parquet_files(inputs, tmp.path(), &merge_config(2))
+            .await
+            .expect("merge");
+
+        assert_eq!(
+            outputs.len(),
+            2,
+            "one metric × num_outputs=2 must still split (at sorted_series boundaries); got {}",
+            outputs.len(),
+        );
+        let total: usize = outputs.iter().map(|o| o.num_rows).sum();
+        assert_eq!(total, 200, "rows preserved");
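+        // Both outputs must be non-empty. The target is roughly 100
+        // rows each, but this loop only checks that no output is empty
+        // or exceeds the input total; exact balance is not asserted.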
+        for out in &outputs {
+            assert!(
+                out.num_rows > 0 && out.num_rows <= 200,
+                "output rows = {}",
+                out.num_rows
+            );
+        }
+    }
+
+    /// `prefix_len=0` + `num_outputs=1`: the whole region fits in one
+    /// output, no splitting needed. `process_region` drains
+    /// internally (no pre-drain) and produces a single output RG.
+    #[tokio::test]
+    async fn test_prefix_len_zero_single_output_is_single_rg() {
+        let metrics = [
+            ("cpu.usage", 40usize),
+            ("memory.used", 40),
+            ("net.bytes", 40),
+        ];
+        let bytes = make_multi_metric_single_rg_input(&metrics);
+        let inputs: Vec<Box<dyn ColumnPageStream>> = vec![open_stream(bytes).await];
+
+        let tmp = TempDir::new().expect("tmpdir");
+        let outputs = streaming_merge_sorted_parquet_files(inputs, tmp.path(), &merge_config(1))
+            .await
+            .expect("merge");
+        assert_eq!(outputs.len(), 1);
+        assert_eq!(outputs[0].num_rows, 120);
+
+        let bytes_out = std::fs::read(&outputs[0].path).expect("read");
+        let reader = SerializedFileReader::new(Bytes::from(bytes_out)).expect("ser");
+        assert_eq!(
+            reader.metadata().num_row_groups(),
+            1,
+            "single output, no split → single RG (engine does not synthesize prefix alignment \
+             when inputs declare prefix_len=0)",
+        );
+        let prefix_kv = reader
+            .metadata()
+            .file_metadata()
+            .key_value_metadata()
+            .and_then(|kvs| {
+                kvs.iter()
+                    .find(|k| k.key == PARQUET_META_RG_PARTITION_PREFIX_LEN)
+                    .and_then(|k| k.value.clone())
+            });
+        assert!(
+            prefix_kv.is_none(),
+            "output must inherit input's prefix_len = 0 (KV absent); got {prefix_kv:?}",
+        );
+    }
+
+    /// Build a multi-RG fixture where each RG has a single
+    /// (metric_name, service) tuple. `rg_partition_prefix_len = 2`
+    /// declares the alignment.
+    fn make_prefix_len_two_input(rgs: &[(&str, &str)], rows_per_rg: usize) -> Bytes {
+        let schema = Arc::new(ArrowSchema::new(vec![
+            Field::new("metric_name", DataType::Utf8, false),
+            Field::new("service", DataType::Utf8, false),
+            Field::new("timestamp_secs", DataType::UInt64, false),
+            Field::new("sorted_series", DataType::Binary, false),
+            Field::new("value", DataType::Float64, false),
+        ]));
+
+        let make_batch = |metric: &str, service: &str, start_series: u64| -> RecordBatch {
+            let metric_name: ArrayRef = Arc::new(StringArray::from(vec![metric; rows_per_rg]));
+            let svc: ArrayRef = Arc::new(StringArray::from(vec![service; rows_per_rg]));
+            let timestamps: Vec<u64> = (0..rows_per_rg as u64).map(|i| 1_700_000_000 + i).collect();
+            let timestamp_secs: ArrayRef = Arc::new(UInt64Array::from(timestamps));
+            let series: Vec<Vec<u8>> = (0..rows_per_rg as u64)
+                .map(|i| (start_series + i).to_be_bytes().to_vec())
+                .collect();
+            let sorted_series: ArrayRef = Arc::new(BinaryArray::from(
+                series.iter().map(|v| v.as_slice()).collect::<Vec<_>>(),
+            ));
+            let value: ArrayRef = Arc::new(Float64Array::from(
+                (0..rows_per_rg).map(|i| i as f64).collect::<Vec<_>>(),
+            ));
+            RecordBatch::try_new(
+                schema.clone(),
+                vec![metric_name, svc, timestamp_secs, sorted_series, value],
+            )
+            .expect("batch")
+        };
+
+        let cfg = ParquetWriterConfig {
+            compression: Compression::Snappy,
+            row_group_size: rows_per_rg, // one RG per batch
+            ..ParquetWriterConfig::default()
+        };
+        let kvs = vec![
+            KeyValue::new(
+                PARQUET_META_SORT_FIELDS.to_string(),
+                "metric_name|service|-timestamp_secs/V2".to_string(),
+            ),
+            KeyValue::new(
+                PARQUET_META_WINDOW_START.to_string(),
+                "1700000000".to_string(),
+            ),
+            KeyValue::new(PARQUET_META_WINDOW_DURATION.to_string(), "60".to_string()),
+            KeyValue::new(PARQUET_META_NUM_MERGE_OPS.to_string(), "0".to_string()),
+            KeyValue::new(
+                PARQUET_META_RG_PARTITION_PREFIX_LEN.to_string(),
+                "2".to_string(),
+            ),
+        ];
+        let sorting_cols = vec![
+            parquet::file::metadata::SortingColumn {
+                column_idx: 0,
+                descending: false,
+                nulls_first: false,
+            },
+            parquet::file::metadata::SortingColumn {
+                column_idx: 1,
+                descending: false,
+                nulls_first: false,
+            },
+        ];
+        let props: WriterProperties = cfg.to_writer_properties_with_metadata(
+            &schema,
+            sorting_cols,
+            Some(kvs),
+            &["metric_name".to_string(), "service".to_string()],
+        );
+        let mut buf: Vec<u8> = Vec::new();
+        let mut writer =
+            ArrowWriter::try_new(&mut buf, schema.clone(), Some(props)).expect("arrow writer");
+        for (i, (metric, service)) in rgs.iter().enumerate() {
+            let batch = make_batch(metric, service, (i as u64) * rows_per_rg as u64);
+            writer.write(&batch).expect("write");
+        }
+        writer.close().expect("close");
+        Bytes::from(buf)
+    }
+
+    /// End-to-end regression for DESC prefix columns. Three RGs with
+    /// the same metric_name (ASC) and distinct `env` values; sort
+    /// schema declares env DESC. The input file must itself be
+    /// DESC-sorted on env (RGs in physical order staging → prod →
+    /// dev) because the streaming engine processes each input's RGs
+    /// in physical order; the BTreeMap-driven region order is the
+    /// thing the composite key + `invert_for_descending` controls,
+    /// and it must agree with the input's physical RG order for the
+    /// engine to drain sort cols in lockstep with the body cols.
+    ///
+    /// Regions must therefore come out in DESC order on env:
+    /// staging → prod → dev. Without the `invert_for_descending`
+    /// step the BTreeMap would emit dev → prod → staging, which
+    /// would disagree with the physical RG order and the engine
+    /// would bail with "page from rg 0 while draining sort cols of
+    /// rg 2".
+    #[tokio::test]
+    async fn test_streaming_merge_with_desc_prefix_col() {
+        let bytes = make_prefix_len_two_input_with_directions(
+            // (metric, env, marker_value). Each RG's body `value`
+            // column is filled with `marker_value` so we can identify
+            // which RG produced each output row group. Order is
+            // env-DESC physical: staging, prod, dev.
+            &[
+                ("cpu.usage", "staging", 3.0),
+                ("cpu.usage", "prod", 2.0),
+                ("cpu.usage", "dev", 1.0),
+            ],
+            20,
+            /* env_descending */ true,
+        );
+        let inputs: Vec<Box<dyn ColumnPageStream>> = vec![open_stream(bytes).await];
+
+        let tmp = TempDir::new().expect("tmpdir");
+        let outputs = streaming_merge_sorted_parquet_files(inputs, tmp.path(), &merge_config(1))
+            .await
+            .expect("merge with env DESC");
+        assert_eq!(outputs.len(), 1);
+
+        // Read the output and inspect the per-RG marker values: each
+        // RG should be filled with a single marker, and the RG order
+        // must be staging (3.0) → prod (2.0) → dev (1.0).
+        let bytes_out = std::fs::read(&outputs[0].path).expect("read");
+        let reader = SerializedFileReader::new(Bytes::from(bytes_out)).expect("ser");
+        let meta = reader.metadata();
+        assert_eq!(
+            meta.num_row_groups(),
+            3,
+            "three distinct env values → three RGs"
+        );
+
+        let merged = read_output_to_record_batch(&outputs[0].path);
+        let value = merged
+            .column(merged.schema().index_of("value").expect("value"))
+            .as_any()
+            .downcast_ref::<Float64Array>()
+            .expect("Float64");
+        // RG 0 spans rows 0..20, RG 1 spans 20..40, RG 2 spans 40..60.
+        // Within each RG the marker_value is constant.
+        let first_block = value.value(0);
+        let second_block = value.value(20);
+        let third_block = value.value(40);
+        assert!(
+            (first_block - 3.0).abs() < 1e-9,
+            "first output RG should be 'staging' (marker 3.0), got {first_block}",
+        );
+        assert!(
+            (second_block - 2.0).abs() < 1e-9,
+            "second output RG should be 'prod' (marker 2.0), got {second_block}",
+        );
+        assert!(
+            (third_block - 1.0).abs() < 1e-9,
+            "third output RG should be 'dev' (marker 1.0), got {third_block}",
+        );
+
+        // F2 chunk-level verification: each output RG must be aligned
+        // on (metric_name, -env). PA-1 + PA-3 read from chunk stats.
+        assert_unique_rg_prefix_keys(
+            reader.metadata(),
+            "metric_name|-env|-timestamp_secs/V2",
+            2,
+            "test_streaming_merge_with_desc_prefix_col output",
+        )
+        .expect("DESC prefix output must satisfy PA-1 + PA-3");
+    }
+
+    /// Regression for the composite-key encoding when ASC and DESC
+    /// columns are interleaved. metric_name ASC + env DESC: composite
+    /// keys for ("cpu.usage", "dev") and ("cpu.usage", "prod") must
+    /// put 'prod' before 'dev' (because prod > dev in ASC lex, so
+    /// prod's DESC encoding sorts smaller).
+    #[test]
+    fn test_extract_rg_composite_prefix_key_mixed_directions() {
+        let bytes = make_prefix_len_two_input_with_directions(
+            &[("cpu.usage", "dev", 0.0), ("cpu.usage", "prod", 0.0)],
+            5,
+            /* env_descending */ true,
+        );
+        let reader = SerializedFileReader::new(bytes).expect("ser");
+        let metadata = reader.metadata();
+        let prefix_cols =
+            find_prefix_parquet_col_indices(metadata, "metric_name|-env|-timestamp_secs/V2", 2, 0)
+                .expect("resolve");
+        // Sanity: the second prefix column must be flagged DESC.
+        assert!(
+            prefix_cols[1]
+                .as_ref()
+                .expect("env present in this fixture")
+                .descending,
+            "env must be parsed as DESC from sort schema",
+        );
+
+        let key_dev = extract_rg_composite_prefix_key(metadata, 0, &prefix_cols, 0).expect("dev");
+        let key_prod = extract_rg_composite_prefix_key(metadata, 1, &prefix_cols, 0).expect("prod");
+        // metric_name is the same; env differs. DESC on env means
+        // 'prod' (larger lex) should encode to LESS-THAN 'dev', so
+        // the BTreeMap iterates prod first.
+        assert!(
+            key_prod < key_dev,
+            "with env DESC, composite key for 'prod' must lex-sort before 'dev'",
+        );
+    }
+
+    /// MS-2: a file whose physical RG order disagrees with the
+    /// order derived from the composite-key encoding must be rejected
+    /// upfront at `extract_regions_from_metadata` time, not bail
+    /// later from inside `process_region`.
+    ///
+    /// The fixture declares env DESC but physically writes its RGs in
+    /// ASC env order: `[dev, prod, staging]`. DESC iteration order is
+    /// `[staging (RG 2), prod (RG 1), dev (RG 0)]`, so the first region
+    /// the engine would try to drain is RG 2, while the input stream
+    /// reaches RG 0 first.
+    #[tokio::test]
+    async fn test_ms2_region_order_disagrees_with_physical_rg_order_rejected() {
+        let bytes = make_prefix_len_two_input_with_directions(
+            // Physical order ASC on env: dev, prod, staging. But the
+            // sort schema below declares env DESC.
+            &[
+                ("cpu.usage", "dev", 1.0),
+                ("cpu.usage", "prod", 2.0),
+                ("cpu.usage", "staging", 3.0),
+            ],
+            10,
+            /* env_descending */ true,
+        );
+        let inputs: Vec<Box<dyn ColumnPageStream>> = vec![open_stream(bytes).await];
+
+        let tmp = TempDir::new().expect("tmpdir");
+        let err = streaming_merge_sorted_parquet_files(inputs, tmp.path(), &merge_config(1))
+            .await
+            .expect_err("region order vs physical RG order mismatch must be rejected");
+        let s = err.to_string();
+        assert!(
+            s.contains("disagrees with input") && s.contains("physical row order"),
+            "expected MS-2 rejection message, got: {s}",
+        );
+    }
+
+    /// Build a multi-RG fixture whose second sort col can be flagged
+    /// DESC. Sort schema written into KV metadata: either
+    /// `metric_name|env|-timestamp_secs/V2` or
+    /// `metric_name|-env|-timestamp_secs/V2`.
+    fn make_prefix_len_two_input_with_directions(
+        rgs: &[(&str, &str, f64)],
+        rows_per_rg: usize,
+        env_descending: bool,
+    ) -> Bytes {
+        let schema = Arc::new(ArrowSchema::new(vec![
+            Field::new("metric_name", DataType::Utf8, false),
+            Field::new("env", DataType::Utf8, false),
+            Field::new("timestamp_secs", DataType::UInt64, false),
+            Field::new("sorted_series", DataType::Binary, false),
+            Field::new("value", DataType::Float64, false),
+        ]));
+
+        let make_batch = |metric: &str, env: &str, marker: f64, start_series: u64| -> RecordBatch {
+            let metric_name: ArrayRef = Arc::new(StringArray::from(vec![metric; rows_per_rg]));
+            let env_arr: ArrayRef = Arc::new(StringArray::from(vec![env; rows_per_rg]));
+            let timestamps: Vec<u64> = (0..rows_per_rg as u64)
+                // Timestamps DESC within the RG to match the DESC sort.
+                .map(|i| 1_700_000_000 + (rows_per_rg as u64 - i))
+                .collect();
+            let timestamp_secs: ArrayRef = Arc::new(UInt64Array::from(timestamps));
+            let series: Vec<Vec<u8>> = (0..rows_per_rg as u64)
+                .map(|i| (start_series + i).to_be_bytes().to_vec())
+                .collect();
+            let sorted_series: ArrayRef = Arc::new(BinaryArray::from(
+                series.iter().map(|v| v.as_slice()).collect::<Vec<_>>(),
+            ));
+            let value: ArrayRef = Arc::new(Float64Array::from(vec![marker; rows_per_rg]));
+            RecordBatch::try_new(
+                schema.clone(),
+                vec![metric_name, env_arr, timestamp_secs, sorted_series, value],
+            )
+            .expect("batch")
+        };
+
+        let env_token = if env_descending { "-env" } else { "env" };
+        let sort_fields = format!("metric_name|{env_token}|-timestamp_secs/V2");
+
+        let cfg = ParquetWriterConfig {
+            compression: Compression::Snappy,
+            row_group_size: rows_per_rg,
+            ..ParquetWriterConfig::default()
+        };
+        let kvs = vec![
+            KeyValue::new(PARQUET_META_SORT_FIELDS.to_string(), sort_fields),
+            KeyValue::new(
+                PARQUET_META_WINDOW_START.to_string(),
+                "1700000000".to_string(),
+            ),
+            KeyValue::new(PARQUET_META_WINDOW_DURATION.to_string(), "60".to_string()),
+            KeyValue::new(PARQUET_META_NUM_MERGE_OPS.to_string(), "0".to_string()),
+            KeyValue::new(
+                PARQUET_META_RG_PARTITION_PREFIX_LEN.to_string(),
+                "2".to_string(),
+            ),
+        ];
+        let sorting_cols = vec![
+            parquet::file::metadata::SortingColumn {
+                column_idx: 0,
+                descending: false,
+                nulls_first: false,
+            },
+            parquet::file::metadata::SortingColumn {
+                column_idx: 1,
+                descending: env_descending,
+                nulls_first: false,
+            },
+        ];
+        let props: WriterProperties = cfg.to_writer_properties_with_metadata(
+            &schema,
+            sorting_cols,
+            Some(kvs),
+            &["metric_name".to_string(), "env".to_string()],
+        );
+        let mut buf: Vec<u8> = Vec::new();
+        let mut writer =
+            ArrowWriter::try_new(&mut buf, schema.clone(), Some(props)).expect("arrow writer");
+        for (i, (metric, env, marker)) in rgs.iter().enumerate() {
+            let batch = make_batch(metric, env, *marker, (i as u64) * rows_per_rg as u64);
+            writer.write(&batch).expect("write");
+        }
+        writer.close().expect("close");
+        Bytes::from(buf)
+    }
+
+    /// `extract_aligned_prefix_value` rejects an RG whose prefix
+    /// column has `min != max` — those RGs are not actually aligned
+    /// on the prefix value and grouping them into one region would be
+    /// silently wrong.
+    #[test]
+    fn test_composite_key_rejects_non_aligned_rg() {
+        // A single RG whose `metric_name` carries two distinct values:
+        // min ("cpu.usage") != max ("memory.used"). The composite-key
+        // extractor must refuse to mint a key for this RG.
+        let bytes = make_misaligned_metric_name_input(&[("cpu.usage", "memory.used")], 20);
+        let reader = SerializedFileReader::new(bytes).expect("ser");
+        let metadata = reader.metadata();
+        let prefix_cols =
+            find_prefix_parquet_col_indices(metadata, "metric_name|-timestamp_secs/V2", 1, 0)
+                .expect("resolve");
+        let err = extract_rg_composite_prefix_key(metadata, 0, &prefix_cols, 0)
+            .expect_err("RG with min != max on prefix col must be rejected");
+        let s = err.to_string();
+        assert!(
+            s.contains("NOT prefix-aligned"),
+            "expected misalignment error, got: {s}",
+        );
+    }
+
+    /// Build a single-RG fixture whose `metric_name` column contains
+    /// two distinct values within the same RG. Used to exercise the
+    /// `min != max` rejection path in
+    /// `extract_aligned_prefix_value`.
+    fn make_misaligned_metric_name_input(names: &[(&str, &str)], rows_per_run: usize) -> Bytes {
+        let schema = Arc::new(ArrowSchema::new(vec![
+            Field::new("metric_name", DataType::Utf8, false),
+            Field::new("timestamp_secs", DataType::UInt64, false),
+            Field::new("sorted_series", DataType::Binary, false),
+            Field::new("value", DataType::Float64, false),
+        ]));
+
+        let total = rows_per_run * 2 * names.len();
+        let mut metric_values: Vec<&str> = Vec::with_capacity(total);
+        for (a, b) in names {
+            for _ in 0..rows_per_run {
+                metric_values.push(*a);
+            }
+            for _ in 0..rows_per_run {
+                metric_values.push(*b);
+            }
+        }
+        let metric_name: ArrayRef = Arc::new(StringArray::from(metric_values));
+        let timestamps: Vec<u64> = (0..total as u64).map(|i| 1_700_000_000 + i).collect();
+        let timestamp_secs: ArrayRef = Arc::new(UInt64Array::from(timestamps));
+        let series: Vec<Vec<u8>> = (0..total as u64)
+            .map(|i| i.to_be_bytes().to_vec())
+            .collect();
+        let sorted_series: ArrayRef = Arc::new(BinaryArray::from(
+            series.iter().map(|v| v.as_slice()).collect::<Vec<_>>(),
+        ));
+        let value: ArrayRef = Arc::new(Float64Array::from(
+            (0..total).map(|i| i as f64).collect::<Vec<_>>(),
+        ));
+
+        let batch = RecordBatch::try_new(
+            schema.clone(),
+            vec![metric_name, timestamp_secs, sorted_series, value],
+        )
+        .expect("batch");
+
+        let cfg = ParquetWriterConfig {
+            compression: Compression::Snappy,
+            ..ParquetWriterConfig::default()
+        };
+        let kvs = vec![
+            KeyValue::new(
+                PARQUET_META_SORT_FIELDS.to_string(),
+                "metric_name|-timestamp_secs/V2".to_string(),
+            ),
+            KeyValue::new(
+                PARQUET_META_WINDOW_START.to_string(),
+                "1700000000".to_string(),
+            ),
+            KeyValue::new(PARQUET_META_WINDOW_DURATION.to_string(), "60".to_string()),
+            KeyValue::new(PARQUET_META_NUM_MERGE_OPS.to_string(), "0".to_string()),
+            KeyValue::new(
+                PARQUET_META_RG_PARTITION_PREFIX_LEN.to_string(),
+                "1".to_string(),
+            ),
+        ];
+        let props: WriterProperties = cfg.to_writer_properties_with_metadata(
+            &schema,
+            Vec::new(),
+            Some(kvs),
+            &["metric_name".to_string(), "timestamp_secs".to_string()],
+        );
+        let mut buf: Vec<u8> = Vec::new();
+        let mut writer =
+            ArrowWriter::try_new(&mut buf, schema.clone(), Some(props)).expect("arrow writer");
+        writer.write(&batch).expect("write");
+        writer.close().expect("close");
+        Bytes::from(buf)
+    }
+
+    /// Build a parquet fixture with TWO row groups, each containing
+    /// rows of one distinct metric_name. RG 0 = "cpu.usage" × 30 rows,
+    /// RG 1 = "memory.used" × 30 rows. `rg_partition_prefix_len = 1`
+    /// declares metric_name alignment.
+    fn make_two_metric_aligned_input() -> Bytes {
+        let dict_type = DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8));
+        let schema = Arc::new(ArrowSchema::new(vec![
+            Field::new("metric_name", dict_type.clone(), false),
+            Field::new("timestamp_secs", DataType::UInt64, false),
+            Field::new("sorted_series", DataType::Binary, false),
+            Field::new("metric_type", DataType::UInt8, false),
+            Field::new("service", dict_type, true),
+            Field::new("timeseries_id", DataType::Int64, false),
+            Field::new("value", DataType::Float64, false),
+        ]));
+
+        let make_batch = |metric_key: i32, start_series: u64, rows: usize| -> RecordBatch {
+            let metric_keys: Vec<i32> = vec![metric_key; rows];
+            let metric_values = StringArray::from(vec!["cpu.usage", "memory.used"]);
+            let metric_name: ArrayRef = Arc::new(
+                DictionaryArray::<Int32Type>::try_new(
+                    arrow::array::Int32Array::from(metric_keys),
+                    Arc::new(metric_values),
+                )
+                .expect("dict"),
+            );
+            let timestamps: Vec<u64> = (0..rows as u64)
+                .map(|i| 1_700_000_000 + (rows as u64 - i))
+                .collect();
+            let timestamp_secs: ArrayRef = Arc::new(UInt64Array::from(timestamps));
+            let mut series_bytes: Vec<Vec<u8>> = Vec::with_capacity(rows);
+            for i in 0..rows as u64 {
+                series_bytes.push((start_series + i).to_be_bytes().to_vec());
+            }
+            let series_refs: Vec<&[u8]> = series_bytes.iter().map(|v| v.as_slice()).collect();
+            let sorted_series: ArrayRef = Arc::new(BinaryArray::from(series_refs));
+            let metric_type: ArrayRef = Arc::new(UInt8Array::from(vec![0u8; rows]));
+            let svc_values = StringArray::from(vec!["api", "db", "cache"]);
+            let svc_keys: Vec<Option<i32>> = (0..rows as i32)
+                .map(|i| if i % 5 == 0 { None } else { Some(i % 3) })
+                .collect();
+            let service: ArrayRef = Arc::new(
+                DictionaryArray::<Int32Type>::try_new(
+                    arrow::array::Int32Array::from(svc_keys),
+                    Arc::new(svc_values),
+                )
+                .expect("dict"),
+            );
+            let tsids: Vec<i64> = (0..rows as i64).map(|i| 1000 + i).collect();
+            let timeseries_id: ArrayRef = Arc::new(Int64Array::from(tsids));
+            let values: Vec<f64> = (0..rows).map(|i| i as f64).collect();
+            let value: ArrayRef = Arc::new(Float64Array::from(values));
+
+            RecordBatch::try_new(
+                schema.clone(),
+                vec![
+                    metric_name,
+                    timestamp_secs,
+                    sorted_series,
+                    metric_type,
+                    service,
+                    timeseries_id,
+                    value,
+                ],
+            )
+            .expect("batch")
+        };
+
+        let batch_cpu = make_batch(0, 0, 30);
+        let batch_mem = make_batch(1, 100, 30);
+
+        let cfg = ParquetWriterConfig {
+            compression: Compression::Snappy,
+            row_group_size: 30, // one RG per metric_name
+            ..ParquetWriterConfig::default()
+        };
+        let kvs = vec![
+            KeyValue::new(
+                PARQUET_META_SORT_FIELDS.to_string(),
+                "metric_name|-timestamp_secs/V2".to_string(),
+            ),
+            KeyValue::new(
+                PARQUET_META_WINDOW_START.to_string(),
+                "1700000000".to_string(),
+            ),
+            KeyValue::new(PARQUET_META_WINDOW_DURATION.to_string(), "60".to_string()),
+            KeyValue::new(PARQUET_META_NUM_MERGE_OPS.to_string(), "0".to_string()),
+            // `prefix_len = 1` declares metric_name alignment.
+            KeyValue::new(
+                PARQUET_META_RG_PARTITION_PREFIX_LEN.to_string(),
+                "1".to_string(),
+            ),
+        ];
+        let sorting_cols = vec![
+            parquet::file::metadata::SortingColumn {
+                column_idx: 0,
+                descending: false,
+                nulls_first: false,
+            },
+            parquet::file::metadata::SortingColumn {
+                column_idx: 1,
+                descending: true,
+                nulls_first: false,
+            },
+        ];
+        let props: WriterProperties = cfg.to_writer_properties_with_metadata(
+            &schema,
+            sorting_cols,
+            Some(kvs),
+            &["metric_name".to_string(), "timestamp_secs".to_string()],
+        );
+        let mut buf: Vec<u8> = Vec::new();
+        let mut writer = ArrowWriter::try_new(&mut buf, schema, Some(props)).expect("arrow writer");
+        writer.write(&batch_cpu).expect("write cpu");
+        writer.write(&batch_mem).expect("write mem");
+        writer.close().expect("close");
+        Bytes::from(buf)
+    }
+
+    /// Regression for Codex P1 on PR-6409 (the empty-input half): a
+    /// zero-row-group input mixed with a populated one must not panic
+    /// the body-column path. Phase 0 already accepts empty inputs;
+    /// PR-6c.2's per-region engine only iterates `region.contributing`
+    /// inputs for body cols, but verify directly so any future change
+    /// that broadens the iteration is caught.
+    #[tokio::test]
+    async fn test_zero_row_input_mixed_with_non_empty() {
+        let empty = make_sorted_batch(0, 0);
+        let populated = make_sorted_batch(50, 0);
+        let bytes_empty = write_input_parquet(std::slice::from_ref(&empty), &[]);
+        let bytes_populated = write_input_parquet(std::slice::from_ref(&populated), &[]);
+
+        let inputs: Vec<Box<dyn ColumnPageStream>> = vec![
+            open_stream(bytes_empty).await,
+            open_stream(bytes_populated).await,
+        ];
+
+        let tmp = TempDir::new().expect("tmpdir");
+        let outputs = streaming_merge_sorted_parquet_files(inputs, tmp.path(), &merge_config(1))
+            .await
+            .expect("merge with mixed empty + populated inputs");
+        assert_eq!(outputs.len(), 1);
+        assert_eq!(outputs[0].num_rows, 50);
+
+        let merged = read_output_to_record_batch(&outputs[0].path);
+        let value = merged
+            .column(merged.schema().index_of("value").expect("value col"))
+            .as_any()
+            .downcast_ref::<Float64Array>()
+            .expect("Float64");
+        for i in 0..50 {
+            assert!(
+                (value.value(i) - i as f64).abs() < 1e-9,
+                "row {i}: expected {i}, got {}",
+                value.value(i),
+            );
+        }
+    }
+
+    /// Write a fixture parquet file where each body column is forced
+    /// to span multiple parquet data pages by pinning a small
+    /// `data_page_row_count_limit`. The merge engine must read those
+    /// pages back via a single persistent `StreamDecoder` per input —
+    /// reconstructing the decoder for each `fill_page_cache_to_row`
+    /// call (the pre-fix behaviour) would reset the per-column
+    /// `rows_decoded` counter, making `DecodedPage::row_start` reset
+    /// to zero on every page after the first.
+    fn write_input_parquet_with_small_pages(
+        batches: &[RecordBatch],
+        data_page_row_count_limit: usize,
+    ) -> Bytes {
+        let schema = batches[0].schema();
+        // Lower `write_batch_size` and `data_page_size` so the arrow
+        // writer actually flushes pages at the row-count boundary.
+        // With the defaults (`write_batch_size = 64 KiB`,
+        // `data_page_size = 1 MiB`) the byte-size threshold doesn't
+        // trip for our small fixtures and the writer emits one page
+        // per column regardless of `data_page_row_count_limit`.
+        let cfg = ParquetWriterConfig {
+            compression: Compression::Snappy,
+            data_page_row_count_limit,
+            data_page_size: data_page_row_count_limit * 16,
+            write_batch_size: data_page_row_count_limit,
+            ..ParquetWriterConfig::default()
+        };
+        let sort_fields = "metric_name|-timestamp_secs/V2";
+        let sort_field_names = vec!["metric_name".to_string(), "timestamp_secs".to_string()];
+        let kvs = vec![
+            KeyValue::new(
+                PARQUET_META_SORT_FIELDS.to_string(),
+                sort_fields.to_string(),
+            ),
+            KeyValue::new(
+                PARQUET_META_WINDOW_START.to_string(),
+                "1700000000".to_string(),
+            ),
+            KeyValue::new(PARQUET_META_WINDOW_DURATION.to_string(), "60".to_string()),
+            KeyValue::new(PARQUET_META_NUM_MERGE_OPS.to_string(), "0".to_string()),
+        ];
+        let sorting_cols = vec![
+            parquet::file::metadata::SortingColumn {
+                column_idx: schema.index_of("metric_name").expect("test schema") as i32,
+                descending: false,
+                nulls_first: false,
+            },
+            parquet::file::metadata::SortingColumn {
+                column_idx: schema.index_of("timestamp_secs").expect("test schema") as i32,
+                descending: true,
+                nulls_first: false,
+            },
+        ];
+        let props: WriterProperties = cfg.to_writer_properties_with_metadata(
+            &schema,
+            sorting_cols,
+            Some(kvs),
+            &sort_field_names,
+        );
+        let mut buf: Vec<u8> = Vec::new();
+        let mut writer = ArrowWriter::try_new(&mut buf, schema, Some(props)).expect("arrow writer");
+        for b in batches {
+            writer.write(b).expect("test write");
+        }
+        writer.close().expect("test close");
+        Bytes::from(buf)
+    }
+
+    /// Regression for Codex P1 on PR-6409: when a body column spans
+    /// multiple input pages, every page-fetch must come from the same
+    /// long-lived `StreamDecoder` so its per-column `rows_decoded`
+    /// counter keeps producing absolute row offsets. Before the fix,
+    /// each `fill_page_cache_to_row` call instantiated a fresh decoder
+    /// whose counter started at zero — the *second* decoded page
+    /// reported `row_start = 0` and the page cache's
+    /// `(input_row - cache_start)` indexing landed on the wrong rows
+    /// (or panicked on out-of-bounds).
+    #[tokio::test]
+    async fn test_body_col_multi_input_page_preserves_row_start() {
+        // The bug only surfaces when `assemble_one_output_page` is
+        // called more than once per output (so `fill_page_cache_to_row`
+        // is invoked repeatedly with a non-empty cache). That means we
+        // need more than `OUTPUT_PAGE_ROWS` (=1024) total input rows
+        // for a single output. 2500 rows ÷ 50-row input pages =
+        // 50 body-col pages per column chunk; the three output pages
+        // (1024 + 1024 + 452 rows) each trigger a separate decoder advance.
+        let total_rows = 2500;
+        let batch = make_sorted_batch(total_rows, 0);
+        let bytes = write_input_parquet_with_small_pages(std::slice::from_ref(&batch), 50);
+        let inputs: Vec<Box<dyn ColumnPageStream>> = vec![open_stream(bytes).await];
+
+        let tmp = TempDir::new().expect("tmpdir");
+        let outputs = streaming_merge_sorted_parquet_files(inputs, tmp.path(), &merge_config(1))
+            .await
+            .expect("merge");
+        assert_eq!(outputs.len(), 1);
+        assert_eq!(outputs[0].num_rows, total_rows);
+
+        let merged = read_output_to_record_batch(&outputs[0].path);
+        let value_idx = merged.schema().index_of("value").expect("value col");
+        let value = merged
+            .column(value_idx)
+            .as_any()
+            .downcast_ref::<Float64Array>()
+            .expect("value col is Float64");
+
+        // `make_sorted_batch` fills `value` with `i as f64` (see the
+        // fixture). Timestamps descend in input row order, matching the
+        // sort key (timestamp_secs DESC), so the merge with a single
+        // input is the identity permutation.
+        for i in 0..total_rows {
+            let expected = i as f64;
+            let got = value.value(i);
+            assert!(
+                (expected - got).abs() < 1e-9,
+                "row {i}: expected value {expected}, got {got} — body col page cache reported \
+                 wrong row_start",
+            );
+        }
+    }
+
+    /// Legacy multi-RG input (prefix_len=0, num_RGs>1) is rejected —
+    /// these must route through `LegacyInputAdapter` (PR-5).
+    #[tokio::test]
+    async fn test_legacy_multi_rg_input_rejected() {
+        // Force a 2-RG file by writing two 50-row batches with
+        // `row_group_size = 50`, small enough to trip RG rollover.
+        let batch_a = make_sorted_batch(50, 0);
+        let batch_b = make_sorted_batch(50, 50);
+
+        let cfg = ParquetWriterConfig {
+            compression: Compression::Snappy,
+            row_group_size: 50, // force one RG per 50-row batch
+            ..ParquetWriterConfig::default()
+        };
+        let kvs = vec![
+            KeyValue::new(
+                PARQUET_META_SORT_FIELDS.to_string(),
+                "metric_name|-timestamp_secs/V2".to_string(),
+            ),
+            KeyValue::new(
+                PARQUET_META_WINDOW_START.to_string(),
+                "1700000000".to_string(),
+            ),
+            KeyValue::new(PARQUET_META_WINDOW_DURATION.to_string(), "60".to_string()),
+            KeyValue::new(PARQUET_META_NUM_MERGE_OPS.to_string(), "0".to_string()),
+        ];
+        let props: WriterProperties = cfg.to_writer_properties_with_metadata(
+            &batch_a.schema(),
+            Vec::new(),
+            Some(kvs),
+            &["metric_name".to_string(), "timestamp_secs".to_string()],
+        );
+        let mut buf: Vec<u8> = Vec::new();
+        let mut writer =
+            ArrowWriter::try_new(&mut buf, batch_a.schema(), Some(props)).expect("arrow writer");
+        writer.write(&batch_a).expect("write");
+        writer.write(&batch_b).expect("write");
+        writer.close().expect("close");
+        let bytes = Bytes::from(buf);
+
+        let inputs: Vec<Box<dyn ColumnPageStream>> = vec![open_stream(bytes).await];
+        let tmp = TempDir::new().expect("tmpdir");
+        let err = streaming_merge_sorted_parquet_files(inputs, tmp.path(), &merge_config(1))
+            .await
+            .expect_err("legacy multi-RG input must be rejected");
+        let s = err.to_string();
+        assert!(
+            s.contains("legacy multi-RG") || s.contains("PR-5 adapter"),
+            "expected legacy multi-RG rejection, got: {s}",
+        );
+    }
+
+    // ============================================================================
+    // MC-2 round-trip: every parquet physical type the decoder supports must
+    // survive the streaming merge unchanged.
+    // ============================================================================
+
+    /// Build a batch covering every primitive type the page decoder
+    /// supports, plus byte arrays, dictionary encoding, and list types.
+    /// Each row's `sorted_series` key uniquely identifies the row so
+    /// callers can build a `(key → tuple)` map for output comparison.
+    fn make_typed_round_trip_batch(num_rows: usize, key_offset: u64) -> RecordBatch {
+        use arrow::array::{
+            BooleanArray, Float32Array, Int8Array, Int16Array, Int32Array, LargeBinaryArray,
+            ListArray, UInt16Array, UInt32Array,
+        };
+        use arrow::buffer::OffsetBuffer;
+
+        let dict_type = DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8));
+        // Body cols MUST be in lexicographic order — that's the
+        // storage convention the streaming engine assumes when
+        // iterating columns. Inputs that ship body cols in a
+        // different order trip "column ordering violated" mid-merge.
+        let schema = Arc::new(ArrowSchema::new(vec![
+            // sort cols
+            Field::new("metric_name", dict_type.clone(), false),
+            Field::new("timestamp_secs", DataType::UInt64, false),
+            Field::new("sorted_series", DataType::Binary, false),
+            // body cols in lexicographic order, all nullable except
+            // `body_list_f64`. Every 7th row is null (null-mask path).
+            Field::new("body_bool", DataType::Boolean, true),
+            Field::new("body_dict", dict_type, true),
+            Field::new("body_float32", DataType::Float32, true),
+            Field::new("body_float64", DataType::Float64, true),
+            Field::new("body_int16", DataType::Int16, true),
+            Field::new("body_int32", DataType::Int32, true),
+            Field::new("body_int64", DataType::Int64, true),
+            Field::new("body_int8", DataType::Int8, true),
+            Field::new("body_largebinary", DataType::LargeBinary, true),
+            // `List<Float64>` covers production-shaped histogram bucket
+            // columns. Outer + inner both non-nullable to match the
+            // decoder's PR-6a.2 contract.
+            Field::new(
+                "body_list_f64",
+                DataType::List(Arc::new(Field::new("item", DataType::Float64, false))),
+                false,
+            ),
+            Field::new("body_string", DataType::Utf8, true),
+            Field::new("body_uint16", DataType::UInt16, true),
+            Field::new("body_uint32", DataType::UInt32, true),
+            Field::new("body_uint64", DataType::UInt64, true),
+            Field::new("body_uint8", DataType::UInt8, true),
+        ]));
+
+        let is_null = |i: usize| i.is_multiple_of(7);
+
+        let metric_keys: Vec<i32> = vec![0; num_rows];
+        let metric_values = StringArray::from(vec!["cpu.usage"]);
+        let metric_name: ArrayRef = Arc::new(
+            DictionaryArray::<Int32Type>::try_new(
+                arrow::array::Int32Array::from(metric_keys),
+                Arc::new(metric_values),
+            )
+            .expect("metric dict"),
+        );
+
+        // Timestamps DESC within the run so the input is pre-sorted on
+        // (metric_name ASC, timestamp DESC) per the sort schema.
+        let timestamps: Vec<u64> = (0..num_rows as u64)
+            .map(|i| 1_700_000_000 + (num_rows as u64 - i))
+            .collect();
+        let timestamp_secs: ArrayRef = Arc::new(UInt64Array::from(timestamps));
+
+        let key_bytes: Vec<Vec<u8>> = (0..num_rows as u64)
+            .map(|i| (key_offset + i).to_be_bytes().to_vec())
+            .collect();
+        let sorted_series: ArrayRef = Arc::new(BinaryArray::from(
+            key_bytes.iter().map(|v| v.as_slice()).collect::<Vec<_>>(),
+        ));
+
+        // Primitive value generators chosen to spread values across each
+        // type's range, including negatives for the signed types.
+        let mk_opt = |i: usize, v: i64| if is_null(i) { None } else { Some(v) };
+        let body_int8: ArrayRef = Arc::new(Int8Array::from(
+            (0..num_rows)
+                .map(|i| mk_opt(i, (i as i64 % 251) - 125).map(|v| v as i8))
+                .collect::<Vec<_>>(),
+        ));
+        let body_uint8: ArrayRef = Arc::new(UInt8Array::from(
+            (0..num_rows)
+                .map(|i| {
+                    if is_null(i) {
+                        None
+                    } else {
+                        Some((i % 255) as u8)
+                    }
+                })
+                .collect::<Vec<_>>(),
+        ));
+        let body_int16: ArrayRef = Arc::new(Int16Array::from(
+            (0..num_rows)
+                .map(|i| mk_opt(i, (i as i64 % 30001) - 15000).map(|v| v as i16))
+                .collect::<Vec<_>>(),
+        ));
+        let body_uint16: ArrayRef = Arc::new(UInt16Array::from(
+            (0..num_rows)
+                .map(|i| {
+                    if is_null(i) {
+                        None
+                    } else {
+                        Some((i % 60000) as u16)
+                    }
+                })
+                .collect::<Vec<_>>(),
+        ));
+        let body_int32: ArrayRef = Arc::new(Int32Array::from(
+            (0..num_rows)
+                .map(|i| {
+                    mk_opt(
+                        i,
+                        i as i64 * 0x100_0000i64 - i64::from(i32::MIN.unsigned_abs() / 2),
+                    )
+                    .map(|v| v as i32)
+                })
+                .collect::<Vec<_>>(),
+        ));
+        let body_uint32: ArrayRef = Arc::new(UInt32Array::from(
+            (0..num_rows)
+                .map(|i| {
+                    if is_null(i) {
+                        None
+                    } else {
+                        Some((i as u32).wrapping_mul(0xDEAD_BEEF))
+                    }
+                })
+                .collect::<Vec<_>>(),
+        ));
+        let body_int64: ArrayRef = Arc::new(Int64Array::from(
+            (0..num_rows)
+                .map(|i| {
+                    if is_null(i) {
+                        None
+                    } else {
+                        Some((i as i64).wrapping_mul(0x0123_4567_89AB_CDEF))
+                    }
+                })
+                .collect::<Vec<_>>(),
+        ));
+        let body_uint64: ArrayRef = Arc::new(UInt64Array::from(
+            (0..num_rows)
+                .map(|i| {
+                    if is_null(i) {
+                        None
+                    } else {
+                        Some((i as u64).wrapping_mul(0xFEDC_BA98_7654_3210))
+                    }
+                })
+                .collect::<Vec<_>>(),
+        ));
+        let body_float32: ArrayRef = Arc::new(Float32Array::from(
+            (0..num_rows)
+                .map(|i| {
+                    if is_null(i) {
+                        None
+                    } else {
+                        Some(i as f32 * 0.25 - 100.0)
+                    }
+                })
+                .collect::<Vec<_>>(),
+        ));
+        let body_float64: ArrayRef = Arc::new(Float64Array::from(
+            (0..num_rows)
+                .map(|i| {
+                    if is_null(i) {
+                        None
+                    } else {
+                        Some(i as f64 * 0.5 - 1e6)
+                    }
+                })
+                .collect::<Vec<_>>(),
+        ));
+        let body_bool: ArrayRef = Arc::new(BooleanArray::from(
+            (0..num_rows)
+                .map(|i| if is_null(i) { None } else { Some(i % 3 == 0) })
+                .collect::<Vec<_>>(),
+        ));
+
+        let body_string_vals: Vec<Option<String>> = (0..num_rows)
+            .map(|i| {
+                if is_null(i) {
+                    None
+                } else {
+                    Some(format!("row-{i:08}-payload"))
+                }
+            })
+            .collect();
+        let body_string: ArrayRef = Arc::new(StringArray::from(body_string_vals));
+
+        let body_largebinary_vals: Vec<Option<Vec<u8>>> = (0..num_rows)
+            .map(|i| {
+                if is_null(i) {
+                    None
+                } else {
+                    Some(vec![(i & 0xFF) as u8; 1 + (i % 5)])
+                }
+            })
+            .collect();
+        let body_largebinary: ArrayRef = Arc::new(LargeBinaryArray::from_opt_vec(
+            body_largebinary_vals
+                .iter()
+                .map(|opt| opt.as_deref())
+                .collect(),
+        ));
+
+        // Dict body col cycles through a small set so the dict-encoding
+        // path is exercised end-to-end.
+        let dict_pool = ["api", "db", "cache", "auth", "billing"];
+        let dict_keys: Vec<Option<i32>> = (0..num_rows as i32)
+            .map(|i| {
+                if is_null(i as usize) {
+                    None
+                } else {
+                    Some(i % (dict_pool.len() as i32))
+                }
+            })
+            .collect();
+        let body_dict: ArrayRef = Arc::new(
+            DictionaryArray::<Int32Type>::try_new(
+                arrow::array::Int32Array::from(dict_keys),
+                Arc::new(StringArray::from(dict_pool.to_vec())),
+            )
+            .expect("body_dict"),
+        );
+
+        // List<f64>: row i has a list of length (i % 4) with f64
+        // values derived from i and j. Outer + inner non-null (the
+        // decoder's PR-6a.2 list path requires both). Empty lists are
+        // still exercised on rows where (i % 4) == 0.
+        let mut list_offsets: Vec<i32> = Vec::with_capacity(num_rows + 1);
+        let mut list_values: Vec<f64> = Vec::new();
+        list_offsets.push(0);
+        for i in 0..num_rows {
+            for j in 0..(i % 4) {
+                list_values.push((i * 10 + j) as f64 + 0.125);
+            }
+            list_offsets.push(list_values.len() as i32);
+        }
+        let list_inner_field = Arc::new(Field::new("item", DataType::Float64, false));
+        let list_inner: ArrayRef = Arc::new(Float64Array::from(list_values));
+        let body_list_f64: ArrayRef = Arc::new(ListArray::new(
+            list_inner_field,
+            OffsetBuffer::new(arrow::buffer::ScalarBuffer::from(list_offsets)),
+            list_inner,
+            None,
+        ));
+
+        RecordBatch::try_new(
+            schema,
+            vec![
+                metric_name,
+                timestamp_secs,
+                sorted_series,
+                body_bool,
+                body_dict,
+                body_float32,
+                body_float64,
+                body_int16,
+                body_int32,
+                body_int64,
+                body_int8,
+                body_largebinary,
+                body_list_f64,
+                body_string,
+                body_uint16,
+                body_uint32,
+                body_uint64,
+                body_uint8,
+            ],
+        )
+        .expect("typed round-trip batch")
+    }
+
+    /// Render a single row's body-col cell to a comparable string.
+    /// Used by the round-trip test to compare logical values across
+    /// the merge. MC-2 (row contents preservation) is about logical
+    /// values, not internal storage layout: `Dictionary<i32, Utf8>`,
+    /// `Utf8`, and `LargeUtf8` carrying the same string are the same
+    /// row content; similarly for `Binary` / `LargeBinary`. The
+    /// streaming engine normalizes string-flavoured types to `Utf8`
+    /// via `normalize_type` in the union schema, and parquet has only
+    /// one byte-array physical type so `LargeBinary` round-trips as
+    /// `Binary`. Both transformations are storage-encoding changes,
+    /// not value changes — the comparison must see them as equal.
+    fn render_cell(arr: &dyn arrow::array::Array, row: usize) -> String {
+        use arrow::array::AsArray;
+        use arrow::datatypes::Int32Type as DictKeyInt32;
+
+        if !arr.is_valid(row) {
+            return "<null>".to_string();
+        }
+        match arr.data_type() {
+            DataType::Int8 => format!(
+                "i8:{}",
+                arr.as_primitive::<arrow::datatypes::Int8Type>().value(row)
+            ),
+            DataType::Int16 => format!(
+                "i16:{}",
+                arr.as_primitive::<arrow::datatypes::Int16Type>().value(row)
+            ),
+            DataType::Int32 => format!(
+                "i32:{}",
+                arr.as_primitive::<arrow::datatypes::Int32Type>().value(row)
+            ),
+            DataType::Int64 => format!(
+                "i64:{}",
+                arr.as_primitive::<arrow::datatypes::Int64Type>().value(row)
+            ),
+            DataType::UInt8 => format!(
+                "u8:{}",
+                arr.as_primitive::<arrow::datatypes::UInt8Type>().value(row)
+            ),
+            DataType::UInt16 => format!(
+                "u16:{}",
+                arr.as_primitive::<arrow::datatypes::UInt16Type>()
+                    .value(row)
+            ),
+            DataType::UInt32 => format!(
+                "u32:{}",
+                arr.as_primitive::<arrow::datatypes::UInt32Type>()
+                    .value(row)
+            ),
+            DataType::UInt64 => format!(
+                "u64:{}",
+                arr.as_primitive::<arrow::datatypes::UInt64Type>()
+                    .value(row)
+            ),
+            DataType::Float32 => {
+                format!(
+                    "f32:{:#x}",
+                    arr.as_primitive::<arrow::datatypes::Float32Type>()
+                        .value(row)
+                        .to_bits()
+                )
+            }
+            DataType::Float64 => {
+                format!(
+                    "f64:{:#x}",
+                    arr.as_primitive::<arrow::datatypes::Float64Type>()
+                        .value(row)
+                        .to_bits()
+                )
+            }
+            DataType::Boolean => format!("bool:{}", arr.as_boolean().value(row)),
+            // String flavours collapse to one rendering — Dict<i32,
+            // Utf8>, Utf8, LargeUtf8 are interchangeable by value.
+            DataType::Utf8 => format!("str:{}", arr.as_string::<i32>().value(row)),
+            DataType::LargeUtf8 => format!("str:{}", arr.as_string::<i64>().value(row)),
+            DataType::Dictionary(_, _) => {
+                let d = arr.as_dictionary::<DictKeyInt32>();
+                let key = d.keys().value(row);
+                let values = d
+                    .values()
+                    .as_any()
+                    .downcast_ref::<StringArray>()
+                    .expect("dict values Utf8");
+                format!("str:{}", values.value(key as usize))
+            }
+            // Byte arrays collapse similarly — parquet has only one
+            // BYTE_ARRAY physical type.
+            DataType::Binary => format!("bin:{:?}", arr.as_binary::<i32>().value(row)),
+            DataType::LargeBinary => format!("bin:{:?}", arr.as_binary::<i64>().value(row)),
+            DataType::List(_) => {
+                let list = arr.as_list::<i32>();
+                let inner = list.value(row);
+                let inner_f64 = inner.as_primitive::<arrow::datatypes::Float64Type>();
+                let cells: Vec<String> = (0..inner_f64.len())
+                    .map(|j| {
+                        if inner_f64.is_valid(j) {
+                            format!("{:#x}", inner_f64.value(j).to_bits())
+                        } else {
+                            "null".to_string()
+                        }
+                    })
+                    .collect();
+                format!("list_f64:[{}]", cells.join(","))
+            }
+            DataType::LargeList(_) => {
+                let list = arr.as_list::<i64>();
+                let inner = list.value(row);
+                let inner_f64 = inner.as_primitive::<arrow::datatypes::Float64Type>();
+                let cells: Vec<String> = (0..inner_f64.len())
+                    .map(|j| {
+                        if inner_f64.is_valid(j) {
+                            format!("{:#x}", inner_f64.value(j).to_bits())
+                        } else {
+                            "null".to_string()
+                        }
+                    })
+                    .collect();
+                format!("ll_f64:[{}]", cells.join(","))
+            }
+            other => panic!("unexpected body-col data type in round-trip: {other:?}"),
+        }
+    }
+
+    /// MC-2: row contents do not change during compaction. Build two
+    /// inputs that together cover every parquet physical type the
+    /// decoder supports, merge them via the streaming engine, then
+    /// build a `(sorted_series_key → rendered tuple)` map from both
+    /// inputs and from the output. The two maps must be byte-equal —
+    /// no row added, removed, or mutated. Catches silent
+    /// type-dispatch bugs, the class of the recent `List<Float64>`
+    /// flattening regression.
+    #[tokio::test]
+    async fn test_mc2_all_types_round_trip_through_streaming_merge() {
+        let batch_a = make_typed_round_trip_batch(120, 0);
+        let batch_b = make_typed_round_trip_batch(120, 10_000);
+        let bytes_a = write_input_parquet(std::slice::from_ref(&batch_a), &[]);
+        let bytes_b = write_input_parquet(std::slice::from_ref(&batch_b), &[]);
+
+        let inputs: Vec<Box<dyn ColumnPageStream>> =
+            vec![open_stream(bytes_a).await, open_stream(bytes_b).await];
+
+        let tmp = TempDir::new().expect("tmpdir");
+        let outputs = streaming_merge_sorted_parquet_files(inputs, tmp.path(), &merge_config(1))
+            .await
+            .expect("merge");
+        assert_eq!(outputs.len(), 1);
+        assert_eq!(outputs[0].num_rows, 240);
+
+        // Build the expected (sorted_series → rendered tuple) map from
+        // both inputs. Body cols are everything past `sorted_series`.
+        let mut expected: HashMap<Vec<u8>, String> = HashMap::new();
+        for batch in [&batch_a, &batch_b] {
+            let series_idx = batch.schema().index_of("sorted_series").expect("series");
+            let series_col = batch
+                .column(series_idx)
+                .as_any()
+                .downcast_ref::<BinaryArray>()
+                .expect("series Binary");
+            let body_indices: Vec<usize> =
+                (series_idx + 1..batch.schema().fields().len()).collect();
+            for row in 0..batch.num_rows() {
+                let mut tuple = String::new();
+                for (n, &col_idx) in body_indices.iter().enumerate() {
+                    if n > 0 {
+                        tuple.push('|');
+                    }
+                    tuple.push_str(&render_cell(batch.column(col_idx).as_ref(), row));
+                }
+                let key = series_col.value(row).to_vec();
+                let prior = expected.insert(key.clone(), tuple);
+                assert!(
+                    prior.is_none(),
+                    "input batches share a sorted_series key {key:?} — fixture bug",
+                );
+            }
+        }
+        assert_eq!(expected.len(), 240);
+
+        // Build the observed map from the merged output. Note arrow
+        // type coercions: Utf8 may come back as Dictionary because of
+        // per-output schema optimisation, and Int32-keyed Dict may
+        // come back with a different key width. Render both sides to
+        // logical-value strings via the same `render_cell` helper so
+        // the comparison is insensitive to storage type.
+        let merged = read_output_to_record_batch(&outputs[0].path);
+        let merged_schema = merged.schema();
+        let series_idx = merged_schema.index_of("sorted_series").expect("series");
+        let series_col = merged
+            .column(series_idx)
+            .as_any()
+            .downcast_ref::<BinaryArray>()
+            .expect("merged series Binary");
+
+        // Map each body-col name in the inputs to its column index in
+        // the merged output (positions can shift if output schema
+        // dropped an all-null sort col).
+        let input_body_cols: Vec<String> = batch_a
+            .schema()
+            .fields()
+            .iter()
+            .skip(batch_a.schema().index_of("sorted_series").unwrap() + 1)
+            .map(|f| f.name().clone())
+            .collect();
+        let merged_body_indices: Vec<usize> = input_body_cols
+            .iter()
+            .map(|name| {
+                merged_schema.index_of(name).unwrap_or_else(|_| {
+                    panic!(
+                        "merged output is missing body col '{name}' — MC-4 column union violated"
+                    )
+                })
+            })
+            .collect();
+
+        let mut observed: HashMap<Vec<u8>, String> = HashMap::with_capacity(merged.num_rows());
+        for row in 0..merged.num_rows() {
+            let mut tuple = String::new();
+            for (n, &col_idx) in merged_body_indices.iter().enumerate() {
+                if n > 0 {
+                    tuple.push('|');
+                }
+                tuple.push_str(&render_cell(merged.column(col_idx).as_ref(), row));
+            }
+            let key = series_col.value(row).to_vec();
+            let prior = observed.insert(key.clone(), tuple);
+            assert!(
+                prior.is_none(),
+                "merged output has duplicate sorted_series key {key:?} — MC-1 violated",
+            );
+        }
+
+        // MC-1: same set of keys.
+        assert_eq!(
+            observed.len(),
+            expected.len(),
+            "row count mismatch input vs output",
+        );
+        for (key, want) in &expected {
+            let got = observed.get(key).unwrap_or_else(|| {
+                panic!(
+                    "merged output is missing input key {:?} (first body cell expected: {})",
+                    key,
+                    want.split('|').next().unwrap_or("?")
+                )
+            });
+            assert_eq!(
+                got, want,
+                "body-col tuple mismatch for sorted_series {key:?}: got {got}, want {want}",
+            );
+        }
+    }
+
+    /// Build two batches with the SAME sort schema but **different body-
+    /// col schemas**, exercising the merger's schema-evolution paths:
+    ///
+    /// - `body_string`: `Utf8` in A, `Dict<i32, Utf8>` in B (string-flavour variation —
+    ///   `normalize_type` collapses to Utf8).
+    /// - `body_bytes`: `LargeBinary` in A, `Binary` in B (byte-array-flavour variation — F6
+    ///   normalizer extension collapses to Binary).
+    /// - `body_list`: `List<Float64>` present in A, ABSENT from B (MC-4 column union — B's rows
+    ///   appear as nulls in the merged output; outer becomes nullable in the union, exercising the
+    ///   F13 nullable-outer list write path).
+    /// - `body_a_only`: `Int32` in A only (MC-4 column union — B's rows null).
+    /// - `body_b_only`: `Int64` in B only (MC-4 column union — A's rows null).
+    /// - `body_value`: `Float64` in both (common-typed control column).
+    ///
+    /// Each row is keyed by a unique `sorted_series` value, and the cell
+    /// values are derived from the key + column name so that comparison
+    /// is byte-stable.
+    fn make_mixed_schema_input_a(num_rows: usize, key_offset: u64) -> RecordBatch {
+        use arrow::array::{Int32Array, LargeBinaryArray, ListArray};
+        use arrow::buffer::OffsetBuffer;
+
+        let dict_type = DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8));
+        let schema = Arc::new(ArrowSchema::new(vec![
+            // sort cols
+            Field::new("metric_name", dict_type, false),
+            Field::new("timestamp_secs", DataType::UInt64, false),
+            Field::new("sorted_series", DataType::Binary, false),
+            // body cols in lex order. body_a_only is A-exclusive.
+            // body_list is non-nullable here (A always has lists) but
+            // becomes nullable in the union because B lacks the col.
+            Field::new("body_a_only", DataType::Int32, true),
+            Field::new("body_bytes", DataType::LargeBinary, true),
+            Field::new(
+                "body_list",
+                DataType::List(Arc::new(Field::new("item", DataType::Float64, false))),
+                false,
+            ),
+            Field::new("body_string", DataType::Utf8, true),
+            Field::new("body_value", DataType::Float64, true),
+        ]));
+
+        let metric_keys: Vec<i32> = vec![0; num_rows];
+        let metric_values = StringArray::from(vec!["cpu.usage"]);
+        let metric_name: ArrayRef = Arc::new(
+            DictionaryArray::<Int32Type>::try_new(
+                arrow::array::Int32Array::from(metric_keys),
+                Arc::new(metric_values),
+            )
+            .expect("dict"),
+        );
+        let timestamps: Vec<u64> = (0..num_rows as u64)
+            .map(|i| 1_700_000_000 + (num_rows as u64 - i))
+            .collect();
+        let timestamp_secs: ArrayRef = Arc::new(UInt64Array::from(timestamps));
+        let key_bytes: Vec<Vec<u8>> = (0..num_rows as u64)
+            .map(|i| (key_offset + i).to_be_bytes().to_vec())
+            .collect();
+        let sorted_series: ArrayRef = Arc::new(BinaryArray::from(
+            key_bytes.iter().map(|v| v.as_slice()).collect::<Vec<_>>(),
+        ));
+
+        let body_a_only: ArrayRef = Arc::new(Int32Array::from(
+            (0..num_rows).map(|i| i as i32 * 7 - 5).collect::<Vec<_>>(),
+        ));
+        let body_bytes_vals: Vec<Vec<u8>> = (0..num_rows)
+            .map(|i| format!("a-bytes-{i:04}").into_bytes())
+            .collect();
+        let body_bytes: ArrayRef = Arc::new(LargeBinaryArray::from(
+            body_bytes_vals
+                .iter()
+                .map(|v| Some(v.as_slice()))
+                .collect::<Vec<_>>(),
+        ));
+        let body_string_vals: Vec<String> =
+            (0..num_rows).map(|i| format!("a-str-{i:04}")).collect();
+        let body_string: ArrayRef = Arc::new(StringArray::from(body_string_vals));
+        let body_value: ArrayRef = Arc::new(Float64Array::from(
+            (0..num_rows).map(|i| i as f64 + 0.5).collect::<Vec<_>>(),
+        ));
+
+        // List<Float64> body col: row i has a list of length (i % 3) + 1.
+        let mut list_offsets: Vec<i32> = Vec::with_capacity(num_rows + 1);
+        let mut list_values: Vec<f64> = Vec::new();
+        list_offsets.push(0);
+        for i in 0..num_rows {
+            for j in 0..((i % 3) + 1) {
+                list_values.push((i * 10 + j) as f64 + 0.25);
+            }
+            list_offsets.push(list_values.len() as i32);
+        }
+        let list_inner: ArrayRef = Arc::new(Float64Array::from(list_values));
+        let body_list: ArrayRef = Arc::new(ListArray::new(
+            Arc::new(Field::new("item", DataType::Float64, false)),
+            OffsetBuffer::new(arrow::buffer::ScalarBuffer::from(list_offsets)),
+            list_inner,
+            None,
+        ));
+
+        RecordBatch::try_new(
+            schema,
+            vec![
+                metric_name,
+                timestamp_secs,
+                sorted_series,
+                body_a_only,
+                body_bytes,
+                body_list,
+                body_string,
+                body_value,
+            ],
+        )
+        .expect("input A batch")
+    }
+
+    fn make_mixed_schema_input_b(num_rows: usize, key_offset: u64) -> RecordBatch {
+        use arrow::array::{BinaryArray as BinArr, Int64Array as I64};
+
+        let dict_type = DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8));
+        let schema = Arc::new(ArrowSchema::new(vec![
+            // Same sort cols.
+            Field::new("metric_name", dict_type.clone(), false),
+            Field::new("timestamp_secs", DataType::UInt64, false),
+            Field::new("sorted_series", DataType::Binary, false),
+            // Body cols in lex order. body_string is Dict (flavour
+            // change from A's Utf8). body_bytes is Binary (flavour
+            // change from A's LargeBinary). No body_list. Adds
+            // body_b_only.
+            Field::new("body_b_only", DataType::Int64, true),
+            Field::new("body_bytes", DataType::Binary, true),
+            Field::new("body_string", dict_type.clone(), true),
+            Field::new("body_value", DataType::Float64, true),
+        ]));
+
+        let metric_keys: Vec<i32> = vec![0; num_rows];
+        let metric_values = StringArray::from(vec!["cpu.usage"]);
+        let metric_name: ArrayRef = Arc::new(
+            DictionaryArray::<Int32Type>::try_new(
+                arrow::array::Int32Array::from(metric_keys),
+                Arc::new(metric_values),
+            )
+            .expect("dict"),
+        );
+        let timestamps: Vec<u64> = (0..num_rows as u64)
+            .map(|i| 1_700_000_000 + (num_rows as u64 - i))
+            .collect();
+        let timestamp_secs: ArrayRef = Arc::new(UInt64Array::from(timestamps));
+        let key_bytes: Vec<Vec<u8>> = (0..num_rows as u64)
+            .map(|i| (key_offset + i).to_be_bytes().to_vec())
+            .collect();
+        let sorted_series: ArrayRef = Arc::new(BinaryArray::from(
+            key_bytes.iter().map(|v| v.as_slice()).collect::<Vec<_>>(),
+        ));
+
+        let body_b_only: ArrayRef = Arc::new(I64::from(
+            (0..num_rows)
+                .map(|i| (i as i64) * 1_000_003 + 17)
+                .collect::<Vec<_>>(),
+        ));
+        let body_bytes_vals: Vec<Vec<u8>> = (0..num_rows)
+            .map(|i| format!("b-bytes-{i:04}").into_bytes())
+            .collect();
+        let body_bytes: ArrayRef = Arc::new(BinArr::from(
+            body_bytes_vals
+                .iter()
+                .map(|v| Some(v.as_slice()))
+                .collect::<Vec<_>>(),
+        ));
+
+        // Dict-encoded body_string.
+        let body_string_pool: Vec<String> =
+            (0..num_rows).map(|i| format!("b-str-{i:04}")).collect();
+        let body_string_keys: Vec<i32> = (0..num_rows as i32).collect();
+        let body_string: ArrayRef = Arc::new(
+            DictionaryArray::<Int32Type>::try_new(
+                arrow::array::Int32Array::from(body_string_keys),
+                Arc::new(StringArray::from(body_string_pool)),
+            )
+            .expect("dict"),
+        );
+
+        let body_value: ArrayRef = Arc::new(Float64Array::from(
+            (0..num_rows)
+                .map(|i| (i as f64) * -1.25 + 7.0)
+                .collect::<Vec<_>>(),
+        ));
+
+        RecordBatch::try_new(
+            schema,
+            vec![
+                metric_name,
+                timestamp_secs,
+                sorted_series,
+                body_b_only,
+                body_bytes,
+                body_string,
+                body_value,
+            ],
+        )
+        .expect("input B batch")
+    }
+
+    /// F6 — MC-2/MC-4 across heterogeneous body schemas. Inputs A and B
+    /// share the sort schema but differ in:
+    /// - body_string flavour (Utf8 vs Dict<i32,Utf8>)
+    /// - body_bytes flavour (LargeBinary vs Binary) — relies on the F6 `normalize_type` extension
+    /// - body_list present in A, missing in B
+    /// - body_a_only / body_b_only present in only one input
+    ///
+    /// After merging, the output must contain the union of body cols
+    /// (MC-4) with B's rows null in body_list / body_a_only and A's
+    /// rows null in body_b_only. Shared-col values must survive (MC-2)
+    /// despite the flavour differences.
+    #[tokio::test]
+    async fn test_mc2_mixed_schemas_round_trip() {
+        let batch_a = make_mixed_schema_input_a(40, 0);
+        let batch_b = make_mixed_schema_input_b(40, 10_000);
+        let bytes_a = write_input_parquet(std::slice::from_ref(&batch_a), &[]);
+        let bytes_b = write_input_parquet(std::slice::from_ref(&batch_b), &[]);
+
+        let inputs: Vec<Box<dyn ColumnPageStream>> =
+            vec![open_stream(bytes_a).await, open_stream(bytes_b).await];
+
+        let tmp = TempDir::new().expect("tmpdir");
+        let outputs = streaming_merge_sorted_parquet_files(inputs, tmp.path(), &merge_config(1))
+            .await
+            .expect("merge with mixed schemas");
+        assert_eq!(outputs.len(), 1);
+        assert_eq!(
+            outputs[0].num_rows, 80,
+            "rows preserved across schema-evolved inputs (MC-1)",
+        );
+
+        let merged = read_output_to_record_batch(&outputs[0].path);
+        let merged_schema = merged.schema();
+
+        // MC-4: union of body-col names must appear in the output. The
+        // streaming engine drops all-null sort cols, but body cols
+        // present in any input must survive (even if half the rows
+        // are null).
+        for col_name in [
+            "body_a_only",
+            "body_b_only",
+            "body_bytes",
+            "body_list",
+            "body_string",
+            "body_value",
+        ] {
+            merged_schema.index_of(col_name).unwrap_or_else(|_| {
+                panic!("merged output is missing body col '{col_name}' — MC-4 violated")
+            });
+        }
+
+        // MC-2: build (sorted_series → tuple) maps for each input and
+        // the output, then verify every input row's tuple matches the
+        // output row's tuple. `render_cell` renders every string/byte
+        // flavour the same way, so the type variations don't
+        // generate spurious diffs.
+        let series_idx_a = batch_a.schema().index_of("sorted_series").unwrap();
+        let series_idx_b = batch_b.schema().index_of("sorted_series").unwrap();
+        let body_col_names_union: Vec<&str> = vec![
+            "body_a_only",
+            "body_b_only",
+            "body_bytes",
+            "body_list",
+            "body_string",
+            "body_value",
+        ];
+
+        let render_row = |batch: &RecordBatch, row: usize| -> String {
+            let mut tuple = String::new();
+            for (n, &name) in body_col_names_union.iter().enumerate() {
+                if n > 0 {
+                    tuple.push('|');
+                }
+                match batch.schema().index_of(name) {
+                    Ok(col_idx) => {
+                        tuple.push_str(&render_cell(batch.column(col_idx).as_ref(), row))
+                    }
+                    // Col absent from this input → null when seen
+                    // in the merged output.
+                    Err(_) => tuple.push_str("<null>"),
+                }
+            }
+            tuple
+        };
+
+        let mut expected: HashMap<Vec<u8>, String> = HashMap::new();
+        let series_col_a = batch_a
+            .column(series_idx_a)
+            .as_any()
+            .downcast_ref::<BinaryArray>()
+            .unwrap();
+        for row in 0..batch_a.num_rows() {
+            let key = series_col_a.value(row).to_vec();
+            expected.insert(key, render_row(&batch_a, row));
+        }
+        let series_col_b = batch_b
+            .column(series_idx_b)
+            .as_any()
+            .downcast_ref::<BinaryArray>()
+            .unwrap();
+        for row in 0..batch_b.num_rows() {
+            let key = series_col_b.value(row).to_vec();
+            expected.insert(key, render_row(&batch_b, row));
+        }
+        assert_eq!(expected.len(), 80);
+
+        let series_idx_out = merged_schema.index_of("sorted_series").unwrap();
+        let series_col_out = merged
+            .column(series_idx_out)
+            .as_any()
+            .downcast_ref::<BinaryArray>()
+            .unwrap();
+        let mut observed: HashMap<Vec<u8>, String> = HashMap::with_capacity(merged.num_rows());
+        for row in 0..merged.num_rows() {
+            let key = series_col_out.value(row).to_vec();
+            observed.insert(key, render_row(&merged, row));
+        }
+        assert_eq!(observed.len(), 80, "MC-1 — no rows lost");
+
+        for (key, want) in &expected {
+            let got = observed
+                .get(key)
+                .unwrap_or_else(|| panic!("missing key {key:?} in merged output"));
+            assert_eq!(
+                got, want,
+                "MC-2 mismatch for sorted_series {key:?}: got {got}, want {want}",
+            );
+        }
+    }
+
+    // ============================================================================
+    // MS-7: page-cache bounded-memory contract. The streaming engine's
+    // raison d'être is that body-col memory stays bounded by a roughly
+    // constant amount (page size × a small factor) regardless of how
+    // big the input column gets. Concretely, the `body_col_page_cache`
+    // length per input must stay ≤ a small constant and never scale
+    // with row count.
+    // ============================================================================
+
+    /// `PEAK_BODY_COL_PAGE_CACHE_LEN` is a process-global atomic, so
+    /// concurrent MS-7 tests would pollute each other's readings;
+    /// `ms7_serial_lock` serializes them. It is imported from the
+    /// module-level helper so existing MS-7 tests in this submodule
+    /// keep their unqualified name; other test modules call
+    /// `super::ms7_serial_lock` (or the full path) directly.
+    use super::ms7_serial_lock;
+
+    /// Build a fixture that forces many input body-col pages with a
+    /// pinned `data_page_row_count_limit`, then merge it through the
+    /// streaming engine and read back the peak cache length. Used by
+    /// the MS-7 test below across multiple sizes.
+    ///
+    /// Caller must hold [`ms7_serial_lock`] across this call.
+    async fn merge_and_observe_peak_page_cache(num_rows: usize, page_rows: usize) -> usize {
+        merge_n_inputs_and_observe_peak_page_cache(1, num_rows, page_rows).await
+    }
+
+    /// Run the same merge through `num_inputs` same-shaped small-pages
+    /// fixtures and return the peak per-input cache length observed.
+    ///
+    /// `PEAK_BODY_COL_PAGE_CACHE_LEN` is set per-input by the body
+    /// assembler — it tracks `body_col_caches_total_len` (sum across
+    /// body cols for a single `InputDecoderState`). The atomic stores
+    /// `max(per-input)`, NOT the sum, so the bound should hold flat
+    /// regardless of `num_inputs`. Each input's body-col cache is
+    /// independent; adding more inputs adds parallel caches but
+    /// doesn't push any one input's cache size up. The MS-7
+    /// per-input invariant must hold in the multi-input case the
+    /// same way it does in single-input.
+    ///
+    /// The fixture writes `num_inputs` distinct files (each with a
+    /// unique sort-key offset so the merge engine genuinely
+    /// interleaves them — using identical files would let the engine
+    /// take shortcuts). Each input has `num_rows_per_input` rows on
+    /// the same single metric_name, so all rows fall in one region.
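+    ///
+    /// A small worked layout of the staggering, assuming
+    /// `make_sorted_batch` assigns keys `key_offset..key_offset + n`
+    /// the way the mixed-schema helpers do (an assumption; that
+    /// helper is defined elsewhere), for `num_inputs = 3` and
+    /// `num_rows_per_input = 4`:
+    ///
+    /// ```text
+    /// input 0: key_offset = 0 * 4 = 0  -> keys 0..=3
+    /// input 1: key_offset = 1 * 4 = 4  -> keys 4..=7
+    /// input 2: key_offset = 2 * 4 = 8  -> keys 8..=11
+    /// ```
+    ///
+    /// The ranges never overlap, so no two inputs contribute an
+    /// equal `sorted_series` key.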
+    async fn merge_n_inputs_and_observe_peak_page_cache(
+        num_inputs: usize,
+        num_rows_per_input: usize,
+        page_rows: usize,
+    ) -> usize {
+        use std::sync::atomic::Ordering;
+
+        PEAK_BODY_COL_PAGE_CACHE_LEN.store(0, Ordering::Relaxed);
+
+        let mut inputs: Vec<Box<dyn ColumnPageStream>> = Vec::with_capacity(num_inputs);
+        for i in 0..num_inputs {
+            // Stagger sort-key offsets so each input has a disjoint
+            // sorted_series range. Without this the merge engine
+            // would compare row-equal keys and the interleaving
+            // pattern wouldn't exercise per-input page caches.
+            let key_offset = i as u64 * (num_rows_per_input as u64);
+            let batch = make_sorted_batch(num_rows_per_input, key_offset);
+            let bytes =
+                write_input_parquet_with_small_pages(std::slice::from_ref(&batch), page_rows);
+            inputs.push(open_stream(bytes).await);
+        }
+
+        let tmp = TempDir::new().expect("tmpdir");
+        let outputs = streaming_merge_sorted_parquet_files(inputs, tmp.path(), &merge_config(1))
+            .await
+            .expect("merge");
+        assert_eq!(outputs.len(), 1);
+        assert_eq!(outputs[0].num_rows, num_rows_per_input * num_inputs);
+
+        PEAK_BODY_COL_PAGE_CACHE_LEN.load(Ordering::Relaxed)
+    }
+
+    /// MS-7: peak `body_col_page_cache` length is bounded by the
+    /// ratio of output-page rows to input-page rows — NOT by input
+    /// column size. Each `assemble_one_output_page` call must read
+    /// enough input pages to cover `OUTPUT_PAGE_ROWS` rows, then
+    /// evict everything below the new cursor. So the peak per advance
+    /// loop is bounded by `ceil(OUTPUT_PAGE_ROWS / page_rows) + small
+    /// slack` regardless of how many output pages we produce.
+    ///
+    /// Without the per-output-page eviction, the peak would scale with
+    /// the total number of input pages, which scales with input size.
+    /// A regression that dropped the eviction loop would make the peak
+    /// track the input page count: the 30 000-row fixture would exceed
+    /// the ceiling and the medium-to-large growth check would fail.
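+    ///
+    /// Worked ceiling for the constants in the test body, assuming
+    /// `OUTPUT_PAGE_ROWS` is 1024 (the value the inline comment
+    /// below uses; the constant itself is defined elsewhere):
+    ///
+    /// ```text
+    /// input pages per output page = ceil(1024 / 50) = 21
+    /// slack                       = 3
+    /// MAX_RESIDENT_PAGES          = 21 + 3 = 24
+    /// ```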
+    #[allow(clippy::await_holding_lock, reason = "see ms7_serial_lock docs")]
+    #[tokio::test]
+    async fn test_ms7_body_col_page_cache_bounded_regardless_of_input_size() {
+        let _guard = ms7_serial_lock();
+        const PAGE_ROWS: usize = 50;
+        // ceil(1024 / 50) = 21 in-flight pages needed for one output
+        // page. Allow 3 slack: decoder lookahead, transient between
+        // push and check, dict-page-as-data-page corner cases.
+        const MAX_RESIDENT_PAGES: usize = 24;
+
+        let peak_small = merge_and_observe_peak_page_cache(300, PAGE_ROWS).await;
+        let peak_medium = merge_and_observe_peak_page_cache(3_000, PAGE_ROWS).await;
+        let peak_large = merge_and_observe_peak_page_cache(30_000, PAGE_ROWS).await;
+
+        // The 300-row fixture has only 6 input pages (300 / 50), so
+        // its peak can't legitimately exceed 6. Allow 1 page of slack
+        // and verify that the assembler doesn't somehow accumulate
+        // ghost entries past the input itself.
+        assert!(
+            peak_small <= 6 + 1,
+            "300-row fixture: peak cache len {peak_small} > 7",
+        );
+        // Medium / large fixtures share the same OUTPUT_PAGE_ROWS /
+        // PAGE_ROWS ratio, so they must share the same peak ceiling.
+        assert!(
+            peak_medium <= MAX_RESIDENT_PAGES,
+            "3 000-row fixture: peak cache len {peak_medium} > {MAX_RESIDENT_PAGES}",
+        );
+        assert!(
+            peak_large <= MAX_RESIDENT_PAGES,
+            "30 000-row fixture: peak cache len {peak_large} > {MAX_RESIDENT_PAGES} — body col \
+             write is no longer page-bounded; likely buffering whole column chunks",
+        );
+
+        // The headline MS-7 claim: peak DOES NOT grow proportionally
+        // with input size. Going from 3 000 to 30 000 rows multiplies
+        // total input pages by 10, but peak resident cache should
+        // stay essentially flat. Allow a 2-page slack for transients.
+        let growth = peak_large.saturating_sub(peak_medium);
+        assert!(
+            growth <= 2,
+            "peak grows with input size: medium={peak_medium}, large={peak_large} — 10× more \
+             input pages produced {growth} more resident pages, body-col path is not page-bounded",
+        );
+    }
+
+    /// MS-7 cross-input precondition: the body-col page cache bound
+    /// holds only if every input emits body cols lexicographically
+    /// (Husky order). A misaligned input would force
+    /// `fill_page_cache_to_row` to drain every body col preceding the
+    /// requested one in storage order — those pages stay cached under
+    /// their own col_idx until that col's turn comes up in the union
+    /// iteration, growing the cache to a full column chunk per
+    /// misaligned col. `assert_inputs_in_husky_body_col_order` rejects
+    /// such inputs at merge entry. This test constructs an input with
+    /// body cols `[value, metric_type]` (alphabetical would be
+    /// `[metric_type, value]`) and verifies the merge bails before
+    /// phase 3 runs.
+    #[tokio::test]
+    async fn test_assert_inputs_in_husky_body_col_order_rejects_misaligned_input() {
+        let dict_type = DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8));
+        let schema = Arc::new(ArrowSchema::new(vec![
+            Field::new("metric_name", dict_type, false),
+            Field::new("timestamp_secs", DataType::UInt64, false),
+            Field::new("sorted_series", DataType::Binary, false),
+            // Body cols out of order — 'value' precedes 'metric_type'
+            // in storage, but alphabetical would be metric_type then
+            // value.
+            Field::new("value", DataType::Float64, false),
+            Field::new("metric_type", DataType::UInt8, false),
+        ]));
+
+        let metric_keys: Vec<i32> = vec![0, 0, 0];
+        let metric_values = StringArray::from(vec!["cpu.usage"]);
+        let metric_name: ArrayRef = Arc::new(
+            DictionaryArray::<Int32Type>::try_new(
+                arrow::array::Int32Array::from(metric_keys),
+                Arc::new(metric_values),
+            )
+            .expect("test dict"),
+        );
+        let timestamp_secs: ArrayRef = Arc::new(UInt64Array::from(vec![
+            1_700_000_003u64,
+            1_700_000_002,
+            1_700_000_001,
+        ]));
+        let series: Vec<Vec<u8>> = (0u64..3).map(|i| i.to_be_bytes().to_vec()).collect();
+        let series_refs: Vec<&[u8]> = series.iter().map(|v| v.as_slice()).collect();
+        let sorted_series: ArrayRef = Arc::new(BinaryArray::from(series_refs));
+        let value: ArrayRef = Arc::new(Float64Array::from(vec![0.0, 1.0, 2.0]));
+        let metric_type: ArrayRef = Arc::new(UInt8Array::from(vec![0u8, 0, 0]));
+
+        let batch = RecordBatch::try_new(
+            schema,
+            vec![
+                metric_name,
+                timestamp_secs,
+                sorted_series,
+                value,
+                metric_type,
+            ],
+        )
+        .expect("test batch");
+        let bytes = write_input_parquet_with_small_pages(std::slice::from_ref(&batch), 100);
+        let inputs: Vec<Box<dyn ColumnPageStream>> = vec![open_stream(bytes).await];
+
+        let tmp = TempDir::new().expect("tmpdir");
+        let err = streaming_merge_sorted_parquet_files(inputs, tmp.path(), &merge_config(1))
+            .await
+            .expect_err("must reject misaligned body cols");
+        let err_str = format!("{err:#}");
+        assert!(
+            err_str.contains("not in Husky") && err_str.contains("'value'"),
+            "expected Husky-order error mentioning 'value', got: {err_str}",
+        );
+    }
+
+    /// MS-7 multi-input dimension: the per-input page-cache bound must
+    /// hold across `num_inputs ∈ {1, 3, 8}` independent inputs being
+    /// merged together. Each input has its own `InputDecoderState` and
+    /// therefore its own per-col page cache; the atomic
+    /// `PEAK_BODY_COL_PAGE_CACHE_LEN` stores
+    /// `max(body_col_caches_total_len)` over all inputs, so the bound
+    /// being checked is per-input, not summed across inputs.
+    ///
+    /// The motivation: production merges combine many splits at once.
+    /// A regression that, say, shared a buffer across input states (so
+    /// pages from N inputs piled into the same cache) would push the
+    /// observed peak to roughly N × the single-input peak. The current
+    /// design keeps caches independent — this test pins that invariant.
+    ///
+    /// Sweep over `num_inputs ∈ {1, 3, 8}` AND
+    /// `rows_per_input ∈ {3_000, 30_000}` so a "peak grows with
+    /// num_inputs" regression and a "peak grows with rows" regression
+    /// would each surface separately.
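+    ///
+    /// Read off the loops in the test body, the sweep runs
+    /// 3 x 2 = 6 merges, followed by two more for the cross-axis
+    /// growth check:
+    ///
+    /// ```text
+    /// num_inputs   rows_per_input
+    /// 1            3 000, 30 000
+    /// 3            3 000, 30 000
+    /// 8            3 000, 30 000
+    /// growth check: (1, 30 000) vs (8, 30 000)
+    /// ```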
+    #[allow(clippy::await_holding_lock, reason = "see ms7_serial_lock docs")]
+    #[tokio::test]
+    async fn test_ms7_per_input_bound_across_num_inputs() {
+        let _guard = ms7_serial_lock();
+        const PAGE_ROWS: usize = 50;
+        // Same constant as the single-input MS-7 test:
+        // ceil(OUTPUT_PAGE_ROWS / PAGE_ROWS) + slack.
+        const MAX_RESIDENT_PAGES_PER_INPUT: usize = 24;
+
+        for &num_inputs in &[1usize, 3, 8] {
+            for &rows_per_input in &[3_000usize, 30_000] {
+                let peak = merge_n_inputs_and_observe_peak_page_cache(
+                    num_inputs,
+                    rows_per_input,
+                    PAGE_ROWS,
+                )
+                .await;
+                assert!(
+                    peak <= MAX_RESIDENT_PAGES_PER_INPUT,
+                    "num_inputs={num_inputs}, rows_per_input={rows_per_input}: per-input peak \
+                     cache len {peak} > {MAX_RESIDENT_PAGES_PER_INPUT} — body-col caches are no \
+                     longer per-input-independent",
+                );
+            }
+        }
+
+        // Cross-axis growth: 1→8 inputs at the same row count must
+        // not push peak up. The atomic tracks max-per-input, so a
+        // value increase between (1 input, 30 000 rows) and (8 inputs,
+        // 30 000 rows) would indicate caches bleeding across inputs.
+        let peak_1in = merge_n_inputs_and_observe_peak_page_cache(1, 30_000, PAGE_ROWS).await;
+        let peak_8in = merge_n_inputs_and_observe_peak_page_cache(8, 30_000, PAGE_ROWS).await;
+        let growth = peak_8in.saturating_sub(peak_1in);
+        assert!(
+            growth <= 2,
+            "peak grows with num_inputs: 1in={peak_1in}, 8in={peak_8in} — caches likely shared \
+             across inputs",
+        );
+    }
+
+    /// MS-7 sub-region dimension: when `prefix_len = 0` and
+    /// `num_outputs > 1` trigger `split_region_at_sorted_series`,
+    /// the per-input page cache survives across sub-region boundaries
+    /// within one top-level region (the engine reads col K once and
+    /// re-uses it across sub-regions). The headline MS-7 claim must
+    /// still hold: peak DOES NOT scale with input row count.
+    ///
+    /// This test guards the F14 fix: `process_split_region_col_outer`
+    /// inverts the col/sub-region loop nesting so each parquet col
+    /// chunk is fully consumed from the page stream before the next
+    /// col starts. Before that fix, peak scaled ~linearly with row
+    /// count (3 000 rows → 140 cached pages; 30 000 rows → 1 200
+    /// pages, ~9× growth for 10× rows). With the fix, the per-input
+    /// peak stays bounded by `K × num_body_cols × num_chunks` where
+    /// K is the per-col output-page bound — independent of row
+    /// count.
+    ///
+    /// Fixture: single multi-metric single-RG input (prefix_len = 0)
+    /// with 6 distinct metric_names. `num_outputs = 3` forces the
+    /// engine into the drain-and-align-then-split path. Two row-count
+    /// regimes (small + large): peak at the large size must be within
+    /// a small additive slack of the small size's peak.
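+    ///
+    /// Worked numbers for this fixture at `PAGE_ROWS = 50`, plus a
+    /// check of the pre-fix growth ratio quoted above:
+    ///
+    /// ```text
+    /// small: 6 x   500 =  3 000 rows;  3 000 / 50 =  60 pages per col chunk
+    /// large: 6 x 5 000 = 30 000 rows; 30 000 / 50 = 600 pages per col chunk
+    /// pre-fix peaks: 1 200 / 140 ~= 8.6x growth for 10x the rows
+    /// ```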
+    #[allow(clippy::await_holding_lock, reason = "see ms7_serial_lock docs")]
+    #[tokio::test]
+    async fn test_ms7_per_input_bound_across_sub_regions_does_not_scale_with_rows() {
+        let _guard = ms7_serial_lock();
+        use std::sync::atomic::Ordering;
+
+        const PAGE_ROWS: usize = 50;
+        // Slack covers: page-decoder lookahead, transient between
+        // push and check across body cols, the inter-col handoff
+        // where col K's cache hasn't drained before col K+1's first
+        // page arrives.
+        const SCALING_SLACK_PAGES: usize = 4;
+
+        async fn run(rows_per_metric: usize) -> usize {
+            PEAK_BODY_COL_PAGE_CACHE_LEN.store(0, Ordering::Relaxed);
+            let metrics: Vec<(&str, usize)> = vec![
+                ("aaa.alpha", rows_per_metric),
+                ("bbb.beta", rows_per_metric),
+                ("ccc.gamma", rows_per_metric),
+                ("ddd.delta", rows_per_metric),
+                ("eee.epsilon", rows_per_metric),
+                ("fff.zeta", rows_per_metric),
+            ];
+            let bytes = make_multi_metric_single_rg_input_with_small_pages(&metrics, PAGE_ROWS);
+            let inputs: Vec<Box<dyn ColumnPageStream>> = vec![open_stream(bytes).await];
+            let tmp = TempDir::new().expect("tmpdir");
+            let outputs =
+                streaming_merge_sorted_parquet_files(inputs, tmp.path(), &merge_config(3))
+                    .await
+                    .expect("merge");
+            assert_eq!(outputs.len(), 3, "sub-region split must honor num_outputs");
+            let total: usize = outputs.iter().map(|o| o.num_rows).sum();
+            assert_eq!(total, rows_per_metric * 6);
+            PEAK_BODY_COL_PAGE_CACHE_LEN.load(Ordering::Relaxed)
+        }
+
+        // Small: 6 × 500 = 3 000 rows = 60 input pages per col chunk.
+        let peak_small = run(500).await;
+        // Large: 6 × 5 000 = 30 000 rows = 600 input pages per col chunk.
+        let peak_large = run(5_000).await;
+
+        // The MS-7 invariant: peak does not grow proportionally with
+        // input row count. 10× more rows must produce roughly the
+        // same peak — any growth beyond `SCALING_SLACK_PAGES` means
+        // the cache is no longer page-bounded in the sub-region
+        // path.
+        let growth = peak_large.saturating_sub(peak_small);
+        assert!(
+            growth <= SCALING_SLACK_PAGES,
+            "sub-region path violates MS-7 scaling: small (3 000 rows)={peak_small}, large (30 \
+             000 rows)={peak_large} — 10× more rows produced {growth} more resident pages, \
+             body-col path is not page-bounded across sub-regions",
+        );
+    }
+
+    /// Like `make_multi_metric_single_rg_input` but writes the file
+    /// with a pinned `data_page_row_count_limit` so the body-col path
+    /// actually spans many parquet pages. Otherwise the default
+    /// `data_page_size` byte threshold dominates and small fixtures
+    /// emit one page per col chunk — no opportunity to observe the
+    /// page-cache bound.
+    fn make_multi_metric_single_rg_input_with_small_pages(
+        metrics: &[(&str, usize)],
+        page_rows: usize,
+    ) -> Bytes {
+        // Build the same arrow batch shape as
+        // `make_multi_metric_single_rg_input`, then route through the
+        // small-pages writer instead of the default-config writer.
+        let dict_type = DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8));
+        let schema = Arc::new(ArrowSchema::new(vec![
+            Field::new("metric_name", dict_type.clone(), false),
+            Field::new("timestamp_secs", DataType::UInt64, false),
+            Field::new("sorted_series", DataType::Binary, false),
+            Field::new("metric_type", DataType::UInt8, false),
+            Field::new("service", dict_type, true),
+            Field::new("timeseries_id", DataType::Int64, false),
+            Field::new("value", DataType::Float64, false),
+        ]));
+
+        let total_rows: usize = metrics.iter().map(|(_, n)| *n).sum();
+        let mut metric_keys: Vec<i32> = Vec::with_capacity(total_rows);
+        let mut metric_values_vec: Vec<&str> = Vec::with_capacity(metrics.len());
+        let mut timestamps: Vec<u64> = Vec::with_capacity(total_rows);
+        let mut series_bytes: Vec<Vec<u8>> = Vec::with_capacity(total_rows);
+        let mut tsids: Vec<i64> = Vec::with_capacity(total_rows);
+        let mut values: Vec<f64> = Vec::with_capacity(total_rows);
+        let mut row_idx: u64 = 0;
+        for (metric_idx, (name, num)) in metrics.iter().enumerate() {
+            metric_values_vec.push(name);
+            for r in 0..*num {
+                metric_keys.push(metric_idx as i32);
+                timestamps.push(1_700_000_000 + ((*num - r) as u64));
+                series_bytes.push(row_idx.to_be_bytes().to_vec());
+                tsids.push(1000 + row_idx as i64);
+                values.push(row_idx as f64);
+                row_idx += 1;
+            }
+        }
+        let metric_names_arr = StringArray::from(metric_values_vec);
+        let metric_name: ArrayRef = Arc::new(
+            DictionaryArray::<Int32Type>::try_new(
+                arrow::array::Int32Array::from(metric_keys),
+                Arc::new(metric_names_arr),
+            )
+            .expect("dict"),
+        );
+        let timestamp_secs: ArrayRef = Arc::new(UInt64Array::from(timestamps));
+        let series_refs: Vec<&[u8]> = series_bytes.iter().map(|v| v.as_slice()).collect();
+        let sorted_series: ArrayRef = Arc::new(BinaryArray::from(series_refs));
+        let metric_type: ArrayRef = Arc::new(UInt8Array::from(vec![0u8; total_rows]));
+        let service: ArrayRef = {
+            let svc_keys: Vec<Option<i32>> = (0..total_rows as i32).map(|i| Some(i % 3)).collect();
+            let svc_values = StringArray::from(vec!["api", "db", "cache"]);
+            Arc::new(
+                DictionaryArray::<Int32Type>::try_new(
+                    arrow::array::Int32Array::from(svc_keys),
+                    Arc::new(svc_values),
+                )
+                .expect("svc dict"),
+            )
+        };
+        let timeseries_id: ArrayRef = Arc::new(Int64Array::from(tsids));
+        let value: ArrayRef = Arc::new(Float64Array::from(values));
+
+        let batch = RecordBatch::try_new(
+            schema,
+            vec![
+                metric_name,
+                timestamp_secs,
+                sorted_series,
+                metric_type,
+                service,
+                timeseries_id,
+                value,
+            ],
+        )
+        .expect("batch");
+
+        write_input_parquet_with_small_pages(&[batch], page_rows)
+    }
+
+    // ============================================================================
+    // Heterogeneous-output regressions (Codex P2 batch on PR-6409)
+    // ============================================================================
+
+    /// Regression for "Preserve service tags when service is a sort
+    /// column". If the sort schema places `service` in the sort key
+    /// (e.g., `metric_name|service|...`), the streaming engine writes
+    /// it via the sort-col path and the body-col `track_service`
+    /// branch never runs — so `MergeOutputFile.low_cardinality_tags`
+    /// historically came back empty even though every row in the
+    /// output has a service value. The fix folds in service names
+    /// from the per-output sort batch at finalize time.
+    #[tokio::test]
+    async fn test_service_as_sort_column_still_populates_low_cardinality_tags() {
+        let dict_type = DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8));
+        let schema = Arc::new(ArrowSchema::new(vec![
+            // Sort cols in sort schema order: metric_name, service,
+            // timestamp_secs (timestamp comes last per the sort
+            // validator's requirement).
+            Field::new("metric_name", dict_type.clone(), false),
+            Field::new("service", DataType::Utf8, false),
+            Field::new("timestamp_secs", DataType::UInt64, false),
+            Field::new("sorted_series", DataType::Binary, false),
+            Field::new("value", DataType::Float64, false),
+        ]));
+
+        let row_count = 30usize;
+        let metric_keys: Vec<i32> = vec![0; row_count];
+        let metric_values = StringArray::from(vec!["cpu.usage"]);
+        let metric_name: ArrayRef = Arc::new(
+            DictionaryArray::<Int32Type>::try_new(
+                arrow::array::Int32Array::from(metric_keys),
+                Arc::new(metric_values),
+            )
+            .expect("dict"),
+        );
+        // service grouped in ASC order so the input is genuinely
+        // sorted by (metric ASC, service ASC, timestamp DESC) and the
+        // engine's MC-3 sort verifier accepts it.
+        let mut services_sorted: Vec<&str> = Vec::with_capacity(row_count);
+        for s in ["api", "cache", "db"] {
+            for _ in 0..(row_count / 3) {
+                services_sorted.push(s);
+            }
+        }
+        let service: ArrayRef = Arc::new(StringArray::from(services_sorted));
+        let timestamps: Vec<u64> = (0..row_count as u64)
+            .map(|i| 1_700_000_000 + (row_count as u64 - (i % 10)))
+            .collect();
+        let timestamp_secs: ArrayRef = Arc::new(UInt64Array::from(timestamps));
+        let series_bytes: Vec<Vec<u8>> = (0..row_count as u64)
+            .map(|i| i.to_be_bytes().to_vec())
+            .collect();
+        let sorted_series: ArrayRef = Arc::new(BinaryArray::from(
+            series_bytes
+                .iter()
+                .map(|v| v.as_slice())
+                .collect::<Vec<_>>(),
+        ));
+        let value: ArrayRef = Arc::new(Float64Array::from(
+            (0..row_count).map(|i| i as f64).collect::<Vec<_>>(),
+        ));
+
+        let batch = RecordBatch::try_new(
+            schema.clone(),
+            vec![metric_name, service, timestamp_secs, sorted_series, value],
+        )
+        .expect("batch");
+
+        let sort_fields_str = "metric_name|service|-timestamp_secs/V2";
+        let cfg = ParquetWriterConfig {
+            compression: Compression::Snappy,
+            ..ParquetWriterConfig::default()
+        };
+        let kvs = vec![
+            KeyValue::new(
+                PARQUET_META_SORT_FIELDS.to_string(),
+                sort_fields_str.to_string(),
+            ),
+            KeyValue::new(
+                PARQUET_META_WINDOW_START.to_string(),
+                "1700000000".to_string(),
+            ),
+            KeyValue::new(PARQUET_META_WINDOW_DURATION.to_string(), "60".to_string()),
+            KeyValue::new(PARQUET_META_NUM_MERGE_OPS.to_string(), "0".to_string()),
+        ];
+        let props: WriterProperties = cfg.to_writer_properties_with_metadata(
+            &schema,
+            Vec::new(),
+            Some(kvs),
+            &[
+                "metric_name".to_string(),
+                "service".to_string(),
+                "timestamp_secs".to_string(),
+            ],
+        );
+        let mut buf: Vec<u8> = Vec::new();
+        let mut writer =
+            ArrowWriter::try_new(&mut buf, schema.clone(), Some(props)).expect("writer");
+        writer.write(&batch).expect("write");
+        writer.close().expect("close");
+
+        let inputs: Vec<Box<dyn ColumnPageStream>> = vec![open_stream(Bytes::from(buf)).await];
+        let tmp = TempDir::new().expect("tmpdir");
+        let outputs = streaming_merge_sorted_parquet_files(inputs, tmp.path(), &merge_config(1))
+            .await
+            .expect("merge");
+        assert_eq!(outputs.len(), 1);
+        let svc_tags = outputs[0].low_cardinality_tags.get(TAG_SERVICE).expect(
+            "MergeOutputFile.low_cardinality_tags must contain TAG_SERVICE even when service is a \
+             sort column",
+        );
+        let mut got: Vec<String> = svc_tags.iter().cloned().collect();
+        got.sort();
+        assert_eq!(
+            got,
+            vec!["api".to_string(), "cache".to_string(), "db".to_string()],
+            "service-name set must cover every distinct value in the sort col",
+        );
+    }
+
+    // ============================================================================
+    // (b) Legacy-promotion executor: an end-to-end test through
+    //     `ParquetMergeOperation::promote_legacy` →
+    //     `execute_merge_operation` → streaming engine output.
+    // ============================================================================
+
+    /// Build a minimal `ParquetSplitMetadata` for use by the promotion
+    /// executor. Only the routing-relevant fields are real — the rest
+    /// are placeholders sized to match across inputs so MP-3 holds.
+    fn make_promotion_split_meta(
+        split_id: &str,
+        rg_partition_prefix_len: u32,
+    ) -> crate::split::ParquetSplitMetadata {
+        crate::split::ParquetSplitMetadata::metrics_builder()
+            .split_id(crate::split::ParquetSplitId::new(split_id))
+            .index_uid("test-index:00000000000000000000000001")
+            .partition_id(0)
+            .time_range(crate::split::TimeRange::new(1_700_000_000, 1_700_000_060))
+            .sort_fields("metric_name|-timestamp_secs/V2")
+            .window_start_secs(1_700_000_000)
+            .window_duration_secs(60)
+            .num_merge_ops(0)
+            .rg_partition_prefix_len(rg_partition_prefix_len)
+            .num_rows(0)
+            .size_bytes(0)
+            .build()
+    }
+
+    /// End-to-end legacy promotion: a prefix_len=0 legacy single-RG
+    /// input + a prefix_len=1 aligned multi-RG input, merged via
+    /// `execute_merge_operation` with `target_prefix_len = 1`. The
+    /// legacy input is routed through `LegacyInputAdapter`, the
+    /// aligned one goes direct. The streaming engine sees uniform
+    /// prefix_len=1 inputs and produces aligned multi-RG output that
+    /// passes `assert_unique_rg_prefix_keys`.
+    #[tokio::test]
+    async fn test_promote_legacy_executor_end_to_end() {
+        use crate::merge::execute_merge_operation;
+        use crate::merge::policy::ParquetMergeOperation;
+        use crate::storage::RemoteByteSource;
+
+        // Legacy input: 3 metrics × 20 rows, prefix_len = 0.
+        let legacy_bytes = make_multi_metric_single_rg_input(&[
+            ("cpu.usage", 20),
+            ("memory.used", 20),
+            ("net.bytes", 20),
+        ]);
+        let legacy_meta = make_promotion_split_meta("legacy_001", 0);
+
+        // Aligned input: 2 metrics × 30 rows in 2 RGs, prefix_len = 1.
+        // `make_two_metric_aligned_input` uses metric names "cpu.usage"
+        // and "memory.used", overlapping with the legacy input — the
+        // merge engine must interleave by metric_name across inputs.
+        let aligned_bytes = make_two_metric_aligned_input();
+        let aligned_meta = make_promotion_split_meta("aligned_001", 1);
+
+        let op = ParquetMergeOperation::promote_legacy(vec![legacy_meta, aligned_meta], 1);
+
+        let sources: Vec<Arc<dyn RemoteByteSource>> = vec![
+            Arc::new(InMemorySource {
+                bytes: legacy_bytes,
+            }),
+            Arc::new(InMemorySource {
+                bytes: aligned_bytes,
+            }),
+        ];
+
+        let tmp = TempDir::new().expect("tmpdir");
+        let outputs = execute_merge_operation(&op, sources, tmp.path(), &merge_config(1))
+            .await
+            .expect("promote-legacy merge");
+
+        assert_eq!(outputs.len(), 1, "single output file");
+        assert_eq!(
+            outputs[0].num_rows,
+            60 + 60,
+            "legacy 3×20 + aligned 2×30 rows = 120 total",
+        );
+
+        // The streaming engine sees the legacy input as prefix_len=1
+        // (adapter rewrote the KV), and the aligned input is already
+        // prefix_len=1. So the output should advertise prefix_len=1.
+        assert_eq!(
+            outputs[0].output_rg_partition_prefix_len, 1,
+            "executor must produce an output that carries the promoted prefix_len",
+        );
+
+        // Read the on-disk KV: must match. (CS-1 inside the file.)
+        let bytes_out = std::fs::read(&outputs[0].path).expect("read");
+        let reader = SerializedFileReader::new(Bytes::from(bytes_out)).expect("ser");
+        let prefix_kv = reader
+            .metadata()
+            .file_metadata()
+            .key_value_metadata()
+            .and_then(|kvs| {
+                kvs.iter()
+                    .find(|k| k.key == PARQUET_META_RG_PARTITION_PREFIX_LEN)
+                    .and_then(|k| k.value.clone())
+            });
+        assert_eq!(
+            prefix_kv.as_deref(),
+            Some("1"),
+            "on-disk KV must declare prefix_len = 1 (target)",
+        );
+
+        // Output should have one RG per distinct metric_name. The
+        // union of legacy + aligned metric names is {cpu.usage,
+        // memory.used, net.bytes} → 3 RGs.
+        assert_eq!(
+            reader.metadata().num_row_groups(),
+            3,
+            "output should have one RG per distinct metric_name across inputs",
+        );
+
+        // PA-1 + PA-3: each output RG carries a unique, constant
+        // metric_name. This is the strong chunk-stats check that
+        // motivated F2.
+        assert_unique_rg_prefix_keys(
+            reader.metadata(),
+            "metric_name|-timestamp_secs/V2",
+            1,
+            "test_promote_legacy_executor_end_to_end output",
+        )
+        .expect("promoted output must satisfy PA-1 + PA-3");
+
+        // Metastore record (CS-1): builds successfully with
+        // mixed_prefix_ok = true, and `rg_partition_prefix_len`
+        // matches the on-disk KV.
+        let metastore_meta = crate::merge::metadata_aggregation::merge_parquet_split_metadata(
+            op.splits_as_slice(),
+            &outputs[0],
+            /* mixed_prefix_ok */ true,
+        )
+        .expect("metastore aggregation accepts mixed-prefix inputs in promotion mode");
+        assert_eq!(
+            metastore_meta.rg_partition_prefix_len, 1,
+            "metastore prefix_len must match the on-disk KV (CS-1)",
+        );
+    }
+
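+    /// Negative: if the number of sources does not match the number of
+    /// splits (more or fewer), the executor bails up-front rather than
+    /// producing a partial merge. This test exercises the fewer-sources
+    /// case: one source for two splits.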
+    #[tokio::test]
+    async fn test_executor_mismatched_sources_count_bails() {
+        use crate::merge::execute_merge_operation;
+        use crate::merge::policy::ParquetMergeOperation;
+        use crate::storage::RemoteByteSource;
+
+        let legacy_meta = make_promotion_split_meta("a", 0);
+        let aligned_meta = make_promotion_split_meta("b", 1);
+        let op = ParquetMergeOperation::promote_legacy(vec![legacy_meta, aligned_meta], 1);
+
+        // One source for two splits.
+        let sources: Vec<Arc<dyn RemoteByteSource>> = vec![Arc::new(InMemorySource {
+            bytes: make_multi_metric_single_rg_input(&[("cpu.usage", 10)]),
+        })];
+
+        let tmp = TempDir::new().expect("tmpdir");
+        let err = execute_merge_operation(&op, sources, tmp.path(), &merge_config(1))
+            .await
+            .expect_err("mismatched source/split count must fail");
+        let msg = err.to_string();
+        assert!(
+            msg.contains("sources.len()") && msg.contains("op.splits.len()"),
+            "error should explain the mismatch, got: {msg}",
+        );
+    }
+
+    // ============================================================================
+    // F5 / F7: prefix-aware proptest and production-shape integration test
+    //     (vary the new prefix machinery to surface regressions the
+    //     hand-picked fixtures miss).
+    // ============================================================================
+
+    /// F7 — production-shape integration test for the streaming
+    /// engine. The "compact many already-aligned splits" scenario:
+    ///
+    /// - 5 inputs, each prefix_len=1 multi-RG (one RG per metric_name).
+    /// - 6 distinct metric_names total across all inputs, with overlap (every metric appears in
+    ///   exactly 3 of the 5 inputs).
+    /// - `num_outputs = 4` so the engine has to split by region group while keeping every output
+    ///   prefix-aligned.
+    ///
+    /// Asserts the full "production-shape" invariant bundle:
+    /// - **MC-1**: total rows preserved.
+    /// - **MS-3**: every `MergeOutputFile.num_row_groups` matches its footer.
+    /// - **PA-1 + PA-3**: every output passes `assert_unique_rg_prefix_keys` with prefix_len = 1.
+    /// - **MS-5**: across adjacent output files, sorted_series is monotone non-decreasing — output
+    ///   K's last row's sorted_series ≤ output K+1's first row's sorted_series. A single metric CAN
+    ///   span output boundaries (the engine splits at sorted_series transitions inside an
+    ///   overflowing metric region), so the cross-output invariant is sorted_series monotonicity,
+    ///   not "each metric in one output".
+    /// - **CS-1**: each output's `MergeOutputFile.output_rg_partition_prefix_len` matches the
+    ///   on-disk KV.
+    ///
+    /// This is the corner the original adversarial review flagged as
+    /// "untested production case": multi-input × multi-RG ×
+    /// multi-output × prefix_len > 0. Every piece is tested in
+    /// isolation by hand-picked fixtures elsewhere; this test
+    /// exercises them together.
+    #[tokio::test]
+    async fn test_f7_production_shape_multi_input_multi_rg_multi_output() {
+        // Six-metric pool. Each input picks 3-4 from this pool so
+        // there's overlap (the merge has to combine same-metric RGs
+        // across inputs).
+        let pool = [
+            "aaa.alpha",
+            "bbb.beta",
+            "ccc.gamma",
+            "ddd.delta",
+            "eee.epsilon",
+            "fff.zeta",
+        ];
+        // Each input is a sorted subset of the pool with a specific
+        // row count per metric. Different subsets/lengths so the
+        // inputs aren't identical.
+        let inputs_spec: Vec<Vec<(&str, usize)>> = vec![
+            // input 0: aaa, bbb, ccc, ddd
+            vec![(pool[0], 40), (pool[1], 30), (pool[2], 50), (pool[3], 20)],
+            // input 1: aaa, bbb, eee, fff
+            vec![(pool[0], 25), (pool[1], 35), (pool[4], 45), (pool[5], 30)],
+            // input 2: bbb, ccc, ddd, eee
+            vec![(pool[1], 20), (pool[2], 40), (pool[3], 30), (pool[4], 25)],
+            // input 3: aaa, ccc, fff
+            vec![(pool[0], 35), (pool[2], 30), (pool[5], 40)],
+            // input 4: ddd, eee, fff
+            vec![(pool[3], 25), (pool[4], 35), (pool[5], 20)],
+        ];
+        let total_input_rows: usize = inputs_spec
+            .iter()
+            .flat_map(|i| i.iter().map(|(_, r)| *r))
+            .sum();
+
+        let mut streams: Vec<Box<dyn ColumnPageStream>> = Vec::with_capacity(inputs_spec.len());
+        for spec in &inputs_spec {
+            let bytes = make_prefix_len_one_input(spec);
+            streams.push(open_stream(bytes).await);
+        }
+
+        let tmp = TempDir::new().expect("tmpdir");
+        let outputs = streaming_merge_sorted_parquet_files(streams, tmp.path(), &merge_config(4))
+            .await
+            .expect("production-shape merge");
+
+        // MC-1: rows preserved.
+        let total_output_rows: usize = outputs.iter().map(|o| o.num_rows).sum();
+        assert_eq!(
+            total_output_rows, total_input_rows,
+            "MC-1: input total {total_input_rows} != output total {total_output_rows}",
+        );
+
+        // Per-output checks. Capture each output's first and last
+        // sorted_series bytes for the MS-5 cross-output monotone check.
+        let mut last_ss_per_output: Vec<Vec<u8>> = Vec::with_capacity(outputs.len());
+        let mut first_ss_per_output: Vec<Vec<u8>> = Vec::with_capacity(outputs.len());
+
+        for (out_idx, output) in outputs.iter().enumerate() {
+            let bytes_out = std::fs::read(&output.path).expect("read");
+            let reader = SerializedFileReader::new(Bytes::from(bytes_out)).expect("ser");
+            let metadata = reader.metadata();
+
+            // MS-3: reported num_row_groups matches footer.
+            assert_eq!(
+                output.num_row_groups,
+                metadata.num_row_groups(),
+                "MS-3 violated for output {out_idx}: reported {} vs footer {}",
+                output.num_row_groups,
+                metadata.num_row_groups(),
+            );
+
+            // PA-1 + PA-3 chunk-level: each RG has constant + unique metric_name.
+            assert_unique_rg_prefix_keys(
+                metadata,
+                "metric_name|-timestamp_secs/V2",
+                1,
+                "test_f7_production_shape output",
+            )
+            .unwrap_or_else(|e| panic!("PA-1/PA-3 violated for output {out_idx}: {e}"));
+
+            // CS-1: MergeOutputFile reports the same prefix_len as the on-disk KV.
+            let kv_prefix_len: u32 = metadata
+                .file_metadata()
+                .key_value_metadata()
+                .and_then(|kvs| {
+                    kvs.iter()
+                        .find(|k| k.key == PARQUET_META_RG_PARTITION_PREFIX_LEN)
+                        .and_then(|k| k.value.as_deref())
+                })
+                .map(|s| s.parse().expect("prefix_len KV must parse as u32"))
+                .unwrap_or(0);
+            assert_eq!(
+                output.output_rg_partition_prefix_len, kv_prefix_len,
+                "CS-1 violated for output {out_idx}: MergeOutputFile reports {} vs KV {}",
+                output.output_rg_partition_prefix_len, kv_prefix_len,
+            );
+
+            // MC-3 within-file: sorted_series monotone across the
+            // whole output. Capture first/last sorted_series for the
+            // cross-output MS-5 check below.
+            let merged = read_output_to_record_batch(&output.path);
+            let ss_idx = merged.schema().index_of("sorted_series").expect("series");
+            let ss = merged
+                .column(ss_idx)
+                .as_any()
+                .downcast_ref::<BinaryArray>()
+                .expect("Binary");
+            for i in 0..merged.num_rows().saturating_sub(1) {
+                assert!(
+                    ss.value(i) <= ss.value(i + 1),
+                    "MC-3 within output {out_idx}: sorted_series decreased at row {i}",
+                );
+            }
+            first_ss_per_output.push(ss.value(0).to_vec());
+            last_ss_per_output.push(ss.value(merged.num_rows() - 1).to_vec());
+        }
+
+        // MS-5: across adjacent output files, sorted_series is
+        // monotone non-decreasing. Output K's last row's sorted_series
+        // must be ≤ output K+1's first row's sorted_series. A single
+        // metric CAN span outputs (engine splits at sorted_series
+        // transitions inside an overflowing region), so this is the
+        // cross-output ordering invariant — not "each metric in one
+        // output".
+        for i in 0..outputs.len().saturating_sub(1) {
+            assert!(
+                last_ss_per_output[i].as_slice() <= first_ss_per_output[i + 1].as_slice(),
+                "MS-5 violated: output {i}'s last sorted_series {:?} > output {}'s first \
+                 sorted_series {:?}",
+                last_ss_per_output[i],
+                i + 1,
+                first_ss_per_output[i + 1],
+            );
+        }
+    }
+
+    /// Focused unit test on the minimal proptest failure case — used
+    /// as a fast iteration channel for debugging.
+    #[tokio::test]
+    async fn test_f5_single_input_two_metrics_minimal() {
+        let bytes = make_prefix_len_one_input(&[("aaa.alpha", 10), ("bbb.beta", 11)]);
+        let streams: Vec<Box<dyn ColumnPageStream>> = vec![open_stream(bytes).await];
+        let tmp = TempDir::new().expect("tmpdir");
+        let outputs = streaming_merge_sorted_parquet_files(streams, tmp.path(), &merge_config(1))
+            .await
+            .expect("merge");
+        assert_eq!(outputs.len(), 1);
+        assert_eq!(outputs[0].num_rows, 21);
+
+        let merged = read_output_to_record_batch(&outputs[0].path);
+        let ss_idx = merged.schema().index_of("sorted_series").expect("series");
+        let ts_idx = merged
+            .schema()
+            .index_of("timestamp_secs")
+            .expect("timestamp");
+        let metric_idx = merged.schema().index_of("metric_name").expect("metric");
+        let ss = merged
+            .column(ss_idx)
+            .as_any()
+            .downcast_ref::<BinaryArray>()
+            .expect("Binary");
+        let ts = merged
+            .column(ts_idx)
+            .as_any()
+            .downcast_ref::<UInt64Array>()
+            .expect("UInt64");
+        let metric = merged.column(metric_idx);
+
+        for i in 0..merged.num_rows() {
+            let metric_str = match metric.data_type() {
+                DataType::Utf8 => metric
+                    .as_any()
+                    .downcast_ref::<StringArray>()
+                    .map(|a| a.value(i).to_string()),
+                DataType::Dictionary(_, _) => {
+                    use arrow::array::AsArray;
+                    let dict = metric.as_dictionary::<Int32Type>();
+                    let key = dict.keys().value(i);
+                    dict.values()
+                        .as_any()
+                        .downcast_ref::<StringArray>()
+                        .map(|a| a.value(key as usize).to_string())
+                }
+                _ => None,
+            };
+            eprintln!(
+                "row {i}: metric={metric_str:?} ss={:?} ts={}",
+                ss.value(i),
+                ts.value(i),
+            );
+        }
+
+        for i in 0..merged.num_rows().saturating_sub(1) {
+            assert!(
+                ss.value(i) <= ss.value(i + 1),
+                "sorted_series decreased at row {i}: {:?} > {:?}",
+                ss.value(i),
+                ss.value(i + 1),
+            );
+        }
+    }
+
+    /// Per-metric `sorted_series` base. In production the
+    /// `sorted_series` column is the order-preserving storekey
+    /// encoding of `(sort cols before timeseries_id, timeseries_id)`
+    /// — a concatenation, not a hash. Byte-level ordering matches
+    /// the logical (sort cols ASC, timeseries_id ASC) ordering, so
+    /// rows from different metric_names produce sorted_series
+    /// values from non-overlapping byte ranges: every "aaa.alpha"
+    /// row's sorted_series sorts before every "bbb.beta" row's.
+    /// The proptest fixture mirrors that property by giving each
+    /// metric a numeric base so its (base + row_offset) range
+    /// doesn't overlap the next metric's range.
+    fn proptest_metric_sorted_series_base(metric: &str) -> u64 {
+        match metric {
+            "aaa.alpha" => 0,
+            "bbb.beta" => 1_000_000,
+            "ccc.gamma" => 2_000_000,
+            "ddd.delta" => 3_000_000,
+            "eee.epsilon" => 4_000_000,
+            "fff.zeta" => 5_000_000,
+            other => panic!("unknown metric in proptest pool: {other}"),
+        }
+    }
+
+    /// Build a multi-RG fixture with prefix_len=1: one RG per
+    /// (metric_name, rows_per_rg) entry. Caller guarantees the
+    /// metric names are sorted ascending and unique within this file
+    /// (the streaming engine enforces both — duplicate prefix keys
+    /// are rejected as PA-3 violations, mis-sorted physical RG order
+    /// is rejected as MS-2). `sorted_series` is computed as
+    /// `metric_base + row_offset_within_metric`. Because the storekey
+    /// encoding of production `sorted_series` puts metric_name bytes
+    /// before tsid bytes, different metric_names yield non-overlapping
+    /// sorted_series ranges; this fixture preserves that property via
+    /// the per-metric base table. The same `(metric_name, row_offset)`
+    /// across two inputs gets the SAME sorted_series — that's the
+    /// realistic case where the same series appears in multiple
+    /// splits. Used by the F5 proptest.
+    fn make_prefix_len_one_input(rgs: &[(&str, usize)]) -> Bytes {
+        let dict_type = DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8));
+        let schema = Arc::new(ArrowSchema::new(vec![
+            Field::new("metric_name", dict_type.clone(), false),
+            Field::new("timestamp_secs", DataType::UInt64, false),
+            Field::new("sorted_series", DataType::Binary, false),
+            Field::new("metric_type", DataType::UInt8, false),
+            Field::new("service", dict_type, true),
+            Field::new("timeseries_id", DataType::Int64, false),
+            Field::new("value", DataType::Float64, false),
+        ]));
+
+        let make_batch = |metric: &str, start_series: u64, rows: usize| -> RecordBatch {
+            let metric_keys: Vec<i32> = vec![0; rows];
+            let metric_values = StringArray::from(vec![metric]);
+            let metric_name: ArrayRef = Arc::new(
+                DictionaryArray::<Int32Type>::try_new(
+                    arrow::array::Int32Array::from(metric_keys),
+                    Arc::new(metric_values),
+                )
+                .expect("dict"),
+            );
+            let timestamps: Vec<u64> = (0..rows as u64)
+                .map(|i| 1_700_000_000 + (rows as u64 - i))
+                .collect();
+            let timestamp_secs: ArrayRef = Arc::new(UInt64Array::from(timestamps));
+            let series_bytes: Vec<Vec<u8>> = (0..rows as u64)
+                .map(|i| (start_series + i).to_be_bytes().to_vec())
+                .collect();
+            let sorted_series: ArrayRef = Arc::new(BinaryArray::from(
+                series_bytes
+                    .iter()
+                    .map(|v| v.as_slice())
+                    .collect::<Vec<_>>(),
+            ));
+            let metric_type: ArrayRef = Arc::new(UInt8Array::from(vec![0u8; rows]));
+            let svc_keys: Vec<Option<i32>> = (0..rows as i32).map(|i| Some(i % 3)).collect();
+            let svc_values = StringArray::from(vec!["api", "db", "cache"]);
+            let service: ArrayRef = Arc::new(
+                DictionaryArray::<Int32Type>::try_new(
+                    arrow::array::Int32Array::from(svc_keys),
+                    Arc::new(svc_values),
+                )
+                .expect("svc dict"),
+            );
+            let tsids: Vec<i64> = (0..rows as i64).map(|i| 1000 + i).collect();
+            let timeseries_id: ArrayRef = Arc::new(Int64Array::from(tsids));
+            let values: Vec<f64> = (0..rows).map(|i| i as f64).collect();
+            let value: ArrayRef = Arc::new(Float64Array::from(values));
+
+            RecordBatch::try_new(
+                schema.clone(),
+                vec![
+                    metric_name,
+                    timestamp_secs,
+                    sorted_series,
+                    metric_type,
+                    service,
+                    timeseries_id,
+                    value,
+                ],
+            )
+            .expect("prefix-len-1 batch")
+        };
+
+        // Make every batch land in its own RG by setting row_group_size
+        // larger than any single batch and calling flush() after each
+        // write — ArrowWriter rolls over the open RG on flush.
+        let max_rows = rgs.iter().map(|(_, n)| *n).max().unwrap_or(1);
+        let cfg = ParquetWriterConfig {
+            compression: Compression::Snappy,
+            row_group_size: max_rows.saturating_add(1),
+            ..ParquetWriterConfig::default()
+        };
+        let kvs = vec![
+            KeyValue::new(
+                PARQUET_META_SORT_FIELDS.to_string(),
+                "metric_name|-timestamp_secs/V2".to_string(),
+            ),
+            KeyValue::new(
+                PARQUET_META_WINDOW_START.to_string(),
+                "1700000000".to_string(),
+            ),
+            KeyValue::new(PARQUET_META_WINDOW_DURATION.to_string(), "60".to_string()),
+            KeyValue::new(PARQUET_META_NUM_MERGE_OPS.to_string(), "0".to_string()),
+            KeyValue::new(
+                PARQUET_META_RG_PARTITION_PREFIX_LEN.to_string(),
+                "1".to_string(),
+            ),
+        ];
+        let sorting_cols = vec![
+            parquet::file::metadata::SortingColumn {
+                column_idx: 0,
+                descending: false,
+                nulls_first: false,
+            },
+            parquet::file::metadata::SortingColumn {
+                column_idx: 1,
+                descending: true,
+                nulls_first: false,
+            },
+        ];
+        let props: WriterProperties = cfg.to_writer_properties_with_metadata(
+            &schema,
+            sorting_cols,
+            Some(kvs),
+            &["metric_name".to_string(), "timestamp_secs".to_string()],
+        );
+        let mut buf: Vec<u8> = Vec::new();
+        let mut writer =
+            ArrowWriter::try_new(&mut buf, schema.clone(), Some(props)).expect("arrow writer");
+        for (metric, rows) in rgs {
+            let start_series = proptest_metric_sorted_series_base(metric);
+            let batch = make_batch(metric, start_series, *rows);
+            writer.write(&batch).expect("write");
+            // Force a fresh RG for the next batch so each (metric,
+            // rows) entry maps to exactly one RG.
+            writer.flush().expect("flush rg");
+        }
+        writer.close().expect("close");
+        Bytes::from(buf)
+    }
+
+    /// F5 — prefix-aware proptest over the streaming engine.
+    ///
+    /// Sweeps `(num_inputs, per-input RG specs, num_outputs)` with
+    /// `prefix_len = 1` and asserts the merger's load-bearing
+    /// invariants on every generated case:
+    ///
+    /// - **MC-1**: total row count preserved across inputs → outputs.
+    /// - **MC-3**: each output is sorted on `(sorted_series ASC, timestamp DESC)`.
+    /// - **MS-3**: every output's `MergeOutputFile.num_row_groups` matches the on-disk footer.
+    /// - **PA-1 + PA-3**: every output's row groups pass `assert_unique_rg_prefix_keys`
+    ///   (intra-RG constancy + inter-RG uniqueness).
+    /// - **CS-1**: the metastore-recorded `rg_partition_prefix_len` matches the on-disk KV via
+    ///   the writer-stamped `MergeOutputFile.output_rg_partition_prefix_len`.
+    ///
+    /// Strategy:
+    /// - `num_inputs ∈ 1..=3`.
+    /// - Each input: 1..=4 RGs, each `(metric_name, rows)` where metric names come from a
+    ///   small pool, sorted-and-deduped within each input (so prefix_len=1's MS-2 + PA-3
+    ///   hold).
+    /// - `num_outputs ∈ 1..=3`.
+    ///
+    /// Cases capped at 32 so the proptest completes well under 30s
+    /// (each case opens streams + runs an async merge).
+    use proptest::test_runner::TestCaseError;
+    proptest::proptest! {
+        #![proptest_config(proptest::prelude::ProptestConfig {
+            cases: 32,
+            ..proptest::prelude::ProptestConfig::default()
+        })]
+
+        #[test]
+        fn prop_merge_prefix_aligned_streaming(
+            per_input_specs in proptest::collection::vec(
+                prefix_one_input_strategy(),
+                1usize..=3,
+            ),
+            num_outputs in 1usize..=3,
+        ) {
+            let rt = tokio::runtime::Builder::new_current_thread()
+                .enable_all()
+                .build()
+                .expect("tokio rt");
+            let outcome: std::result::Result<(), TestCaseError> =
+                rt.block_on(async move {
+                    run_prefix_aligned_case(per_input_specs, num_outputs).await
+                });
+            outcome?;
+        }
+    }
+
+    /// Per-input strategy: 1..=4 RGs, each `(metric_name, rows)`, then
+    /// sort-by-metric and dedup-by-metric (in that order, since `dedup_by`
+    /// only drops adjacent duplicates) so the resulting list satisfies
+    /// SS-1 (sorted) and PA-3 (unique prefixes within the input).
+    fn prefix_one_input_strategy() -> impl proptest::strategy::Strategy<Value = Vec<(String, usize)>>
+    {
+        use proptest::prelude::*;
+        let metric_pool: &[&'static str] = &[
+            "aaa.alpha",
+            "bbb.beta",
+            "ccc.gamma",
+            "ddd.delta",
+            "eee.epsilon",
+            "fff.zeta",
+        ];
+        let metric_strat =
+            proptest::sample::select(metric_pool.to_vec()).prop_map(|s| s.to_string());
+        let row_count_strat = 10usize..=80;
+        let entry_strat = (metric_strat, row_count_strat);
+        proptest::collection::vec(entry_strat, 1usize..=4).prop_map(|mut rgs| {
+            rgs.sort_by(|a, b| a.0.cmp(&b.0));
+            rgs.dedup_by(|a, b| a.0 == b.0);
+            rgs
+        })
+    }
+
+    async fn run_prefix_aligned_case(
+        per_input_specs: Vec<Vec<(String, usize)>>,
+        num_outputs: usize,
+    ) -> std::result::Result<(), proptest::test_runner::TestCaseError> {
+        use proptest::prelude::*;
+
+        // Build each input's bytes via the prefix_len=1 fixture.
+        // Sorted_series numbering restarts from the same per-metric
+        // base in every input. Production inputs can legitimately share
+        // sorted_series values across splits (sorted_series is a
+        // (sort cols + timeseries_id) hash, and the same series appears
+        // in multiple splits when the time window spans more than one
+        // ingest interval). The merger must handle this correctly; the
+        // MC-3 check below exercises that invariant.
+        let mut input_bytes: Vec<Bytes> = Vec::with_capacity(per_input_specs.len());
+        let mut total_input_rows: usize = 0;
+        for spec in &per_input_specs {
+            let rgs_borrowed: Vec<(&str, usize)> =
+                spec.iter().map(|(m, n)| (m.as_str(), *n)).collect();
+            total_input_rows += rgs_borrowed.iter().map(|(_, n)| *n).sum::<usize>();
+            input_bytes.push(make_prefix_len_one_input(&rgs_borrowed));
+        }
+
+        let mut streams: Vec<Box<dyn ColumnPageStream>> = Vec::with_capacity(input_bytes.len());
+        for bytes in input_bytes {
+            streams.push(open_stream(bytes).await);
+        }
+
+        let tmp = TempDir::new().expect("tmpdir");
+        let outputs =
+            streaming_merge_sorted_parquet_files(streams, tmp.path(), &merge_config(num_outputs))
+                .await
+                .map_err(|e| {
+                    proptest::test_runner::TestCaseError::fail(format!(
+                        "streaming merge failed: {e}"
+                    ))
+                })?;
+
+        // MC-1: total rows preserved.
+        let output_total: usize = outputs.iter().map(|o| o.num_rows).sum();
+        prop_assert_eq!(
+            output_total,
+            total_input_rows,
+            "MC-1 violated: input total = {}, output total = {}",
+            total_input_rows,
+            output_total
+        );
+
+        for (out_idx, output) in outputs.iter().enumerate() {
+            let bytes_out = std::fs::read(&output.path)
+                .map_err(|e| proptest::test_runner::TestCaseError::fail(e.to_string()))?;
+            let reader = SerializedFileReader::new(Bytes::from(bytes_out))
+                .map_err(|e| proptest::test_runner::TestCaseError::fail(e.to_string()))?;
+            let metadata = reader.metadata();
+
+            // MS-3: MergeOutputFile.num_row_groups == footer's RG count.
+            prop_assert_eq!(
+                output.num_row_groups,
+                metadata.num_row_groups(),
+                "MS-3 violated for output {}: MergeOutputFile.num_row_groups = {}, footer = {}",
+                out_idx,
+                output.num_row_groups,
+                metadata.num_row_groups()
+            );
+
+            // PA-1 + PA-3 chunk-level: each RG has one constant metric_name, unique across RGs.
+            assert_unique_rg_prefix_keys(
+                metadata,
+                "metric_name|-timestamp_secs/V2",
+                1,
+                "prop_merge_prefix_aligned_streaming output",
+            )
+            .map_err(|e| {
+                proptest::test_runner::TestCaseError::fail(format!(
+                    "PA-1/PA-3 violated for output {out_idx}: {e}"
+                ))
+            })?;
+
+            // CS-1: metastore-side prefix_len (output.output_rg_partition_prefix_len)
+            // matches on-disk KV.
+            let kv_prefix_len: u32 = metadata
+                .file_metadata()
+                .key_value_metadata()
+                .and_then(|kvs| {
+                    kvs.iter()
+                        .find(|k| k.key == PARQUET_META_RG_PARTITION_PREFIX_LEN)
+                        .and_then(|k| k.value.as_deref())
+                })
+                .map(|s| s.parse().unwrap_or(0))
+                .unwrap_or(0);
+            prop_assert_eq!(
+                output.output_rg_partition_prefix_len,
+                kv_prefix_len,
+                "CS-1 violated for output {}: MergeOutputFile reports {}, KV reports {}",
+                out_idx,
+                output.output_rg_partition_prefix_len,
+                kv_prefix_len
+            );
+
+            // MC-3: row-by-row sort check on (sorted_series ASC, timestamp DESC within same
+            // series).
+            let merged = read_output_to_record_batch(&output.path);
+            let ss_idx = merged.schema().index_of("sorted_series").expect("series");
+            let ts_idx = merged
+                .schema()
+                .index_of("timestamp_secs")
+                .expect("timestamp");
+            let ss = merged
+                .column(ss_idx)
+                .as_any()
+                .downcast_ref::<BinaryArray>()
+                .expect("Binary");
+            let ts = merged
+                .column(ts_idx)
+                .as_any()
+                .downcast_ref::<UInt64Array>()
+                .expect("UInt64");
+            for i in 0..merged.num_rows().saturating_sub(1) {
+                let a = ss.value(i);
+                let b = ss.value(i + 1);
+                prop_assert!(
+                    a <= b,
+                    "MC-3 violated in output {}: sorted_series decreased at row {}",
+                    out_idx,
+                    i
+                );
+                if a == b {
+                    prop_assert!(
+                        ts.value(i) >= ts.value(i + 1),
+                        "MC-3 violated in output {}: timestamp not descending within series at \
+                         row {}",
+                        out_idx,
+                        i
+                    );
+                }
+            }
+        }
+
+        Ok(())
+    }
+}
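
Review note on the proptest above: the input strategy and the MC-3 check can be read in isolation from Arrow and Parquet. The following is a minimal std-only sketch of both, not the code under review. `normalize_specs` and `check_series_then_ts_desc` are hypothetical names introduced for illustration, and rows are modelled as plain `(series_bytes, timestamp)` tuples instead of `BinaryArray` / `UInt64Array` columns.

```rust
/// Sorts RG specs by metric and drops repeated metrics, keeping the first
/// entry for each. `dedup_by` only removes consecutive duplicates, so the
/// sort has to come first.
fn normalize_specs(mut rgs: Vec<(String, usize)>) -> Vec<(String, usize)> {
    rgs.sort_by(|a, b| a.0.cmp(&b.0));
    rgs.dedup_by(|a, b| a.0 == b.0);
    rgs
}

/// Checks that rows are ordered by series ascending and, within one series,
/// by timestamp descending. On failure, returns the index `i` of the first
/// adjacent pair `(i, i + 1)` that breaks the order.
fn check_series_then_ts_desc(rows: &[(Vec<u8>, u64)]) -> Result<(), usize> {
    for (i, pair) in rows.windows(2).enumerate() {
        let (a, b) = (&pair[0], &pair[1]);
        // Series bytes compare lexicographically; big-endian encodings of
        // integers therefore compare like the integers themselves.
        if a.0 > b.0 {
            return Err(i);
        }
        // Same series: the timestamp must not increase.
        if a.0 == b.0 && a.1 < b.1 {
            return Err(i);
        }
    }
    Ok(())
}
```

Because `sort_by` is stable, the entry that survives deduplication is the first one generated for each metric, so the row count of a deduplicated RG is deterministic for a given proptest seed.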
diff --git a/quickwit/quickwit-parquet-engine/src/merge/streaming/body_assembler.rs b/quickwit/quickwit-parquet-engine/src/merge/streaming/body_assembler.rs
new file mode 100644
index 00000000000..5f6bb83ba0e
--- /dev/null
+++ b/quickwit/quickwit-parquet-engine/src/merge/streaming/body_assembler.rs
@@ -0,0 +1,494 @@
+// Copyright 2021-Present Datadog, Inc.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+//! Page-bounded body column assembly — the streaming core.
+//!
+//! For each `(output_idx, body_col)` pair the engine builds a
+//! [`BodyColOutputPageAssembler`] that yields one assembled output
+//! page per `Iterator::next` call. Each call:
+//!
+//! 1. Walks the destinations table to pick `(input_idx, input_row)` pairs that map to the next
+//!    `OUTPUT_PAGE_ROWS` positions of this output.
+//! 2. Advances each contributing input's decoder forward until its cached pages cover the needed
+//!    rows. The decoder + page cache live on [`InputDecoderState`] (not the assembler) so a page
+//!    whose row range straddles two outputs survives into the next output's assembler — the stream
+//!    cannot be rewound.
+//! 3. Concatenates each input's cached pages and calls `arrow::compute::interleave` to assemble the
+//!    output page.
+//! 4. Evicts pages whose last row falls below the cursor.
+//!
+//! Memory bound per `next()`: one in-progress output page plus a few
+//! in-flight input pages per input — never a whole column chunk. See
+//! the MS-7 test in the parent test module for the runtime assertion.
+//!
+//! [`StreamingBodyColIter`] wraps the `Result<ArrayRef>` page stream
+//! into the `Iterator<Item = ArrayRef>` shape
+//! `write_next_column_arrays` expects, capturing the first assembly
+//! error in a side cell.
+
+use std::collections::HashSet;
+
+use anyhow::{Context, Result, anyhow, bail};
+use arrow::array::{Array, ArrayRef, new_null_array};
+use arrow::compute::interleave;
+use arrow::datatypes::{DataType, Field};
+use tokio::runtime::Handle;
+
+use super::{
+    InputDecoderState, InputRowDestinations, OUTPUT_PAGE_ROWS, record_body_col_page_cache_len,
+};
+
+/// Adapts a `Result<ArrayRef>` page assembler into the
+/// `Iterator<Item = ArrayRef>` shape `write_next_column_arrays` expects.
+/// The first assembly error is captured in `error_slot` and iteration
+/// ends; the caller MUST check the slot after the writer returns. If
+/// `service_collector` is `Some`, every yielded page is scanned for
+/// service names and added to the set; collection failures also stop
+/// the iterator and populate `error_slot`.
+pub(crate) struct StreamingBodyColIter<'a, I> {
+    pub(crate) inner: I,
+    pub(crate) error_slot: &'a mut Option<anyhow::Error>,
+    pub(crate) service_collector: Option<&'a mut HashSet<String>>,
+}
+
+impl<I> Iterator for StreamingBodyColIter<'_, I>
+where I: Iterator<Item = Result<ArrayRef>>
+{
+    type Item = ArrayRef;
+
+    fn next(&mut self) -> Option<ArrayRef> {
+        if self.error_slot.is_some() {
+            return None;
+        }
+        match self.inner.next() {
+            Some(Ok(arr)) => {
+                if let Some(out) = self.service_collector.as_deref_mut()
+                    && let Err(e) = collect_service_names_from_page(arr.as_ref(), out)
+                {
+                    *self.error_slot = Some(e);
+                    return None;
+                }
+                Some(arr)
+            }
+            Some(Err(e)) => {
+                *self.error_slot = Some(e);
+                None
+            }
+            None => None,
+        }
+    }
+}
+
+/// Per-page service name collector. Used during the streaming write
+/// of the "service" body col to populate per-output service_names.
+pub(crate) fn collect_service_names_from_page(
+    arr: &dyn Array,
+    out: &mut HashSet<String>,
+) -> Result<()> {
+    use arrow::array::AsArray;
+    use arrow::datatypes::{Int8Type, Int16Type, Int32Type, Int64Type};
+
+    fn extend_from_strings(strings: &arrow::array::StringArray, out: &mut HashSet<String>) {
+        for i in 0..strings.len() {
+            if strings.is_valid(i) {
+                out.insert(strings.value(i).to_string());
+            }
+        }
+    }
+
+    match arr.data_type() {
+        DataType::Utf8 => {
+            let strings = arr
+                .as_any()
+                .downcast_ref::<arrow::array::StringArray>()
+                .ok_or_else(|| anyhow!("expected StringArray for service col page"))?;
+            extend_from_strings(strings, out);
+        }
+        DataType::LargeUtf8 => {
+            let strings = arr
+                .as_any()
+                .downcast_ref::<arrow::array::LargeStringArray>()
+                .ok_or_else(|| anyhow!("expected LargeStringArray for service col page"))?;
+            for i in 0..strings.len() {
+                if strings.is_valid(i) {
+                    out.insert(strings.value(i).to_string());
+                }
+            }
+        }
+        DataType::Dictionary(key_type, value_type)
+            if matches!(value_type.as_ref(), DataType::Utf8) =>
+        {
+            // Extract the dictionary's values that are referenced by
+            // valid (non-null) keys.
+            match key_type.as_ref() {
+                DataType::Int8 => {
+                    let dict = arr.as_dictionary::<Int8Type>();
+                    if let Some(strings) = dict
+                        .values()
+                        .as_any()
+                        .downcast_ref::<arrow::array::StringArray>()
+                    {
+                        for i in 0..dict.len() {
+                            if dict.is_valid(i) {
+                                let key = dict.keys().value(i) as usize;
+                                if key < strings.len() && strings.is_valid(key) {
+                                    out.insert(strings.value(key).to_string());
+                                }
+                            }
+                        }
+                    }
+                }
+                DataType::Int16 => {
+                    let dict = arr.as_dictionary::<Int16Type>();
+                    if let Some(strings) = dict
+                        .values()
+                        .as_any()
+                        .downcast_ref::<arrow::array::StringArray>()
+                    {
+                        for i in 0..dict.len() {
+                            if dict.is_valid(i) {
+                                let key = dict.keys().value(i) as usize;
+                                if key < strings.len() && strings.is_valid(key) {
+                                    out.insert(strings.value(key).to_string());
+                                }
+                            }
+                        }
+                    }
+                }
+                DataType::Int32 => {
+                    let dict = arr.as_dictionary::<Int32Type>();
+                    if let Some(strings) = dict
+                        .values()
+                        .as_any()
+                        .downcast_ref::<arrow::array::StringArray>()
+                    {
+                        for i in 0..dict.len() {
+                            if dict.is_valid(i) {
+                                let key = dict.keys().value(i) as usize;
+                                if key < strings.len() && strings.is_valid(key) {
+                                    out.insert(strings.value(key).to_string());
+                                }
+                            }
+                        }
+                    }
+                }
+                DataType::Int64 => {
+                    let dict = arr.as_dictionary::<Int64Type>();
+                    if let Some(strings) = dict
+                        .values()
+                        .as_any()
+                        .downcast_ref::<arrow::array::StringArray>()
+                    {
+                        for i in 0..dict.len() {
+                            if dict.is_valid(i) {
+                                let key = dict.keys().value(i) as usize;
+                                if key < strings.len() && strings.is_valid(key) {
+                                    out.insert(strings.value(key).to_string());
+                                }
+                            }
+                        }
+                    }
+                }
+                _ => {}
+            }
+        }
+        _ => {
+            // Skip non-string types — service col is expected to be
+            // string-like; if it isn't, just don't collect names.
+        }
+    }
+    Ok(())
+}
+
+/// Assembles output pages for one (output_idx, body_col). See the
+/// module docs for the full contract.
+pub(crate) struct BodyColOutputPageAssembler<'a> {
+    handle: &'a Handle,
+    decoders_state: &'a mut [InputDecoderState],
+    input_col_indices: &'a [Option<usize>],
+    destinations: &'a InputRowDestinations,
+    out_idx: usize,
+    col_name: &'a str,
+    out_field: &'a Field,
+    /// Total rows written so far for this output's col.
+    rows_emitted: usize,
+    /// Total rows expected = destinations.rows_per_output[out_idx].
+    expected_rows: usize,
+    /// EOF flag (returns None on subsequent calls once true).
+    done: bool,
+}
+
+impl<'a> BodyColOutputPageAssembler<'a> {
+    #[allow(clippy::too_many_arguments)]
+    pub(crate) fn new(
+        handle: &'a Handle,
+        decoders_state: &'a mut [InputDecoderState],
+        input_col_indices: &'a [Option<usize>],
+        destinations: &'a InputRowDestinations,
+        out_idx: usize,
+        col_name: &'a str,
+        out_field: &'a Field,
+    ) -> Self {
+        Self {
+            handle,
+            decoders_state,
+            input_col_indices,
+            destinations,
+            out_idx,
+            col_name,
+            out_field,
+            rows_emitted: 0,
+            expected_rows: destinations.rows_per_output[out_idx],
+            done: false,
+        }
+    }
+
+    pub(crate) fn into_iter(self) -> BodyColOutputPageIter<'a> {
+        BodyColOutputPageIter { inner: self }
+    }
+}
+
+pub(crate) struct BodyColOutputPageIter<'a> {
+    inner: BodyColOutputPageAssembler<'a>,
+}
+
+impl Iterator for BodyColOutputPageIter<'_> {
+    type Item = Result<ArrayRef>;
+
+    fn next(&mut self) -> Option<Self::Item> {
+        // `assemble_one_output_page` only advances `rows_emitted` by
+        // `page_size = remaining.min(OUTPUT_PAGE_ROWS)` where
+        // `remaining = expected_rows - rows_emitted`, so the increment
+        // never overshoots. `emitted > expected` would mean a real
+        // accounting bug — surface it loudly rather than silently
+        // terminating one iteration too late.
+        debug_assert!(
+            self.inner.rows_emitted <= self.inner.expected_rows,
+            "rows_emitted {} > expected_rows {} for output {}",
+            self.inner.rows_emitted,
+            self.inner.expected_rows,
+            self.inner.out_idx,
+        );
+        if self.inner.done || self.inner.rows_emitted == self.inner.expected_rows {
+            self.inner.done = true;
+            return None;
+        }
+        match assemble_one_output_page(&mut self.inner) {
+            Ok(Some(arr)) => Some(Ok(arr)),
+            Ok(None) => {
+                self.inner.done = true;
+                None
+            }
+            Err(e) => {
+                self.inner.done = true;
+                Some(Err(e))
+            }
+        }
+    }
+}
+
+fn assemble_one_output_page(s: &mut BodyColOutputPageAssembler) -> Result<Option<ArrayRef>> {
+    let remaining = s.expected_rows - s.rows_emitted;
+    if remaining == 0 {
+        return Ok(None);
+    }
+    let page_size = remaining.min(OUTPUT_PAGE_ROWS);
+
+    // Collect (input_idx, input_row) indices for this output page by
+    // walking destinations.per_input from each input's cursor. For
+    // each target output position, find the (input, row) that maps to
+    // it. See module docs for why this is correct given the merge
+    // plan invariants.
+    let mut indices_per_input: Vec<Vec<usize>> = vec![Vec::new(); s.decoders_state.len()];
+    let mut interleave_indices: Vec<(usize, usize)> = Vec::with_capacity(page_size);
+    let mut total_picked = 0usize;
+
+    while total_picked < page_size {
+        let target_pos = s.rows_emitted + total_picked;
+        let mut found = false;
+        for (input_idx, dests) in s.destinations.per_input.iter().enumerate() {
+            let cursor = match s.input_col_indices[input_idx] {
+                Some(col_parquet_idx) => {
+                    s.decoders_state[input_idx].body_col_cursor(col_parquet_idx)
+                }
+                None => 0,
+            };
+            for (input_row, dest) in dests.iter().enumerate().skip(cursor) {
+                match dest {
+                    Some((o, p)) if *o == s.out_idx && *p == target_pos => {
+                        interleave_indices.push((input_idx, input_row));
+                        indices_per_input[input_idx].push(input_row);
+                        found = true;
+                        // This `break` leaves the row scan; the
+                        // `if found` check right after this loop then
+                        // leaves the input scan as well.
+                        break;
+                    }
+                    _ => {}
+                }
+            }
+            if found {
+                break;
+            }
+        }
+        if !found {
+            // Shouldn't happen — every output position should be reachable.
+            bail!(
+                "merge plan inconsistency: output {} position {target_pos} not found in any input",
+                s.out_idx,
+            );
+        }
+        total_picked += 1;
+    }
+
+    // Advance each contributing input's decoder until its cached
+    // pages cover the maximum input row we'll reference.
+    for (input_idx, input_rows) in indices_per_input.iter().enumerate() {
+        if input_rows.is_empty() {
+            continue;
+        }
+        let col_parquet_idx = match s.input_col_indices[input_idx] {
+            Some(c) => c,
+            None => continue,
+        };
+        let max_needed_row = *input_rows.iter().max().expect("non-empty");
+        fill_page_cache_to_row(
+            s.handle,
+            &mut s.decoders_state[input_idx],
+            col_parquet_idx,
+            max_needed_row,
+        )?;
+    }
+
+    // Build the per-input value array. For inputs lacking this col,
+    // a single null-row placeholder routes interleave indices to position 0.
+    let mut input_array_refs: Vec<ArrayRef> = Vec::with_capacity(s.decoders_state.len());
+    let mut input_cache_starts: Vec<usize> = Vec::with_capacity(s.decoders_state.len());
+
+    for input_idx in 0..s.decoders_state.len() {
+        match s.input_col_indices[input_idx] {
+            Some(col_parquet_idx) => {
+                let pages = s.decoders_state[input_idx].body_col_cache(col_parquet_idx);
+                if pages.is_empty() {
+                    input_array_refs.push(new_null_array(s.out_field.data_type(), 0));
+                    input_cache_starts.push(0);
+                } else {
+                    let cache_start = pages[0].row_start;
+                    let arrays: Vec<&dyn Array> = pages.iter().map(|p| p.array.as_ref()).collect();
+                    let concatenated = arrow::compute::concat(&arrays).with_context(|| {
+                        format!(
+                            "concatenating cached pages for input {input_idx} col '{}'",
+                            s.col_name,
+                        )
+                    })?;
+                    input_array_refs.push(concatenated);
+                    input_cache_starts.push(cache_start);
+                }
+            }
+            None => {
+                let null_arr = new_null_array(s.out_field.data_type(), 1);
+                input_array_refs.push(null_arr);
+                input_cache_starts.push(0);
+            }
+        }
+    }
+
+    let interleave_local: Vec<(usize, usize)> = interleave_indices
+        .iter()
+        .map(|&(i_idx, i_row)| match s.input_col_indices[i_idx] {
+            Some(_) => (i_idx, i_row - input_cache_starts[i_idx]),
+            None => (i_idx, 0),
+        })
+        .collect();
+
+    let array_refs_ref: Vec<&dyn Array> = input_array_refs.iter().map(|a| a.as_ref()).collect();
+    let assembled = interleave(&array_refs_ref, &interleave_local).with_context(|| {
+        format!(
+            "interleaving body col '{}' for output {}",
+            s.col_name, s.out_idx,
+        )
+    })?;
+
+    // Bump cursors past consumed rows and evict pages whose last row
+    // falls below the cursor. Both live on InputDecoderState (per
+    // col) so they persist across regions/outputs that share this
+    // column.
+    for (input_idx, input_rows) in indices_per_input.iter().enumerate() {
+        if input_rows.is_empty() {
+            continue;
+        }
+        let max_row = *input_rows.iter().max().expect("non-empty");
+        let state = &mut s.decoders_state[input_idx];
+        let Some(col_parquet_idx) = s.input_col_indices[input_idx] else {
+            continue;
+        };
+        state.set_body_col_cursor(col_parquet_idx, max_row + 1);
+    }
+
+    s.rows_emitted += page_size;
+    Ok(Some(assembled))
+}
+
+/// Pull pages from the input's persistent decoder via `block_on` until
+/// the cached pages for `col_parquet_idx` cover up through `target_row`
+/// (inclusive). Stops as soon as the latest cached page ends past
+/// `target_row`. The function's effect on the world is *adding pages
+/// to the cache* — it does not skip data and does not consume any
+/// rows on its own.
+///
+/// The decoder MUST be the long-lived [`InputDecoderState::decoder`]:
+/// it preserves the per-(rg, col) `rows_decoded` counter so successive
+/// `DecodedPage::row_start` values are absolute input row indices,
+/// not page-local zeros. Likewise, the cache lives on the state so
+/// pages whose row range spans an output boundary survive into the
+/// next output's assembler.
+fn fill_page_cache_to_row(
+    handle: &Handle,
+    state: &mut InputDecoderState,
+    col_parquet_idx: usize,
+    target_row: usize,
+) -> Result<()> {
+    // Already covered by what's cached for this col?
+    if let Some(last) = state.body_col_cache(col_parquet_idx).last() {
+        let last_end = last.row_start + last.array.len();
+        if target_row < last_end {
+            return Ok(());
+        }
+    }
+
+    // Drain pages from the stream until either the target is covered
+    // for `col_parquet_idx` or the stream runs out. Pages emitted for
+    // a different col_idx still get cached under their own col so a
+    // later request for that col can find them without re-fetching —
+    // critical for the synthesized-prefix path, which re-reads earlier
+    // cols across multiple regions after the stream has moved on.
+    loop {
+        let decoded = handle
+            .block_on(state.decoder.decode_next_page())
+            .context("decoding body col page")?;
+        let page = match decoded {
+            Some(p) => p,
+            None => bail!(
+                "stream EOF while advancing to row {target_row} for parquet col {col_parquet_idx}",
+            ),
+        };
+        let page_col = page.col_idx;
+        let end = page.row_start + page.array.len();
+        state.body_col_cache_mut(page_col).push(page);
+        record_body_col_page_cache_len(state.body_col_caches_total_len());
+        if page_col == col_parquet_idx && target_row < end {
+            return Ok(());
+        }
+    }
+}
diff --git a/quickwit/quickwit-parquet-engine/src/merge/streaming/output.rs b/quickwit/quickwit-parquet-engine/src/merge/streaming/output.rs
new file mode 100644
index 00000000000..071a5030dcd
--- /dev/null
+++ b/quickwit/quickwit-parquet-engine/src/merge/streaming/output.rs
@@ -0,0 +1,365 @@
+// Copyright 2021-Present Datadog, Inc.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+//! Per-output state, writer setup, and finalization for the streaming
+//! merge engine.
+//!
+//! Each output file in a streaming merge owns:
+//! - An [`OutputWriterStorage`] holding the live parquet writer and a running count of row groups
+//!   written to it (used to verify MS-3).
+//! - An [`OutputAccumulator`] that concatenates this output's sort-col contributions across regions
+//!   so per-output metadata (`qh.row_keys`, `qh.zonemap_regexes`, metric_names, time_range) can be
+//!   computed from the output's actual rows at finalize time.
+//!
+//! Per-output `qh.row_keys` and `qh.zonemap_regexes` are *appended* to
+//! the parquet footer just before close — they cannot be set at writer
+//! construction because they depend on row content that the streaming
+//! pass only finishes accumulating in `finalize_output`. The static
+//! KV entries (sort schema, window, num_merge_ops, prefix_len) are set
+//! at construction since they're identical regardless of row content.
+
+use std::collections::{HashMap, HashSet};
+use std::path::{Path, PathBuf};
+use std::sync::Arc;
+
+use anyhow::{Context, Result};
+use arrow::array::RecordBatch;
+use arrow::datatypes::{Schema as ArrowSchema, SchemaRef};
+use base64::Engine;
+use base64::engine::general_purpose::STANDARD as BASE64;
+use ulid::Ulid;
+
+use super::super::writer::{build_merge_kv_metadata, resolve_sort_field_names};
+use super::super::{InputMetadata, MergeOutputFile};
+use crate::row_keys;
+use crate::sort_fields::{is_timestamp_column_name, parse_sort_fields};
+use crate::split::TAG_SERVICE;
+use crate::storage::split_writer::{extract_metric_names, extract_time_range};
+use crate::storage::streaming_writer::StreamingParquetWriter;
+use crate::storage::{
+    PARQUET_META_ROW_KEYS, PARQUET_META_ROW_KEYS_JSON, PARQUET_META_ZONEMAP_REGEXES,
+};
+use crate::zonemap::{self, ZonemapOptions};
+
+/// Per-output-file mutable state owned across regions assigned to that file.
+pub(crate) struct OutputWriterStorage {
+    pub(crate) output_idx: usize,
+    pub(crate) output_path: PathBuf,
+    pub(crate) writer: StreamingParquetWriter<std::fs::File>,
+    /// Number of parquet row groups written into this file so far. One
+    /// row group is opened per region in the per-region processor, so
+    /// this equals the count of regions assigned to this output once
+    /// processing is done. Reported on [`MergeOutputFile::num_row_groups`].
+    pub(crate) num_row_groups: usize,
+}
+
+/// Per-output-file accumulator. Each region's sort-col contribution is
+/// merged into `accumulated_sort_batch`; per-output metadata
+/// (row_keys, zonemap, metric_names, time_range) is computed once at
+/// `finalize_output` time. Service names are collected during the
+/// streaming write of the service body col within each region.
+pub(crate) struct OutputAccumulator {
+    pub(crate) output_idx: usize,
+    /// Concatenated sort-col `RecordBatch` across all regions written
+    /// to this output. Memory bounded by total sort col bytes in the
+    /// output file (small — sort cols are narrow).
+    pub(crate) accumulated_sort_batch: Option<RecordBatch>,
+    /// Service names collected across regions' body-col writes for
+    /// this output file.
+    pub(crate) service_names: HashSet<String>,
+    /// Cumulative row count = sum of regions' total_rows assigned here.
+    pub(crate) num_rows: usize,
+}
+
+impl OutputAccumulator {
+    pub(crate) fn new(output_idx: usize) -> Self {
+        Self {
+            output_idx,
+            accumulated_sort_batch: None,
+            service_names: HashSet::new(),
+            num_rows: 0,
+        }
+    }
+
+    pub(crate) fn append_sort_batch(&mut self, batch: RecordBatch) -> Result<()> {
+        match self.accumulated_sort_batch.take() {
+            None => {
+                self.accumulated_sort_batch = Some(batch);
+            }
+            Some(prev) => {
+                let schema = prev.schema();
+                let combined = arrow::compute::concat_batches(&schema, [&prev, &batch])
+                    .context("appending region sort batch to output accumulator")?;
+                self.accumulated_sort_batch = Some(combined);
+            }
+        }
+        Ok(())
+    }
+}
+
+/// Open a streaming Parquet writer for one output file. Caller is
+/// responsible for calling `start_row_group` per region and writing
+/// columns.
+pub(crate) fn open_output_writer_for_streaming(
+    output_idx: usize,
+    output_dir: &Path,
+    union_schema: &SchemaRef,
+    input_meta: &InputMetadata,
+    writer_config: &crate::storage::ParquetWriterConfig,
+) -> Result<OutputWriterStorage> {
+    let output_prefix_len = input_meta.rg_partition_prefix_len;
+    // `qh.row_keys` and `qh.zonemap_regexes` MUST be derived from the
+    // rows that end up in THIS output file, not from inputs — the
+    // merge eliminates key overlap between outputs, so an output's
+    // key metadata can be very different from any input's. We can't
+    // compute those values until every region has been written, so
+    // they are appended to the file's footer KV metadata in
+    // `finalize_output` via `StreamingParquetWriter::append_key_value_metadata`
+    // just before close. The KV entries set here cover the "static"
+    // keys (sort_fields, window, num_merge_ops, prefix_len) that are
+    // identical regardless of which rows the output contains.
+    let kv_entries = build_merge_kv_metadata(input_meta, &None, &HashMap::new(), output_prefix_len);
+
+    let sort_field_names = resolve_sort_field_names(&input_meta.sort_fields)?;
+    let sorting_cols = build_sorting_columns_from_schema(union_schema, &input_meta.sort_fields)?;
+
+    let props = writer_config.to_writer_properties_with_metadata(
+        union_schema,
+        sorting_cols,
+        Some(kv_entries),
+        &sort_field_names,
+    );
+
+    let output_filename = format!("merge_output_{}.parquet", Ulid::new());
+    let output_path = output_dir.join(&output_filename);
+    let file = std::fs::File::create(&output_path)
+        .with_context(|| format!("creating output file: {}", output_path.display()))?;
+    let writer = StreamingParquetWriter::try_new(file, Arc::clone(union_schema), props)
+        .with_context(|| format!("opening streaming writer for output {output_idx}"))?;
+
+    Ok(OutputWriterStorage {
+        output_idx,
+        output_path,
+        writer,
+        num_row_groups: 0,
+    })
+}
+
+/// Compute `SortingColumn` entries from the union schema's col indices
+/// (no RecordBatch needed). Sort fields absent from it are skipped.
+pub(crate) fn build_sorting_columns_from_schema(
+    schema: &SchemaRef,
+    sort_fields_str: &str,
+) -> Result<Vec<parquet::file::metadata::SortingColumn>> {
+    let parsed = parse_sort_fields(sort_fields_str)?;
+    let mut cols = Vec::new();
+    for sf in &parsed.column {
+        // Schema may use `timestamp_secs` for what the sort schema
+        // calls `timestamp`. Match the existing alias handling.
+        let resolved =
+            if is_timestamp_column_name(&sf.name) && schema.index_of("timestamp_secs").is_ok() {
+                "timestamp_secs"
+            } else {
+                sf.name.as_str()
+            };
+        let Ok(col_idx) = schema.index_of(resolved) else {
+            continue;
+        };
+        cols.push(parquet::file::metadata::SortingColumn {
+            column_idx: col_idx as i32,
+            descending: sf.sort_direction
+                == quickwit_proto::sortschema::SortColumnDirection::SortDirectionDescending as i32,
+            nulls_first: false,
+        });
+    }
+    Ok(cols)
+}
+
+/// Finalize one output file: derive row-dependent metadata from the
+/// accumulated sort cols, append row keys and zonemap regexes to the
+/// footer, close the writer, stat the file, return `MergeOutputFile`.
+pub(crate) fn finalize_output(
+    writer_state: OutputWriterStorage,
+    accumulator: OutputAccumulator,
+    input_meta: &InputMetadata,
+) -> Result<MergeOutputFile> {
+    let OutputWriterStorage {
+        output_idx,
+        output_path,
+        mut writer,
+        num_row_groups,
+    } = writer_state;
+
+    // Compute per-output metadata from the rows that actually landed
+    // in THIS output file. Merging eliminates key overlap between
+    // outputs, so the row_keys / zonemap / metric_names / time_range
+    // each output advertises must come from its own accumulated sort
+    // batch — they cannot be carried over from any input file.
+    let sort_batch = accumulator
+        .accumulated_sort_batch
+        .unwrap_or_else(|| RecordBatch::new_empty(Arc::new(ArrowSchema::empty())));
+
+    let row_keys_proto = if sort_batch.num_rows() > 0 {
+        row_keys::extract_row_keys(&input_meta.sort_fields, &sort_batch)
+            .with_context(|| format!("extracting row keys for output {output_idx}"))?
+            .map(|rk| row_keys::encode_row_keys_proto(&rk))
+    } else {
+        None
+    };
+
+    let zonemap_opts = ZonemapOptions::default();
+    let zonemap_regexes = if sort_batch.num_rows() > 0 {
+        zonemap::extract_zonemap_regexes(&input_meta.sort_fields, &sort_batch, &zonemap_opts)
+            .with_context(|| format!("extracting zonemap regexes for output {output_idx}"))?
+    } else {
+        HashMap::new()
+    };
+
+    let metric_names =
+        if sort_batch.num_rows() > 0 && sort_batch.schema().index_of("metric_name").is_ok() {
+            extract_metric_names(&sort_batch)
+                .with_context(|| format!("extracting metric names for output {output_idx}"))?
+        } else {
+            HashSet::new()
+        };
+
+    let time_range =
+        if sort_batch.num_rows() > 0 && sort_batch.schema().index_of("timestamp_secs").is_ok() {
+            extract_time_range(&sort_batch)
+                .with_context(|| format!("extracting time range for output {output_idx}"))?
+        } else {
+            crate::split::TimeRange::new(0, 0)
+        };
+
+    // Write the per-output `qh.row_keys` / `qh.zonemap_regexes` into
+    // the file's KV metadata so downstream tools that read the
+    // Parquet footer directly see the same values that
+    // `MergeOutputFile` carries in memory.
+    append_per_output_kv_metadata(&mut writer, row_keys_proto.as_ref(), &zonemap_regexes);
+
+    let footer_metadata = writer
+        .close()
+        .with_context(|| format!("closing writer for output {output_idx}"))?;
+
+    // MS-3: the count we report on `MergeOutputFile.num_row_groups`
+    // must agree with what the on-disk parquet footer actually
+    // contains. `num_row_groups` is bumped per `start_row_group` call
+    // in the per-region processor; if we ever skip one or double-
+    // count, this catches the drift in debug builds before downstream
+    // metadata consumers see the inconsistency.
+    debug_assert_eq!(
+        footer_metadata.num_row_groups(),
+        num_row_groups,
+        "MergeOutputFile.num_row_groups ({num_row_groups}) disagrees with footer ({})",
+        footer_metadata.num_row_groups(),
+    );
+
+    // Strong invariant: the streaming engine pairs each region with
+    // at most one RG per input (`process_region` keys
+    // `sort_col_batches` by input_idx), and the read-path check in
+    // `extract_regions_from_metadata` enforces unique prefix keys per
+    // input. The output writer assigns one new RG per region in
+    // `process_region`, so the output should never violate the
+    // invariant either — but verifying here catches any future
+    // regression (e.g., a finalize-time rewrite that splits a region
+    // across RGs) before the file lands.
+    super::region_grouping::assert_unique_rg_prefix_keys(
+        &footer_metadata,
+        &input_meta.sort_fields,
+        input_meta.rg_partition_prefix_len,
+        &format!("merge output {}", output_path.display()),
+    )?;
+
+    let size_bytes = std::fs::metadata(&output_path)
+        .with_context(|| format!("stat output file: {}", output_path.display()))?
+        .len();
+
+    // If `service` is a sort column for this schema, it took the
+    // sort-col write path in `process_region` and the body-col
+    // `track_service` branch never saw it. Fold in the names from the
+    // accumulated sort batch so `TAG_SERVICE` metadata stays accurate
+    // regardless of which path wrote the column.
+    let mut service_names = accumulator.service_names;
+    if sort_batch.num_rows() > 0
+        && let Ok(service_col_idx) = sort_batch.schema().index_of("service")
+    {
+        super::body_assembler::collect_service_names_from_page(
+            sort_batch.column(service_col_idx).as_ref(),
+            &mut service_names,
+        )
+        .with_context(|| {
+            format!("collecting service names from sort col for output {output_idx}")
+        })?;
+    }
+
+    let mut low_cardinality_tags: HashMap<String, HashSet<String>> = HashMap::new();
+    if !service_names.is_empty() {
+        low_cardinality_tags.insert(TAG_SERVICE.to_string(), service_names);
+    }
+
+    Ok(MergeOutputFile {
+        path: output_path,
+        num_rows: accumulator.num_rows,
+        num_row_groups,
+        // The streaming engine stamps `input_meta.rg_partition_prefix_len`
+        // unconditionally in the file's KV (see
+        // `open_output_writer_for_streaming`) and verifies the claim with
+        // `assert_unique_rg_prefix_keys`. CS-1 requires the metastore-
+        // recorded value to match the on-disk KV, so propagate the same
+        // value here.
+        output_rg_partition_prefix_len: input_meta.rg_partition_prefix_len,
+        size_bytes,
+        row_keys_proto,
+        zonemap_regexes,
+        metric_names,
+        time_range,
+        low_cardinality_tags,
+    })
+}
+
+/// Append the per-output `qh.row_keys` / `qh.zonemap_regexes` KV
+/// entries to the streaming writer just before close. Encoded the
+/// same way as the non-streaming writer (`build_merge_kv_metadata`):
+/// base64 for the proto, JSON for the zonemap map, plus the optional
+/// human-readable `qh.row_keys_json`.
+fn append_per_output_kv_metadata(
+    writer: &mut StreamingParquetWriter<std::fs::File>,
+    row_keys_proto: Option<&Vec<u8>>,
+    zonemap_regexes: &HashMap<String, String>,
+) {
+    if let Some(rk_bytes) = row_keys_proto {
+        writer.append_key_value_metadata(parquet::file::metadata::KeyValue::new(
+            PARQUET_META_ROW_KEYS.to_string(),
+            BASE64.encode(rk_bytes),
+        ));
+        if let Ok(rk) =
+            <quickwit_proto::sortschema::RowKeys as prost::Message>::decode(rk_bytes.as_slice())
+            && let Ok(json) = serde_json::to_string(&rk)
+        {
+            writer.append_key_value_metadata(parquet::file::metadata::KeyValue::new(
+                PARQUET_META_ROW_KEYS_JSON.to_string(),
+                json,
+            ));
+        }
+    }
+
+    if !zonemap_regexes.is_empty() {
+        let json = serde_json::to_string(zonemap_regexes)
+            .expect("HashMap<String, String> JSON serialization cannot fail");
+        writer.append_key_value_metadata(parquet::file::metadata::KeyValue::new(
+            PARQUET_META_ZONEMAP_REGEXES.to_string(),
+            json,
+        ));
+    }
+}
diff --git a/quickwit/quickwit-parquet-engine/src/merge/streaming/region_grouping.rs b/quickwit/quickwit-parquet-engine/src/merge/streaming/region_grouping.rs
new file mode 100644
index 00000000000..0e6eeef5979
--- /dev/null
+++ b/quickwit/quickwit-parquet-engine/src/merge/streaming/region_grouping.rs
@@ -0,0 +1,836 @@
+// Copyright 2021-Present Datadog, Inc.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+//! Region grouping for multi-RG streaming merge.
+//!
+//! A "region" is one merge unit: a contiguous slice of the merged
+//! output where all contributing inputs share the same sort-prefix
+//! value. For multi-RG metric-aligned inputs each region pairs with
+//! at most one row group per input, which is the property that makes
+//! per-region streaming work without column-chunk-bounded buffering.
+//!
+//! This module owns the composite prefix-key encoding (sort-order
+//! preserving across multiple columns / per-type encoding rules) and
+//! the BTreeMap-based region grouping. It also enforces MS-2: the
+//! BTreeMap iteration order must agree with each input's physical
+//! row-group order, otherwise the streaming engine would `bail!`
+//! mid-merge when a page arrives from an unexpected row group.
+//!
+//! The encoding rules per parquet physical type:
+//! - `ByteArray` / `FixedLenByteArray`: byte-stuffed escape encoding — each `0x00` byte in the
+//!   value becomes `0x00 0x01`, and a `0x00 0x00` terminator is appended. This preserves
+//!   lexicographic order both for single columns (`"aa"` < `"b"`) and across concatenated composite
+//!   keys (the `0x00 0x00` terminator is the smallest possible 2-byte sequence under escaping, so
+//!   shorter values sort before longer values when prefixes match).
+//! - `Int32` / `Int64`: sign-flipped big-endian so byte order matches numeric order across the full
+//!   signed range.
+//! - `Boolean`: single 0/1 byte.
+//! - DESC columns: per-byte complement of the encoding above so smaller values' bytes sort
+//!   *larger*.
+//! - `Float` / `Double` / `Int96`: rejected with a clear error.
+
+use std::collections::{BTreeMap, HashSet};
+
+use anyhow::{Context, Result, anyhow, bail};
+use arrow::array::RecordBatch;
+use parquet::file::metadata::ParquetMetaData;
+
+use super::super::InputMetadata;
+use super::super::merge_order::MergeRun;
+use super::InputDecoderState;
+use crate::sort_fields::{is_timestamp_column_name, parse_sort_fields};
+use crate::sorted_series::append_prefix_col_to_key;
+
+/// One merge region: a contiguous slice of the merged output, where all
+/// contributing inputs share the same sort-prefix value (e.g., one
+/// `metric_name` when `rg_partition_prefix_len == 1`).
+///
+/// A region pairs with **at most one row group per input** — the
+/// property that makes per-region streaming work without
+/// column-chunk-bounded buffering. The `start_row` field on each
+/// contribution lets a single row group be sliced across multiple
+/// adjacent regions, which is how the engine subdivides a region at
+/// `sorted_series` transitions to honor `num_outputs` when one
+/// region (e.g. a giant single metric with `prefix_len=0`) would
+/// otherwise occupy a single output file.
+#[derive(Debug, Clone)]
+pub(crate) struct Region {
+    /// Sort-prefix value identifying this region (e.g., `metric_name`
+    /// bytes for `prefix_len == 1`). Used only for ordering and
+    /// diagnostics; the merge engine doesn't decode this value.
+    pub(crate) prefix_key: Vec<u8>,
+    /// Per contributing input: which slice of which row group belongs
+    /// to this region. Ordered by `input_idx`.
+    pub(crate) contributing: Vec<RegionContribution>,
+}
+
+/// One input's contribution to a region: the input index, the row
+/// group within that input, and the row range within that row group
+/// that belongs to the region.
+///
+/// For top-level regions from `extract_regions_from_metadata` each
+/// contribution covers a whole RG: `start_row == 0` and
+/// `num_rows == rg.num_rows()`. Sub-regions produced by
+/// `split_region_at_sorted_series` reference the same
+/// `(input_idx, rg_idx)` as their parent with disjoint contiguous row
+/// ranges.
+#[derive(Debug, Clone)]
+pub(crate) struct RegionContribution {
+    pub(crate) input_idx: usize,
+    pub(crate) rg_idx: usize,
+    pub(crate) start_row: usize,
+    pub(crate) num_rows: usize,
+}
+
+impl Region {
+    pub(crate) fn total_rows(&self) -> usize {
+        self.contributing.iter().map(|c| c.num_rows).sum()
+    }
+}
+
+/// A prefix column's location in the parquet schema, plus the sort
+/// direction and ordinal position declared for it. `name` is the
+/// sort-schema name (used in error messages); `parquet_col_idx` is
+/// the resolved index in the parquet schema's flat column list
+/// (after applying the `timestamp` / `timestamp_secs` alias);
+/// `ordinal` is the column's 0-based position in `qh.sort_fields`,
+/// matching `sorted_series`'s ordinal assignment so the per-RG
+/// prefix key composes as a literal prefix of every row's
+/// `sorted_series` key.
+#[derive(Debug, Clone)]
+pub(crate) struct PrefixColumn {
+    pub(crate) name: String,
+    pub(crate) parquet_col_idx: usize,
+    pub(crate) descending: bool,
+    pub(crate) ordinal: u8,
+}
+
+/// Resolve the first `prefix_len` sort columns to parquet leaf
+/// indices. Honours the legacy `timestamp` → `timestamp_secs` alias.
+///
+/// Returns one entry per requested prefix column. `Some(PrefixColumn)`
+/// when the column is present in the parquet schema; `None` when the
+/// column is named in `sort_fields_str` but absent from the parquet
+/// schema. Per SS-3 the missing column is treated as constant null at
+/// every row of the file: [`extract_rg_composite_prefix_key`] skips
+/// that slot entirely (no ordinal byte, no value bytes), so ordering
+/// is driven entirely by the present columns.
+///
+/// Errors only when the sort schema declares fewer columns than
+/// requested — that means we don't have a *name* for one of the
+/// prefix columns and can't claim alignment on something we can't
+/// identify.
+pub(crate) fn find_prefix_parquet_col_indices(
+    metadata: &ParquetMetaData,
+    sort_fields_str: &str,
+    prefix_len: usize,
+    _input_idx: usize,
+) -> Result<Vec<Option<PrefixColumn>>> {
+    let sort_field_schema = parse_sort_fields(sort_fields_str)?;
+    if sort_field_schema.column.len() < prefix_len {
+        bail!(
+            "sort schema has {} columns but rg_partition_prefix_len = {prefix_len}",
+            sort_field_schema.column.len(),
+        );
+    }
+    let parquet_schema = metadata.file_metadata().schema_descr();
+    let mut prefix_cols = Vec::with_capacity(prefix_len);
+    for (pos, sort_col) in sort_field_schema.column.iter().take(prefix_len).enumerate() {
+        // Apply the same `timestamp` / `timestamp_secs` alias the rest
+        // of the engine uses.
+        let resolved = if is_timestamp_column_name(&sort_col.name)
+            && parquet_has_column(parquet_schema, "timestamp_secs")
+        {
+            "timestamp_secs"
+        } else {
+            sort_col.name.as_str()
+        };
+        let descending = sort_col.sort_direction
+            == quickwit_proto::sortschema::SortColumnDirection::SortDirectionDescending as i32;
+        let mut found = None;
+        for (col_idx, col) in parquet_schema.columns().iter().enumerate() {
+            if col.path().parts()[0] == resolved {
+                found = Some(col_idx);
+                break;
+            }
+        }
+        // SS-3: missing column → `None`. The composite-key extractor
+        // skips this slot entirely (no ordinal byte, no value bytes);
+        // the trailing prefix-length sentinel in
+        // `extract_rg_composite_prefix_key` ensures the resulting key
+        // still sorts cleanly relative to RGs with present values
+        // (and matches sorted_series's row-level null-skip).
+        //
+        // Ordinal matches the column's position in `qh.sort_fields`.
+        // For prefix cols (always the first `prefix_len` entries of
+        // the sort schema) the ordinal equals the iteration index
+        // `pos`, which is also the ordinal `sorted_series` would
+        // assign — so the per-RG prefix key composes as a literal
+        // byte prefix of every sorted_series key.
+        prefix_cols.push(found.map(|parquet_col_idx| PrefixColumn {
+            name: sort_col.name.clone(),
+            parquet_col_idx,
+            descending,
+            ordinal: pos as u8,
+        }));
+    }
+    Ok(prefix_cols)
+}
+
+fn parquet_has_column(
+    parquet_schema: &parquet::schema::types::SchemaDescriptor,
+    name: &str,
+) -> bool {
+    parquet_schema
+        .columns()
+        .iter()
+        .any(|c| c.path().parts()[0] == name)
+}
+
+/// Build the composite byte key identifying a row group's prefix
+/// values for grouping into a region. Per prefix column, contributes
+/// `storekey(ordinal) || storekey(value)` (with value bytes inverted
+/// for DESC columns) using
+/// [`crate::sorted_series::append_prefix_col_to_key`], the same
+/// encoding `sorted_series` produces — so this per-RG key is a
+/// literal byte prefix of every `sorted_series` key emitted by rows
+/// in this RG.
+///
+/// Null handling:
+/// - **Column absent from schema (`None` in `prefix_cols`)**: SS-3 case. Every row of the file has
+///   a constant null in this slot, so the contribution to the composite is empty (column skipped).
+///   The trailing prefix-length sentinel keeps the resulting key well-formed.
+/// - **All-null RG on a present prefix column**: column skipped for this RG (the next column's
+///   higher ordinal byte — or the trailing sentinel — appears in its place), so the RG sorts after
+///   any RG carrying a non-null value for this column. Mirrors the row-level convention in
+///   `sorted_series` and gives nulls-last ordering for free.
+/// - **Mixed null + non-null in one RG**: rows in the RG would encode to two distinct prefix keys
+///   (the non-null value's key and the column-skipped key), breaking the
+///   at-most-one-prefix-value-per-RG invariant (PA-1). Reject.
+/// - **No nulls**: standard `min == max` check on stats, then encode that single value.
+pub(crate) fn extract_rg_composite_prefix_key(
+    metadata: &ParquetMetaData,
+    rg_idx: usize,
+    prefix_cols: &[Option<PrefixColumn>],
+    input_idx: usize,
+) -> Result<Vec<u8>> {
+    let rg_meta = metadata.row_group(rg_idx);
+    let mut key = Vec::new();
+    for col_opt in prefix_cols {
+        let Some(col) = col_opt else {
+            // SS-3 implicit null: column absent from schema, so every
+            // row's value is null. Skip the slot entirely — the
+            // trailing prefix-length sentinel will keep this from
+            // colliding with present-value keys, and sorted_series
+            // applies the same "skip null cols" rule at the row level.
+            continue;
+        };
+        let chunk = rg_meta.column(col.parquet_col_idx);
+        let stats = chunk.statistics().ok_or_else(|| {
+            anyhow!(
+                "input {input_idx} rg {rg_idx} col '{}' has no statistics — cannot determine \
+                 prefix alignment without min/max",
+                col.name,
+            )
+        })?;
+
+        // Parquet's `num_values` is total cell count including nulls.
+        // `null_count_opt()` returns the recorded null count, or `None`
+        // when absent; we default that to 0 per parquet-rs guidance.
+        let num_values = chunk.num_values().max(0) as u64;
+        let null_count = stats.null_count_opt().unwrap_or(0);
+
+        if num_values > 0 && null_count == num_values {
+            // All-null RG: skip the column entirely (don't write its
+            // ordinal byte or value). The trailing prefix-length
+            // sentinel below ensures the resulting all-null key
+            // still sorts after any non-null key. See the sentinel
+            // comment for the full argument.
+            continue;
+        }
+        if null_count > 0 {
+            // PA-1 violation: see function doc. Parquet's min/max
+            // hide nulls, so an RG with N nulls + 1 non-null cell
+            // reports `min == max == non_null_value` even though
+            // rows in that RG encode to two distinct prefix keys.
+            bail!(
+                "input {input_idx} rg {rg_idx} col '{}' is NOT prefix-aligned: contains \
+                 {null_count} nulls plus {} non-null values. PA-1 requires each row group to \
+                 carry a single prefix value; rows with null on this column encode to a different \
+                 prefix key (with the column skipped) than rows with the non-null value.",
+                col.name,
+                num_values - null_count,
+            );
+        }
+
+        encode_prefix_col_value(stats, col, rg_idx, input_idx, &mut key)?;
+    }
+
+    // Trailing prefix-length sentinel: an additional `u8(prefix_len)`
+    // ordinal byte that does two things at once:
+    //
+    // 1. **Forces nulls-last ordering across RGs.** For prefix_len=1 an all-null RG produces an
+    //    empty per-column body and would otherwise lex-sort *before* any non-null RG. With the
+    //    sentinel, the all-null key becomes `[prefix_len]` and the non-null key becomes `[ord(0),
+    //    storekey(value), ..., prefix_len]`. The non-null key starts with `ord(0) = 0x00`, smaller
+    //    than `prefix_len >= 1`, so non-null RGs sort first — matching `sorted_series`'s row-level
+    //    nulls-last convention via the same "the next ordinal byte appears in the skipped slot"
+    //    mechanism.
+    // 2. **Preserves the "literal prefix of sorted_series" property.** The byte we append is
+    //    exactly what `sorted_series` writes right after the prefix columns: the ordinal of the
+    //    next sort-schema column (`u8(prefix_len)`). So the per-RG key remains a byte-for-byte
+    //    prefix of every row's `sorted_series` value in that RG.
+    storekey::encode(&mut key, &(prefix_cols.len() as u8))
+        .map_err(|e| anyhow!("storekey encode prefix-length sentinel: {}", e))?;
+
+    Ok(key)
+}
+
+/// Verify `min == max` on the column chunk's non-null stats and
+/// append the single value to `key` via
+/// [`crate::sorted_series::append_prefix_col_to_key`] (which handles
+/// the ordinal prefix + descending-direction byte inversion). Caller
+/// has already filtered out all-null and mixed-null cases.
+///
+/// `Statistics::ByteArray` values are routed through the
+/// `Encode for str` impl after a UTF-8 check — every realistic sort
+/// prefix column (`metric_name`, `service`, tag names) is UTF-8
+/// text, and `sorted_series` itself only encodes strings, so the
+/// "byte prefix of sorted_series" property only holds for UTF-8
+/// values. Non-UTF-8 byte-array prefix cols would never match a
+/// `sorted_series` key in practice (sorted_series would not encode
+/// them either) and so are rejected up front.
+fn encode_prefix_col_value(
+    stats: &parquet::file::statistics::Statistics,
+    col: &PrefixColumn,
+    rg_idx: usize,
+    input_idx: usize,
+    key: &mut Vec<u8>,
+) -> Result<()> {
+    use parquet::file::statistics::Statistics;
+
+    fn require_eq<T: PartialEq + std::fmt::Debug>(
+        min: Option<T>,
+        max: Option<T>,
+        col_name: &str,
+        rg_idx: usize,
+        input_idx: usize,
+    ) -> Result<T> {
+        let min = min.ok_or_else(|| {
+            anyhow!(
+                "input {input_idx} rg {rg_idx} col '{col_name}' has no min in stats — cannot \
+                 determine prefix alignment"
+            )
+        })?;
+        let max = max.ok_or_else(|| {
+            anyhow!(
+                "input {input_idx} rg {rg_idx} col '{col_name}' has no max in stats — cannot \
+                 determine prefix alignment"
+            )
+        })?;
+        if min != max {
+            bail!(
+                "input {input_idx} rg {rg_idx} is NOT prefix-aligned on col '{col_name}': min \
+                 ({:?}) != max ({:?}). Multi-RG inputs declaring `rg_partition_prefix_len >= 1` \
+                 must carry one prefix-value per RG.",
+                min,
+                max,
+            );
+        }
+        Ok(min)
+    }
+
+    fn encode_byte_array_value(
+        min_bytes: Option<&[u8]>,
+        max_bytes: Option<&[u8]>,
+        col: &PrefixColumn,
+        rg_idx: usize,
+        input_idx: usize,
+        key: &mut Vec<u8>,
+    ) -> Result<()> {
+        let value = require_eq(
+            min_bytes.map(|b| b.to_vec()),
+            max_bytes.map(|b| b.to_vec()),
+            &col.name,
+            rg_idx,
+            input_idx,
+        )?;
+        let s = std::str::from_utf8(&value).map_err(|_| {
+            anyhow!(
+                "input {input_idx} rg {rg_idx} col '{}' has non-UTF-8 byte-array prefix value; \
+                 only UTF-8 string prefix columns are supported (matching sorted_series's `&str` \
+                 encoding)",
+                col.name,
+            )
+        })?;
+        append_prefix_col_to_key(key, col.ordinal, s, col.descending)
+    }
+
+    match stats {
+        Statistics::ByteArray(v) => {
+            encode_byte_array_value(
+                v.min_bytes_opt(),
+                v.max_bytes_opt(),
+                col,
+                rg_idx,
+                input_idx,
+                key,
+            )?;
+        }
+        Statistics::FixedLenByteArray(v) => {
+            encode_byte_array_value(
+                v.min_bytes_opt(),
+                v.max_bytes_opt(),
+                col,
+                rg_idx,
+                input_idx,
+                key,
+            )?;
+        }
+        Statistics::Int32(v) => {
+            let value = require_eq(
+                v.min_opt().copied(),
+                v.max_opt().copied(),
+                &col.name,
+                rg_idx,
+                input_idx,
+            )?;
+            append_prefix_col_to_key(key, col.ordinal, &value, col.descending)?;
+        }
+        Statistics::Int64(v) => {
+            let value = require_eq(
+                v.min_opt().copied(),
+                v.max_opt().copied(),
+                &col.name,
+                rg_idx,
+                input_idx,
+            )?;
+            append_prefix_col_to_key(key, col.ordinal, &value, col.descending)?;
+        }
+        Statistics::Boolean(v) => {
+            let value = require_eq(
+                v.min_opt().copied(),
+                v.max_opt().copied(),
+                &col.name,
+                rg_idx,
+                input_idx,
+            )?;
+            append_prefix_col_to_key(key, col.ordinal, &value, col.descending)?;
+        }
+        Statistics::Float(_) | Statistics::Double(_) => bail!(
+            "prefix col '{}' is floating-point; composite-key extraction does not yet support \
+             IEEE-754 ordering. Open an issue if you hit this — the encoding needs a sign-aware \
+             bit flip on negative values.",
+            col.name,
+        ),
+        Statistics::Int96(_) => bail!(
+            "prefix col '{}' is Int96 (deprecated timestamp type); use Int64-encoded \
+             `timestamp_secs` instead.",
+            col.name,
+        ),
+    }
+    Ok(())
+}
+
+/// MS-2: verify that, for each input, the regions list visits its
+/// row groups in physical (on-disk) order. The streaming engine
+/// drains each input sequentially — once we've moved past RG K's
+/// pages we cannot go back. The composite-key BTreeMap iteration
+/// must agree with that physical ordering for every input.
+///
+/// Disagreement usually means the input file's sort direction was
+/// declared one way but the data was written the other — e.g., the
+/// sort schema says `metric_name ASC` but the file has RG 0 with
+/// metric `z` and RG 1 with metric `a`. Reject upfront with a clear
+/// message rather than letting `process_region` bail mid-merge with
+/// "page from rg X while draining rg Y" — same error class, but
+/// caught at a point where we can report the offending input and
+/// region instead of the lower-level page mismatch.
+pub(crate) fn validate_region_order_matches_physical_rg_order(
+    regions: &[Region],
+    num_inputs: usize,
+) -> Result<()> {
+    let mut last_position_per_input: Vec<Option<(usize, usize)>> = vec![None; num_inputs];
+    for (region_idx, region) in regions.iter().enumerate() {
+        for c in &region.contributing {
+            let position = (c.rg_idx, c.start_row);
+            if let Some(prev) = last_position_per_input[c.input_idx]
+                && position < prev
+            {
+                bail!(
+                    "region iteration disagrees with input {}'s physical row order: region \
+                     {region_idx} wants rg {} row {} but a previous region already passed \
+                     position rg {} row {}. The composite prefix key encoding does not match the \
+                     input's physical layout — check that the sort schema's direction matches how \
+                     the file is actually sorted on disk.",
+                    c.input_idx,
+                    c.rg_idx,
+                    c.start_row,
+                    prev.0,
+                    prev.1,
+                );
+            }
+            last_position_per_input[c.input_idx] = Some((c.rg_idx, c.start_row + c.num_rows));
+        }
+    }
+    Ok(())
+}
+
+/// Build the region list across all inputs.
+///
+/// - If `rg_partition_prefix_len == 0`: all inputs must be single-RG (caller's job to validate);
+///   produces ONE region with each input's only RG. The region's `prefix_key` is empty (no
+///   alignment claim).
+/// - If `rg_partition_prefix_len >= 1`: reads each input's per-RG prefix col stats (must have `min
+///   == max`), groups RGs across inputs by composite prefix key, sorts regions by that key.
+///
+/// Returns regions in ascending byte order of their composite prefix key.
+pub(crate) fn extract_regions_from_metadata(
+    decoders_state: &[InputDecoderState],
+    input_meta: &InputMetadata,
+) -> Result<Vec<Region>> {
+    if input_meta.rg_partition_prefix_len == 0 {
+        // No alignment claim: single region covering each input's only RG.
+        // Multi-RG inputs with prefix_len == 0 are rejected earlier; here
+        // each input is single-RG (or zero-RG).
+        //
+        // The caller (`run_synthesized_prefix_path`) will split this
+        // region further once it has read the merge order and can see
+        // where the first sort col's value transitions.
+        let mut contributing = Vec::new();
+        for (idx, state) in decoders_state.iter().enumerate() {
+            if state.metadata.num_row_groups() == 0 {
+                continue;
+            }
+            let rg_meta = state.metadata.row_group(0);
+            contributing.push(RegionContribution {
+                input_idx: idx,
+                rg_idx: 0,
+                start_row: 0,
+                num_rows: rg_meta.num_rows() as usize,
+            });
+        }
+        if contributing.is_empty() {
+            return Ok(Vec::new());
+        }
+        return Ok(vec![Region {
+            prefix_key: Vec::new(),
+            contributing,
+        }]);
+    }
+
+    // Prefix_len >= 1: build regions by composite prefix key from
+    // per-RG stats. See `extract_rg_composite_prefix_key` for the
+    // per-type encoding.
+    //
+    // **Strong invariant** (enforced here on the merge read path, and
+    // mirrored on both write paths — see `assert_unique_rg_prefix_keys`):
+    // no single input may have two row groups sharing the same composite
+    // prefix key. The streaming engine pairs at most one RG per input
+    // per region (`process_region` keys `sort_col_batches` by
+    // `input_idx`), so a duplicate prefix would silently overwrite the
+    // first RG's sort batch while `Region::total_rows` still counts both
+    // — dropping rows and corrupting body-col / sort-col alignment.
+    //
+    // Cross-input duplicates are fine (and expected — that's the whole
+    // point of region merging). The constraint is **same input, same
+    // prefix key, multiple RGs**: producers must ensure prefix
+    // transitions align with RG boundaries.
+    let mut by_prefix: BTreeMap<Vec<u8>, Vec<RegionContribution>> = BTreeMap::new();
+    let prefix_len = input_meta.rg_partition_prefix_len as usize;
+
+    for (input_idx, state) in decoders_state.iter().enumerate() {
+        if state.metadata.num_row_groups() == 0 {
+            continue;
+        }
+        let prefix_cols = find_prefix_parquet_col_indices(
+            &state.metadata,
+            &input_meta.sort_fields,
+            prefix_len,
+            input_idx,
+        )
+        .with_context(|| format!("resolving prefix cols for input {input_idx}"))?;
+        let mut seen_for_input: HashSet<Vec<u8>> = HashSet::new();
+        for rg_idx in 0..state.metadata.num_row_groups() {
+            let prefix_key =
+                extract_rg_composite_prefix_key(&state.metadata, rg_idx, &prefix_cols, input_idx)?;
+            if !seen_for_input.insert(prefix_key.clone()) {
+                bail!(
+                    "input {input_idx} has rg {rg_idx} sharing a prefix key with an earlier RG in \
+                     the same file. The streaming merge engine requires at-most-one-RG-per-input \
+                     per prefix value (rg_partition_prefix_len = {prefix_len}); the producer must \
+                     ensure prefix transitions align with RG boundaries. Either lower \
+                     rg_partition_prefix_len to include fewer columns, or rewrite the producer to \
+                     start a new RG at every prefix-value change."
+                );
+            }
+            let num_rows = state.metadata.row_group(rg_idx).num_rows() as usize;
+            by_prefix
+                .entry(prefix_key)
+                .or_default()
+                .push(RegionContribution {
+                    input_idx,
+                    rg_idx,
+                    start_row: 0,
+                    num_rows,
+                });
+        }
+    }
+
+    Ok(by_prefix
+        .into_iter()
+        .map(|(prefix_key, contributing)| Region {
+            prefix_key,
+            contributing,
+        })
+        .collect())
+}
+
+/// Post-write check: verify every row group in `metadata` satisfies
+/// the prefix-alignment claim declared by `prefix_len`.
+///
+/// Enforces both halves of the prefix-alignment contract in one pass:
+/// - **PA-1 (intra-RG constancy):** within each RG, each of the first `prefix_len` sort columns has
+///   `min == max` (the column is constant across the RG). This is checked transitively by
+///   [`extract_rg_composite_prefix_key`] — it returns an error when any prefix column's chunk stats
+///   show `min != max`.
+/// - **PA-3 (inter-RG uniqueness):** no two RGs share the same composite prefix value. The
+///   streaming engine pairs at most one input RG per region per prefix value, so a duplicate would
+///   silently drop rows or corrupt the body-col / sort-col mapping.
+///
+/// Returns `Ok(())` immediately when `prefix_len == 0` (no claim to
+/// verify) or `num_rgs == 0` (no RGs to check). Single-RG files are
+/// NOT short-circuited — they still go through PA-1 because an
+/// unsorted single-RG file CAN have `min != max` on a prefix column.
+///
+/// This is the writer-side mirror of the read-side check in
+/// `extract_regions_from_metadata` — both indexing and the compaction
+/// merge output writer call this after sealing a parquet file so a
+/// producer bug never lets a duplicate-prefix file land on disk. See
+/// the doc-comment on `extract_regions_from_metadata` for why
+/// at-most-one-RG-per-prefix is load-bearing for the streaming
+/// engine.
+///
+/// `context` is included in the error message — e.g.,
+/// `"indexing write at <path>"` or `"merge output <split_id>"`.
+pub(crate) fn assert_unique_rg_prefix_keys(
+    metadata: &ParquetMetaData,
+    sort_fields_str: &str,
+    prefix_len: u32,
+    context: &str,
+) -> Result<()> {
+    if prefix_len == 0 {
+        return Ok(());
+    }
+    let num_rgs = metadata.num_row_groups();
+    if num_rgs == 0 {
+        // Zero-RG files vacuously satisfy both halves of the claim.
+        return Ok(());
+    }
+    let prefix_cols =
+        find_prefix_parquet_col_indices(metadata, sort_fields_str, prefix_len as usize, 0)
+            .with_context(|| format!("resolving prefix cols for {context}"))?;
+    let mut seen: HashSet<Vec<u8>> = HashSet::with_capacity(num_rgs);
+    for rg_idx in 0..num_rgs {
+        let key = extract_rg_composite_prefix_key(metadata, rg_idx, &prefix_cols, 0)
+            .with_context(|| format!("extracting prefix key at {context} rg {rg_idx}"))?;
+        if !seen.insert(key) {
+            bail!(
+                "{context}: rg {rg_idx} shares a prefix key with an earlier row group. \
+                 rg_partition_prefix_len = {prefix_len} requires prefix transitions to align with \
+                 row group boundaries. Either lower the prefix length to include fewer columns, \
+                 or change the writer so each RG carries a unique value of the first {prefix_len} \
+                 sort columns."
+            );
+        }
+    }
+    Ok(())
+}
+
+/// Subdivide a region into a sequence of sub-regions whose cumulative
+/// row counts approach `target_per_output`, splitting only at
+/// `sorted_series` transitions within the region's merge order. A
+/// single `sorted_series` run is never broken — if one run exceeds
+/// the remaining budget, the whole run goes to one output anyway.
+///
+/// `first_target` is the budget for the FIRST sub-region (typically
+/// the remaining capacity of the current output being filled by the
+/// caller). Subsequent sub-regions target `target_per_output`.
+/// `outputs_remaining` is the number of output files still available;
+/// when it hits 1 we stop splitting and emit the rest as one sub-
+/// region.
+///
+/// The returned sub-regions:
+/// - Cover the full input region in sort order.
+/// - Each carries per-input row ranges (`start_row`/`num_rows`) inside the same `(input_idx,
+///   rg_idx)` as the parent — sub-regions of one region all share their parent's RGs.
+/// - Inherit the parent's `prefix_key`; the prefix value is constant across the parent and
+///   therefore across every sub-region.
+pub(crate) fn split_region_at_sorted_series(
+    region: &Region,
+    merge_order: &[MergeRun],
+    aligned_sort_batches: &[RecordBatch],
+    first_target: usize,
+    target_per_output: usize,
+    outputs_remaining: usize,
+) -> Result<Vec<Region>> {
+    use arrow::array::BinaryArray;
+
+    use crate::sorted_series::SORTED_SERIES_COLUMN;
+
+    if merge_order.is_empty() {
+        return Ok(Vec::new());
+    }
+    if outputs_remaining <= 1 {
+        return Ok(vec![region.clone()]);
+    }
+
+    // Per-input sorted_series array. compute_merge_order already requires
+    // this column on every input, so a missing column here indicates a bug;
+    // it is tolerated below by treating the run boundary as a transition.
+    let mut ss_arrays: Vec<Option<&BinaryArray>> = Vec::with_capacity(aligned_sort_batches.len());
+    for batch in aligned_sort_batches {
+        match batch.schema().index_of(SORTED_SERIES_COLUMN) {
+            Ok(idx) => {
+                let arr = batch
+                    .column(idx)
+                    .as_any()
+                    .downcast_ref::<BinaryArray>()
+                    .ok_or_else(|| anyhow!("`{SORTED_SERIES_COLUMN}` must be Binary-typed"))?;
+                ss_arrays.push(Some(arr));
+            }
+            Err(_) => ss_arrays.push(None),
+        }
+    }
+
+    let ss_at = |run_idx: usize| -> Option<&[u8]> {
+        let run = &merge_order[run_idx];
+        ss_arrays[run.input_index].map(|a| a.value(run.start_row))
+    };
+
+    // Walk runs. Once the accumulated row count reaches the current
+    // target, split before the next run that starts a new sorted_series
+    // value. We can only split at run boundaries (a run has constant
+    // sorted_series internally), so a giant single-series run simply
+    // lands in one output regardless of size.
+    let mut splits: Vec<std::ops::Range<usize>> = Vec::new();
+    let mut current_start: usize = 0;
+    let mut accumulated: usize = 0;
+    let mut current_target = first_target;
+    let mut outputs_left = outputs_remaining;
+
+    for (run_idx, run) in merge_order.iter().enumerate() {
+        if run_idx > 0 && outputs_left > 1 && accumulated >= current_target {
+            let prev_ss = ss_at(run_idx - 1);
+            let curr_ss = ss_at(run_idx);
+            let at_transition = match (prev_ss, curr_ss) {
+                (Some(a), Some(b)) => a != b,
+                _ => true,
+            };
+            if at_transition {
+                splits.push(current_start..run_idx);
+                current_start = run_idx;
+                accumulated = 0;
+                outputs_left -= 1;
+                current_target = target_per_output;
+            }
+        }
+        accumulated += run.row_count;
+    }
+    splits.push(current_start..merge_order.len());
+
+    // Build each sub-region's contributing list from the runs in its
+    // range. Within a sub-region, each input's rows are contiguous
+    // (the merge engine consumes rows in increasing input-row order
+    // and the parent region's contributions are themselves
+    // contiguous), so the span from the smallest `start_row` to the
+    // largest `start_row + row_count` captures the full slice.
+    let rg_for_input: std::collections::HashMap<usize, usize> = region
+        .contributing
+        .iter()
+        .map(|c| (c.input_idx, c.rg_idx))
+        .collect();
+    let parent_start_row: std::collections::HashMap<usize, usize> = region
+        .contributing
+        .iter()
+        .map(|c| (c.input_idx, c.start_row))
+        .collect();
+
+    let mut sub_regions: Vec<Region> = Vec::with_capacity(splits.len());
+    for range in splits {
+        let mut ranges: BTreeMap<usize, (usize, usize)> = BTreeMap::new();
+        for run in &merge_order[range.clone()] {
+            let entry = ranges
+                .entry(run.input_index)
+                .or_insert((run.start_row, run.start_row));
+            entry.0 = entry.0.min(run.start_row);
+            entry.1 = entry.1.max(run.start_row + run.row_count);
+        }
+        let contributing: Vec<RegionContribution> = ranges
+            .into_iter()
+            .map(|(input_idx, (start, end))| RegionContribution {
+                input_idx,
+                rg_idx: *rg_for_input.get(&input_idx).expect("rg_idx from parent"),
+                // The merge order's run.start_row is local to the
+                // aligned sort batch (which itself is the drained
+                // contribution); add the parent's start_row to get
+                // the absolute row inside the RG.
+                start_row: parent_start_row.get(&input_idx).copied().unwrap_or(0) + start,
+                num_rows: end - start,
+            })
+            .collect();
+        sub_regions.push(Region {
+            prefix_key: region.prefix_key.clone(),
+            contributing,
+        });
+    }
+
+    Ok(sub_regions)
+}
+
+/// Assign each region to an output file index.
+///
+/// Splits the region list across `num_outputs` files, balancing
+/// cumulative row count. Each output file gets a contiguous slice of
+/// the region list (preserving sort-prefix order so output files have
+/// non-overlapping key ranges). Returns a `Vec<usize>` indexed by
+/// `region_idx` with the target output file index.
+///
+/// If `regions.len() < num_outputs`, fewer output files are produced
+/// (matches the non-streaming engine's behaviour when there aren't
+/// enough split points).
+pub(crate) fn assign_regions_to_output_files(regions: &[Region], num_outputs: usize) -> Vec<usize> {
+    let total_rows: usize = regions.iter().map(|r| r.total_rows()).sum();
+    let effective_num_outputs = num_outputs.min(regions.len()).max(1);
+    let target_rows_per_output = total_rows.div_ceil(effective_num_outputs).max(1);
+
+    let mut assignments = Vec::with_capacity(regions.len());
+    let mut current_output = 0;
+    let mut accumulated = 0;
+    for region in regions {
+        // If this region would push us past the target AND we have
+        // budget to start a new output AND the current output already
+        // has rows, advance to next output BEFORE assigning.
+        if accumulated > 0
+            && accumulated + region.total_rows() > target_rows_per_output
+            && current_output + 1 < effective_num_outputs
+        {
+            current_output += 1;
+            accumulated = 0;
+        }
+        assignments.push(current_output);
+        accumulated += region.total_rows();
+    }
+    assignments
+}
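
The greedy assignment in `assign_regions_to_output_files` can be tried in isolation. The following is a minimal sketch, not the engine's code: `assign_rows_to_outputs` is a hypothetical stand-in that takes plain per-region row counts instead of `&[Region]`, on the assumption that `Region::total_rows()` is the only thing the real function reads from each region.

```rust
/// Hypothetical stand-in for `assign_regions_to_output_files`.
/// `region_rows[i]` plays the role of `regions[i].total_rows()`.
fn assign_rows_to_outputs(region_rows: &[usize], num_outputs: usize) -> Vec<usize> {
    let total_rows: usize = region_rows.iter().sum();
    // Never plan more outputs than there are regions, and never fewer than one.
    let effective_num_outputs = num_outputs.min(region_rows.len()).max(1);
    let target_rows_per_output = total_rows.div_ceil(effective_num_outputs).max(1);

    let mut assignments = Vec::with_capacity(region_rows.len());
    let mut current_output = 0;
    let mut accumulated = 0;
    for &rows in region_rows {
        // Advance to the next output only if the current one already holds
        // rows, this region would overshoot the target, and an output is
        // still available.
        if accumulated > 0
            && accumulated + rows > target_rows_per_output
            && current_output + 1 < effective_num_outputs
        {
            current_output += 1;
            accumulated = 0;
        }
        assignments.push(current_output);
        accumulated += rows;
    }
    assignments
}
```

With row counts `[40, 30, 50, 10, 20]` and two outputs, the target is 75 rows per output and the sketch returns `[0, 0, 1, 1, 1]`: the third region would push output 0 past the target, so it opens output 1, and the last region stays on output 1 because no further output is available. Each output receives a contiguous slice of the region list, which is what keeps the output key ranges non-overlapping.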
diff --git a/quickwit/quickwit-parquet-engine/src/merge/tests.rs b/quickwit/quickwit-parquet-engine/src/merge/tests.rs
index 419057db1de..0ef9b45288e 100644
--- a/quickwit/quickwit-parquet-engine/src/merge/tests.rs
+++ b/quickwit/quickwit-parquet-engine/src/merge/tests.rs
@@ -1679,6 +1679,478 @@ fn test_merge_descending_pre_timestamp_column() {
     );
 }
 
+// ---- Engine parity: streaming vs in-memory ----
+//
+// Verifies that `execute_merge_operation` (the streaming column-major engine)
+// and `merge_sorted_parquet_files` (the in-memory engine) produce the same
+// row content for the same inputs. Gates the executor's
+// `parquet_merge_use_streaming_engine` YAML flag: once parity holds on a
+// realistic fixture, production can flip the flag with confidence the
+// streaming engine is a drop-in replacement. The in-memory engine stays in
+// the executor as the runtime fallback until the streaming engine has
+// soaked, at which point both this test and the fallback branch can be
+// removed.
+
+mod parity {
+    use std::io;
+    use std::ops::Range;
+    use std::path::Path;
+
+    use async_trait::async_trait;
+    use bytes::Bytes;
+    use tokio::io::{AsyncRead, AsyncReadExt, AsyncSeekExt};
+
+    use super::*;
+    use crate::merge::execute_merge_operation;
+    use crate::merge::policy::ParquetMergeOperation;
+    use crate::storage::RemoteByteSource;
+
+    /// Mirrors the executor's `LocalFileByteSource` so parity tests exercise
+    /// the same code path production will use: a local Parquet file wrapped
+    /// in `RemoteByteSource` and handed to `execute_merge_operation`.
+    /// Ignores the `_path` argument from the trait surface — the source is
+    /// already bound to one concrete file at construction time.
+    struct LocalFileByteSource {
+        path: PathBuf,
+    }
+
+    #[async_trait]
+    impl RemoteByteSource for LocalFileByteSource {
+        async fn file_size(&self, _path: &Path) -> io::Result<u64> {
+            tokio::fs::metadata(&self.path).await.map(|m| m.len())
+        }
+
+        async fn get_slice(&self, _path: &Path, range: Range<u64>) -> io::Result<Bytes> {
+            let mut file = tokio::fs::File::open(&self.path).await?;
+            file.seek(io::SeekFrom::Start(range.start)).await?;
+            let len = (range.end - range.start) as usize;
+            let mut buf = vec![0u8; len];
+            file.read_exact(&mut buf).await?;
+            Ok(Bytes::from(buf))
+        }
+
+        async fn get_slice_stream(
+            &self,
+            _path: &Path,
+            range: Range<u64>,
+        ) -> io::Result<Box<dyn AsyncRead + Send + Unpin>> {
+            let mut file = tokio::fs::File::open(&self.path).await?;
+            file.seek(io::SeekFrom::Start(range.start)).await?;
+            let len = range.end - range.start;
+            Ok(Box::new(file.take(len)))
+        }
+    }
+
+    /// Build a `ParquetMergeOperation` for a regular (non-promotion) merge
+    /// over the given file paths. Each split's `parquet_file` is set to the
+    /// basename — the streaming engine passes it to `RemoteByteSource`,
+    /// which `LocalFileByteSource` ignores.
+    fn make_regular_merge_op(input_paths: &[PathBuf]) -> ParquetMergeOperation {
+        let splits: Vec<ParquetSplitMetadata> = input_paths
+            .iter()
+            .map(|path| {
+                let name = path
+                    .file_name()
+                    .and_then(|n| n.to_str())
+                    .expect("input path must have a UTF-8 filename");
+                ParquetSplitMetadata::metrics_builder()
+                    .split_id(ParquetSplitId::generate(ParquetSplitKind::Metrics))
+                    .index_uid("test-index:0")
+                    .sort_fields(TEST_SORT_FIELDS)
+                    .window_duration_secs(900)
+                    .window_start_secs(0)
+                    .time_range(TimeRange::new(0, 1))
+                    .rg_partition_prefix_len(0)
+                    .parquet_file(name)
+                    .build()
+            })
+            .collect();
+        ParquetMergeOperation::new(splits)
+    }
+
+    /// Run both engines on the same inputs and assert the materialised row
+    /// content matches column-by-column on every output, in order. Per-output
+    /// physical metadata (page boundaries, row group sizing) is allowed to
+    /// differ — we only assert what the merge contract guarantees to readers.
+    fn assert_engine_parity(input_paths: &[PathBuf], num_outputs: usize) {
+        // The streaming engine increments the process-global
+        // `PEAK_BODY_COL_PAGE_CACHE_LEN` atomic that MS-7 tests
+        // reset-then-read. Serialise against MS-7 by acquiring the
+        // same lock for the duration of the streaming-engine run.
+        let _ms7_guard = crate::merge::streaming::ms7_serial_lock();
+
+        let parent_dir = input_paths
+            .first()
+            .and_then(|p| p.parent())
+            .expect("at least one input expected");
+        let in_memory_out_dir = parent_dir.join("output_in_memory");
+        let streaming_out_dir = parent_dir.join("output_streaming");
+        std::fs::create_dir_all(&in_memory_out_dir).unwrap();
+        std::fs::create_dir_all(&streaming_out_dir).unwrap();
+
+        let config = MergeConfig {
+            num_outputs,
+            writer_config: ParquetWriterConfig::default(),
+        };
+
+        // In-memory engine.
+        let in_memory_outputs =
+            merge_sorted_parquet_files(input_paths, &in_memory_out_dir, &config).unwrap();
+
+        // Streaming engine through the same surface the executor uses.
+        let op = make_regular_merge_op(input_paths);
+        let sources: Vec<Arc<dyn RemoteByteSource>> = input_paths
+            .iter()
+            .map(|path| {
+                Arc::new(LocalFileByteSource { path: path.clone() }) as Arc<dyn RemoteByteSource>
+            })
+            .collect();
+        let runtime = tokio::runtime::Builder::new_current_thread()
+            .enable_all()
+            .build()
+            .unwrap();
+        let streaming_outputs = runtime
+            .block_on(execute_merge_operation(
+                &op,
+                sources,
+                &streaming_out_dir,
+                &config,
+            ))
+            .unwrap();
+
+        assert_eq!(
+            streaming_outputs.len(),
+            in_memory_outputs.len(),
+            "engines disagree on output count"
+        );
+        for (i, (s, m)) in streaming_outputs
+            .iter()
+            .zip(in_memory_outputs.iter())
+            .enumerate()
+        {
+            assert_eq!(
+                s.num_rows, m.num_rows,
+                "output #{i}: engines disagree on row count ({} vs {})",
+                s.num_rows, m.num_rows
+            );
+            let s_batch = read_parquet_file(&s.path);
+            let m_batch = read_parquet_file(&m.path);
+            for col_name in ["metric_name", "timestamp_secs", "value", "timeseries_id"] {
+                if s_batch.schema().index_of(col_name).is_err() {
+                    continue;
+                }
+                let m_idx = m_batch.schema().index_of(col_name).unwrap_or_else(|_| {
+                    panic!("output #{i}: streaming has column {col_name} but in-memory does not")
+                });
+                let s_idx = s_batch.schema().index_of(col_name).unwrap();
+                // Compare column values through the existing string/numeric
+                // extractors; raw Arrow array equality would reject benign
+                // representational differences such as dictionary-key permutations.
+                match s_batch.column(s_idx).data_type() {
+                    DataType::Dictionary(_, _) | DataType::Utf8 | DataType::LargeUtf8 => {
+                        let s_vals = extract_string_column(&s_batch, col_name);
+                        let m_vals = extract_string_column(&m_batch, col_name);
+                        assert_eq!(
+                            s_vals, m_vals,
+                            "output #{i}: column {col_name} (string) differs between engines"
+                        );
+                    }
+                    DataType::UInt64 => {
+                        let s_vals = extract_u64_column(&s_batch, col_name);
+                        let m_vals = extract_u64_column(&m_batch, col_name);
+                        assert_eq!(
+                            s_vals, m_vals,
+                            "output #{i}: column {col_name} (u64) differs between engines"
+                        );
+                    }
+                    DataType::Float64 => {
+                        let s_vals = extract_f64_column(&s_batch, col_name);
+                        let m_vals = extract_f64_column(&m_batch, col_name);
+                        assert_eq!(
+                            s_vals, m_vals,
+                            "output #{i}: column {col_name} (f64) differs between engines"
+                        );
+                    }
+                    DataType::Int64 => {
+                        let s_idx = s_batch.schema().index_of(col_name).unwrap();
+                        let s_col = s_batch
+                            .column(s_idx)
+                            .as_any()
+                            .downcast_ref::<Int64Array>()
+                            .unwrap();
+                        let m_col = m_batch
+                            .column(m_idx)
+                            .as_any()
+                            .downcast_ref::<Int64Array>()
+                            .unwrap();
+                        assert_eq!(
+                            s_col.values(),
+                            m_col.values(),
+                            "output #{i}: column {col_name} (i64) differs between engines"
+                        );
+                    }
+                    other => {
+                        panic!(
+                            "output #{i}: column {col_name} has unexpected type {other:?} for \
+                             parity comparison"
+                        );
+                    }
+                }
+            }
+        }
+
+        // Beyond engine-vs-engine equivalence, both outputs must satisfy
+        // the m:n merge contract: every row preserved exactly once across
+        // outputs, each output internally sorted on `sorted_series`, and
+        // no two outputs sharing any `sorted_series` value (the engine
+        // promises a non-overlapping partition over the input keyspace).
+        // The two engines produce equivalent output, so checking either
+        // is sufficient; the in-memory side is the historical baseline.
+        let total_input_rows: usize = input_paths
+            .iter()
+            .map(|path| {
+                let bytes = std::fs::read(path).expect("read input parquet for row-count");
+                let builder =
+                    parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder::try_new(
+                        Bytes::from(bytes),
+                    )
+                    .expect("input parquet builder");
+                builder.metadata().file_metadata().num_rows() as usize
+            })
+            .sum();
+        assert_multi_output_invariants(&in_memory_outputs, total_input_rows);
+    }
+
+    /// Verify the m:n merge contract on a single engine's outputs:
+    ///
+    /// 1. Sum of per-output row counts equals the total input row count (no duplication, no loss).
+    /// 2. Within each output, the `sorted_series` column is monotonically non-decreasing.
+    /// 3. Across outputs, after sorting by min `sorted_series`, every output's max `sorted_series`
+    ///    is strictly less than the next output's min: the partition is disjoint on the keyspace.
+    ///
+    /// Holds for any merge with `num_outputs >= 1`. For n=1, invariant 3
+    /// is vacuous; invariants 1 and 2 still apply.
+    fn assert_multi_output_invariants(
+        outputs: &[crate::merge::MergeOutputFile],
+        total_input_rows: usize,
+    ) {
+        let total_output_rows: u64 = outputs.iter().map(|o| o.num_rows as u64).sum();
+        assert_eq!(
+            total_output_rows, total_input_rows as u64,
+            "sum of output row counts ({total_output_rows}) must equal total input rows \
+             ({total_input_rows}) — MC-1"
+        );
+
+        let mut series_ranges: Vec<(Vec<u8>, Vec<u8>, &Path)> = Vec::with_capacity(outputs.len());
+        for output in outputs {
+            let batch = read_parquet_file(&output.path);
+            let series = extract_binary_column(&batch, SORTED_SERIES_COLUMN);
+            assert!(
+                !series.is_empty(),
+                "output {} has zero rows (engine should have dropped it)",
+                output.path.display(),
+            );
+            for i in 1..series.len() {
+                assert!(
+                    series[i] >= series[i - 1],
+                    "output {}: sorted_series not monotone at row {i}",
+                    output.path.display(),
+                );
+            }
+            series_ranges.push((
+                series.first().unwrap().clone(),
+                series.last().unwrap().clone(),
+                output.path.as_path(),
+            ));
+        }
+
+        // Sort by min_series so we can walk adjacent pairs to check
+        // disjointness. The engine's staged order isn't a documented
+        // contract — sort here for the comparison.
+        series_ranges.sort_by(|a, b| a.0.cmp(&b.0));
+        for window in series_ranges.windows(2) {
+            let (_, left_max, left_path) = &window[0];
+            let (right_min, _, right_path) = &window[1];
+            assert!(
+                left_max < right_min,
+                "outputs {} and {} overlap on sorted_series: left max = {:?}, right min = {:?}",
+                left_path.display(),
+                right_path.display(),
+                left_max,
+                right_min,
+            );
+        }
+    }
+
+    /// Multi-input, single-output merge. The streaming and in-memory engines
+    /// must produce the same number of rows and the same row-by-row content
+    /// on every visible column.
+    #[test]
+    fn parity_multi_input_single_output() {
+        let dir = TempDir::new().unwrap();
+
+        // Three inputs with overlapping metric names so the merge interleaves
+        // across all three streams (the streaming engine's hot path under the
+        // page-bounded body cache).
+        let input1 = write_test_split(
+            dir.path(),
+            "p1.parquet",
+            &["cpu", "cpu", "mem"],
+            &[300, 100, 200],
+            &[1.0, 2.0, 3.0],
+            &[42, 42, 99],
+        );
+        let input2 = write_test_split(
+            dir.path(),
+            "p2.parquet",
+            &["cpu", "mem", "mem"],
+            &[250, 250, 100],
+            &[4.0, 5.0, 6.0],
+            &[42, 99, 99],
+        );
+        let input3 = write_test_split(
+            dir.path(),
+            "p3.parquet",
+            &["cpu", "mem", "net"],
+            &[200, 150, 100],
+            &[7.0, 8.0, 9.0],
+            &[42, 99, 7],
+        );
+
+        assert_engine_parity(&[input1, input2, input3], 1);
+    }
+
+    /// Multi-input, multi-output merge. With `num_outputs > 1` the streaming
+    /// engine splits at `sorted_series` boundaries; this test guards that the
+    /// split policy and per-output row content match the in-memory engine,
+    /// and that the engine's multi-output contract holds (sum-equals-total,
+    /// internal monotonicity, inter-output disjointness — see
+    /// `assert_multi_output_invariants`).
+    #[test]
+    fn parity_multi_input_multi_output() {
+        let dir = TempDir::new().unwrap();
+
+        let input1 = write_test_split(
+            dir.path(),
+            "m1.parquet",
+            &["alpha", "beta", "gamma"],
+            &[100, 100, 100],
+            &[1.0, 2.0, 3.0],
+            &[1, 2, 3],
+        );
+        let input2 = write_test_split(
+            dir.path(),
+            "m2.parquet",
+            &["alpha", "beta", "gamma"],
+            &[200, 200, 200],
+            &[4.0, 5.0, 6.0],
+            &[1, 2, 3],
+        );
+
+        assert_engine_parity(&[input1, input2], 3);
+    }
+
+    /// Stress variant: 3 inputs × 3 metrics with the per-metric keyspaces
+    /// **overlapping across inputs** (each metric appears in every input
+    /// with the same `timeseries_id` range and overlapping but distinct
+    /// timestamps). The merge must heavily interleave rows from all three
+    /// inputs, not concatenate them. Asserts engine-vs-engine parity plus
+    /// the multi-output contract (disjointness across the three outputs).
+    #[test]
+    fn parity_multi_metric_overlapping_inputs_multi_output() {
+        let dir = TempDir::new().unwrap();
+
+        // Each input has 3 metrics × 10 rows. For metric=alpha, ts_id_base
+        // depends only on the metric name (see write_test_split), so all
+        // three inputs share the same set of timeseries_ids per metric.
+        // Timestamps are chosen so each (metric, timeseries) appears in
+        // all three inputs at three overlapping-but-distinct times — the
+        // merge must interleave row-by-row.
+        let input_x = write_test_split(
+            dir.path(),
+            "x.parquet",
+            // Per-input layout: alpha x10, beta x10, gamma x10 = 30 rows.
+            &[
+                "alpha", "alpha", "alpha", "alpha", "alpha", "alpha", "alpha", "alpha", "alpha",
+                "alpha", "beta", "beta", "beta", "beta", "beta", "beta", "beta", "beta", "beta",
+                "beta", "gamma", "gamma", "gamma", "gamma", "gamma", "gamma", "gamma", "gamma",
+                "gamma", "gamma",
+            ],
+            // input-x: each metric at ts 100..109.
+            &[
+                100, 101, 102, 103, 104, 105, 106, 107, 108, 109, // alpha
+                100, 101, 102, 103, 104, 105, 106, 107, 108, 109, // beta
+                100, 101, 102, 103, 104, 105, 106, 107, 108, 109, // gamma
+            ],
+            &[
+                1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5,
+                2.6, 2.7, 2.8, 2.9, 3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9,
+            ],
+            // Each (metric, row-position) within the input has a unique
+            // ts_id, but cross-input collisions on the same (metric, pos)
+            // ARE intentional — that's what makes the merge interleave.
+            &[
+                1, 2, 3, 4, 5, 6, 7, 8, 9, 10, // alpha: ts_ids 1-10
+                11, 12, 13, 14, 15, 16, 17, 18, 19, 20, // beta: 11-20
+                21, 22, 23, 24, 25, 26, 27, 28, 29, 30, // gamma: 21-30
+            ],
+        );
+        let input_y = write_test_split(
+            dir.path(),
+            "y.parquet",
+            &[
+                "alpha", "alpha", "alpha", "alpha", "alpha", "alpha", "alpha", "alpha", "alpha",
+                "alpha", "beta", "beta", "beta", "beta", "beta", "beta", "beta", "beta", "beta",
+                "beta", "gamma", "gamma", "gamma", "gamma", "gamma", "gamma", "gamma", "gamma",
+                "gamma", "gamma",
+            ],
+            // input-y: each metric at ts 110..119.
+            &[
+                110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 110, 111, 112, 113, 114, 115,
+                116, 117, 118, 119, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119,
+            ],
+            &[
+                4.0, 4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 4.7, 4.8, 4.9, 5.0, 5.1, 5.2, 5.3, 5.4, 5.5,
+                5.6, 5.7, 5.8, 5.9, 6.0, 6.1, 6.2, 6.3, 6.4, 6.5, 6.6, 6.7, 6.8, 6.9,
+            ],
+            // Same ts_id ranges as input-x — collisions intentional.
+            &[
+                1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23,
+                24, 25, 26, 27, 28, 29, 30,
+            ],
+        );
+        let input_z = write_test_split(
+            dir.path(),
+            "z.parquet",
+            &[
+                "alpha", "alpha", "alpha", "alpha", "alpha", "alpha", "alpha", "alpha", "alpha",
+                "alpha", "beta", "beta", "beta", "beta", "beta", "beta", "beta", "beta", "beta",
+                "beta", "gamma", "gamma", "gamma", "gamma", "gamma", "gamma", "gamma", "gamma",
+                "gamma", "gamma",
+            ],
+            // input-z: each metric at ts 105..114 (interleaved with x and y).
+            &[
+                105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 105, 106, 107, 108, 109, 110,
+                111, 112, 113, 114, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114,
+            ],
+            &[
+                7.0, 7.1, 7.2, 7.3, 7.4, 7.5, 7.6, 7.7, 7.8, 7.9, 8.0, 8.1, 8.2, 8.3, 8.4, 8.5,
+                8.6, 8.7, 8.8, 8.9, 9.0, 9.1, 9.2, 9.3, 9.4, 9.5, 9.6, 9.7, 9.8, 9.9,
+            ],
+            &[
+                1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23,
+                24, 25, 26, 27, 28, 29, 30,
+            ],
+        );
+
+        // Three outputs targeted. The engine splits at sorted_series
+        // boundaries; with three distinct metric_name values and a
+        // single timeseries_id per (metric, row-position), there are
+        // enough natural boundaries to produce three outputs.
+        assert_engine_parity(&[input_x, input_y, input_z], 3);
+    }
+}
+
 // ---- Proptest DST: property-based invariant verification ----
 
 mod proptests {
diff --git a/quickwit/quickwit-parquet-engine/src/merge/writer.rs b/quickwit/quickwit-parquet-engine/src/merge/writer.rs
index 47f1dcb65f1..6833820cf0c 100644
--- a/quickwit/quickwit-parquet-engine/src/merge/writer.rs
+++ b/quickwit/quickwit-parquet-engine/src/merge/writer.rs
@@ -169,6 +169,7 @@ pub fn write_merge_outputs(
             path: output_path,
             num_rows: sorted_batch.num_rows(),
             num_row_groups: written.num_row_groups,
+            output_rg_partition_prefix_len: output_prefix_len,
             size_bytes: written.size_bytes,
             row_keys_proto,
             zonemap_regexes,
@@ -185,7 +186,7 @@ pub fn write_merge_outputs(
 ///
 /// Takes the relevant row ranges from each input according to the merge runs,
 /// concatenates into a single batch, and applies the permutation via `take`.
-fn apply_merge_permutation(
+pub(super) fn apply_merge_permutation(
     inputs: &[RecordBatch],
     union_schema: &SchemaRef,
     runs: &[MergeRun],
@@ -254,7 +255,7 @@ fn predict_num_row_groups(num_rows: usize, row_group_size: usize) -> usize {
 /// `qh.rg_partition_prefix_len` KV — caller computes this based on
 /// whether the file is going to be single-RG (preserve input prefix)
 /// or multi-RG (must be 0).
-fn build_merge_kv_metadata(
+pub(super) fn build_merge_kv_metadata(
     input_meta: &InputMetadata,
     row_keys_proto: &Option<Vec<u8>>,
     zonemap_regexes: &std::collections::HashMap<String, String>,
@@ -324,7 +325,10 @@ fn build_merge_kv_metadata(
 }
 
 /// Build `SortingColumn` entries for Parquet file metadata.
-fn build_sorting_columns(batch: &RecordBatch, sort_fields_str: &str) -> Result<Vec<SortingColumn>> {
+pub(super) fn build_sorting_columns(
+    batch: &RecordBatch,
+    sort_fields_str: &str,
+) -> Result<Vec<SortingColumn>> {
     let sort_schema = parse_sort_fields(sort_fields_str)?;
     let schema = batch.schema();
 
@@ -347,7 +351,7 @@ fn build_sorting_columns(batch: &RecordBatch, sort_fields_str: &str) -> Result<V
 
 /// Resolve sort field names from the sort schema string.
 /// Normalizes legacy names (e.g. "timestamp" → "timestamp_secs").
-fn resolve_sort_field_names(sort_fields_str: &str) -> Result<Vec<String>> {
+pub(super) fn resolve_sort_field_names(sort_fields_str: &str) -> Result<Vec<String>> {
     let sort_schema = parse_sort_fields(sort_fields_str)?;
     Ok(sort_schema
         .column
@@ -361,7 +365,7 @@ fn resolve_sort_field_names(sort_fields_str: &str) -> Result<Vec<String>> {
 ///
 /// Checks that sorted_series values are non-decreasing, and within equal
 /// sorted_series values, timestamp_secs respects the schema's sort direction.
-fn verify_sort_order(batch: &RecordBatch, sort_fields_str: &str) {
+pub(super) fn verify_sort_order(batch: &RecordBatch, sort_fields_str: &str) {
     if batch.num_rows() <= 1 {
         return;
     }
diff --git a/quickwit/quickwit-parquet-engine/src/sorted_series/mod.rs b/quickwit/quickwit-parquet-engine/src/sorted_series/mod.rs
index e7fac1e8047..2e9db1203dd 100644
--- a/quickwit/quickwit-parquet-engine/src/sorted_series/mod.rs
+++ b/quickwit/quickwit-parquet-engine/src/sorted_series/mod.rs
@@ -234,29 +234,16 @@ fn encode_row_key(
     buf: &mut Vec<u8>,
 ) -> Result<()> {
     // Encode non-null sort schema columns: ordinal + string value.
+    // Null columns are skipped entirely so the next column's higher
+    // ordinal byte appears in their place — that's how nulls-last
+    // ordering is achieved without a sentinel marker.
     for kc in tag_columns {
         let col = batch.column(kc.batch_idx);
         if col.is_null(row_idx) {
             continue;
         }
         let value = extract_string_value(col.as_ref(), row_idx)?;
-        if kc.descending {
-            // Ordinal is written normally (ascending) so that null rows
-            // (which skip this column entirely) sort after non-null rows
-            // — matching the writer's nulls_first=false behavior.
-            // Only the value bytes are inverted to reverse the sort order.
-            storekey::encode(&mut *buf, &kc.ordinal)
-                .map_err(|e| anyhow!("storekey encode ordinal: {}", e))?;
-            let start = buf.len();
-            storekey::encode(&mut *buf, value)
-                .map_err(|e| anyhow!("storekey encode value: {}", e))?;
-            invert_bytes(&mut buf[start..]);
-        } else {
-            storekey::encode(&mut *buf, &kc.ordinal)
-                .map_err(|e| anyhow!("storekey encode ordinal: {}", e))?;
-            storekey::encode(&mut *buf, value)
-                .map_err(|e| anyhow!("storekey encode value: {}", e))?;
-        }
+        append_prefix_col_to_key(buf, kc.ordinal, value, kc.descending)?;
     }
 
     // Append timeseries_id with its ordinal as the final discriminator.
@@ -270,14 +257,47 @@ fn encode_row_key(
         row_idx
     );
     let ts_id = extract_i64_value(col.as_ref(), row_idx);
-    storekey::encode(&mut *buf, &ts_id_column.ordinal)
-        .map_err(|e| anyhow!("storekey encode timeseries_id ordinal: {}", e))?;
-    storekey::encode(&mut *buf, &ts_id)
-        .map_err(|e| anyhow!("storekey encode timeseries_id: {}", e))?;
+    append_prefix_col_to_key(buf, ts_id_column.ordinal, &ts_id, false)?;
 
     Ok(())
 }
 
+/// Append `(ordinal, value)` to `buf` using the project's standard
+/// storekey-based sort-prefix encoding. This is the single source of
+/// truth for the on-the-wire format of one sort-schema column's
+/// contribution to a composite key — used by both
+/// [`compute_sorted_series_column`] (per-row sorted_series key) and
+/// `merge::region_grouping::extract_rg_composite_prefix_key`
+/// (per-RG prefix key for region grouping). Because both call sites
+/// share this encoding, a per-RG prefix key is a *literal byte
+/// prefix* of every sorted_series key produced by rows in that RG.
+///
+/// Layout: `storekey(ordinal: u8) || storekey(value)`. For descending
+/// columns the *value* bytes are inverted in place (NOT the ordinal
+/// byte) so memcmp on the composite reverses the value's lex order
+/// while ordinals stay in declared order. Caller skips this function
+/// entirely for null columns — the next column's higher ordinal byte
+/// then appears in this column's place, which gives nulls-last
+/// ordering without a sentinel marker (matches the writer's
+/// `nulls_first=false` convention).
+pub(crate) fn append_prefix_col_to_key<T>(
+    buf: &mut Vec<u8>,
+    ordinal: u8,
+    value: &T,
+    descending: bool,
+) -> Result<()>
+where
+    T: ?Sized + storekey::Encode,
+{
+    storekey::encode(&mut *buf, &ordinal).map_err(|e| anyhow!("storekey encode ordinal: {}", e))?;
+    let value_start = buf.len();
+    storekey::encode(&mut *buf, value).map_err(|e| anyhow!("storekey encode value: {}", e))?;
+    if descending {
+        invert_bytes(&mut buf[value_start..]);
+    }
+    Ok(())
+}
+
 /// Bitwise-NOT a byte slice in place, inverting the sort order for
 /// descending columns in the composite key. This is the standard
 /// ordered-code technique: if ascending bytes A < B, then !A > !B,
diff --git a/quickwit/quickwit-parquet-engine/src/storage/legacy_adapter.rs b/quickwit/quickwit-parquet-engine/src/storage/legacy_adapter.rs
new file mode 100644
index 00000000000..91383f8997f
--- /dev/null
+++ b/quickwit/quickwit-parquet-engine/src/storage/legacy_adapter.rs
@@ -0,0 +1,1960 @@
+// Copyright 2021-Present Datadog, Inc.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+//! Legacy multi-row-group input adapter.
+//!
+//! [`StreamingParquetReader`] yields pages in storage order so PR-6's
+//! merge engine can copy them column-major without buffering across row
+//! groups. That works only when the input file's row-group layout
+//! aligns with the sort prefix — i.e., new-format files where
+//! `qh.rg_partition_prefix_len > 0` so each RG is a contiguous run of
+//! the sort prefix, or single-row-group files (the trivial alignment).
+//!
+//! Legacy files that pre-date PR-3 carry `qh.rg_partition_prefix_len ==
+//! 0` AND `num_row_groups > 1`. Their RG boundaries land at arbitrary
+//! row counts inside the sort order, so column-major streaming through
+//! the merge driver isn't possible without buffering across RGs.
+//!
+//! [`LegacyInputAdapter`] handles that case by buffering the whole
+//! file, decoding it through Arrow, concatenating into a single
+//! [`RecordBatch`], and re-encoding it as a prefix-aligned parquet
+//! stream that [`StreamingParquetReader`] can serve. When the caller
+//! passes `target_prefix_len > 0`, the adapter splits the
+//! consolidated batch at transitions of the first `target_prefix_len`
+//! sort columns and declares `qh.rg_partition_prefix_len =
+//! target_prefix_len` on the re-encoded file so the merge engine's
+//! prefix-aware fast path can consume it. The adapter does NOT
+//! re-sort: it relies on the legacy file already being sorted and
+//! returns [`LegacyAdapterError::InputNotSorted`] when it is not.
+//!
+//! When the original file lacks a `qh.sort_fields` KV, or declares
+//! fewer sort columns than requested, `try_open` returns
+//! [`LegacyAdapterError::PrefixUnresolvable`]. The caller can then
+//! pass `target_prefix_len = 0` for a single-row-group re-encode that
+//! claims no prefix alignment; the merge engine consumes it through
+//! its `prefix_len = 0` sub-region splitting path instead.
+//!
+//! Costs: one full-file decode + one full-file re-encode per legacy
+//! input, per merge. This is acceptable because legacy files age out
+//! as they're re-merged in the new format.
+
+// `parquet::format` is the only public path to `PageType` in parquet
+// 58 (the non-deprecated replacements are crate-private). The
+// `format` module is scheduled for removal in parquet 59 and we'll
+// migrate when the new public path lands. Tests in this module
+// inspect page types to verify per-column data-page counts; allowing
+// deprecated items at module scope keeps that lookup direct.
+#![allow(deprecated)]
+
+use std::collections::HashMap;
+use std::io;
+use std::ops::Range;
+use std::path::{Path, PathBuf};
+use std::sync::Arc;
+
+use arrow::array::{ArrayRef, NullArray, RecordBatch};
+use arrow::row::{RowConverter, SortField};
+use async_trait::async_trait;
+use bytes::Bytes;
+use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
+use parquet::errors::ParquetError;
+use parquet::file::metadata::{KeyValue, ParquetMetaData, SortingColumn};
+use parquet::file::properties::WriterProperties;
+use thiserror::Error;
+use tokio::io::AsyncRead;
+
+use super::config::ParquetWriterConfig;
+use super::streaming_reader::{
+    ColumnPageStream, Page, ParquetReadError, RemoteByteSource, StreamingParquetReader,
+};
+use super::streaming_writer::StreamingParquetWriter;
+use super::writer::{
+    PARQUET_META_RG_PARTITION_PREFIX_LEN, PARQUET_META_SORT_FIELDS, ParquetWriteError,
+};
+use crate::sort_fields::{is_timestamp_column_name, parse_sort_fields};
+
+/// Errors from the legacy input adapter.
+///
+/// Each variant preserves the underlying error so callers can
+/// distinguish I/O blips (retry/backoff) from genuine file corruption.
+/// In particular, an I/O error on the buffered GET surfaces as
+/// [`Self::Io`] — never silently re-reported as a decode error.
+#[derive(Error, Debug)]
+pub enum LegacyAdapterError {
+    /// I/O error from the underlying [`RemoteByteSource`].
+    #[error("io error: {0}")]
+    Io(#[from] io::Error),
+
+    /// Parquet decode error while reading the original file (footer
+    /// parse, page decompression, etc.).
+    #[error("failed to decode legacy parquet file: {0}")]
+    ParquetDecode(#[source] ParquetError),
+
+    /// Arrow decode error while consuming the original file's record
+    /// batches. Distinct from [`Self::ParquetDecode`] so the caller
+    /// sees the real cause (arrow projection, type coercion, etc.)
+    /// rather than a generic decode error.
+    #[error("failed to decode legacy parquet record batches: {0}")]
+    ArrowDecode(#[source] arrow::error::ArrowError),
+
+    /// Re-encoding the consolidated [`RecordBatch`] into the output
+    /// parquet stream (single or multiple row groups) failed.
+    #[error("failed to re-encode legacy parquet file as single row group: {0}")]
+    ParquetReencode(#[from] ParquetWriteError),
+
+    /// The streaming reader over the re-encoded in-memory buffer
+    /// failed. Because the in-memory source cannot fail with I/O, this
+    /// almost always indicates a logic bug in the re-encode path.
+    #[error("failed to open streaming reader over re-encoded buffer: {0}")]
+    StreamingReader(#[from] ParquetReadError),
+
+    /// The original file is too large to buffer in memory. The adapter
+    /// is the legacy fallback path; a defensively-sized cap protects
+    /// against pathological inputs.
+    #[error("legacy input file is too large to buffer: {actual} bytes exceeds limit {limit}")]
+    InputTooLarge { actual: u64, limit: u64 },
+
+    /// The caller asked for `target_prefix_len > 0` but the file does
+    /// not advertise enough sort information to honor the request:
+    /// `qh.sort_fields` is absent, or the sort-fields string declares
+    /// fewer columns than requested. Either case means the file lacks
+    /// a name for one of the first `target_prefix_len` sort columns,
+    /// so the adapter can't claim alignment on a column it can't
+    /// identify. (Prefix columns that are *named* in `qh.sort_fields`
+    /// but missing from the arrow schema are NOT an error — per SS-3
+    /// the adapter treats them as implicitly null at every row, which
+    /// trivially satisfies alignment on that column.) The caller
+    /// should retry with a smaller `target_prefix_len` or pass `0` to
+    /// fall through to the single-row-group re-encode.
+    #[error(
+        "cannot honor target_prefix_len = {target}: {reason} (the legacy file does not advertise \
+         enough sort information to safely synthesize prefix-aligned row groups)"
+    )]
+    PrefixUnresolvable { target: u32, reason: String },
+
+    /// The legacy file's rows are not sorted by its declared sort schema
+    /// (SS-1 violation): two row regions in the file carry the same
+    /// composite prefix value with other prefix values in between. The
+    /// adapter walks rows in physical order and emits one RG per
+    /// prefix-value run, so an unsorted input produces multiple RGs
+    /// sharing a prefix key — which violates PA-3 (per-input uniqueness).
+    /// Bail upfront instead of producing a file the downstream merge
+    /// engine will reject mid-merge.
+    #[error(
+        "legacy input is not sorted by its declared sort schema: rows at offset {first_offset} \
+         and offset {second_offset} share composite prefix value (target_prefix_len = {target}). \
+         The adapter relies on the file being sorted per SS-1; an unsorted file would synthesize \
+         multiple row groups with the same prefix key (PA-3 violation)."
+    )]
+    InputNotSorted {
+        target: u32,
+        first_offset: usize,
+        second_offset: usize,
+    },
+}
+
+/// 4 GiB upper bound on the input file size we will buffer into RAM.
+/// Legacy parquet metrics splits in production are well under 1 GiB;
+/// this is a runaway bound, not a typical-case budget.
+const MAX_LEGACY_INPUT_BYTES: u64 = 4 * 1024 * 1024 * 1024;
+
+/// Adapter that exposes a legacy multi-row-group parquet file through
+/// the [`ColumnPageStream`] contract.
+///
+/// Internally buffers the original file, re-encodes it in memory (one row
+/// group per prefix run, or a single row group for `target_prefix_len == 0`),
+/// and serves pages through a [`StreamingParquetReader`] over that buffer.
+/// Carries over the original `qh.*` key-value metadata and `sorting_columns`.
+pub struct LegacyInputAdapter {
+    inner: StreamingParquetReader,
+}
+
+impl LegacyInputAdapter {
+    /// Open the legacy file at `path` through `source`, re-encode it
+    /// into a prefix-aligned parquet stream advertising
+    /// `qh.rg_partition_prefix_len = target_prefix_len`, and prepare
+    /// to serve its pages.
+    ///
+    /// The caller picks `target_prefix_len` based on what the rest of
+    /// the merge plan expects. Typical sources:
+    /// - Match the consensus `rg_partition_prefix_len` of the non-legacy inputs in the same merge
+    ///   (so all inputs end up at one value).
+    /// - Pass `0` when there is no non-legacy input, which produces a single-row-group re-encode
+    ///   and no prefix-alignment claim — the merge engine's `prefix_len = 0` sub-region splitting
+    ///   path handles it.
+    ///
+    /// When `target_prefix_len > 0`, the adapter slices the
+    /// consolidated batch at every transition of the first
+    /// `target_prefix_len` sort columns (composite key, via
+    /// [`RowConverter`]) and emits one output row group per slice.
+    /// Returns an error if the file does not have enough resolvable
+    /// sort columns to honor the request — the caller should either
+    /// retry with a smaller `target_prefix_len` or fall back to `0`.
+    ///
+    /// Issues exactly one buffered GET against `source` (covering the
+    /// whole file). All subsequent reads are served from the in-memory
+    /// re-encoded buffer.
+    pub async fn try_open(
+        source: Arc<dyn RemoteByteSource>,
+        path: PathBuf,
+        target_prefix_len: u32,
+    ) -> Result<Self, LegacyAdapterError> {
+        let file_size = source.file_size(&path).await?;
+        if file_size > MAX_LEGACY_INPUT_BYTES {
+            return Err(LegacyAdapterError::InputTooLarge {
+                actual: file_size,
+                limit: MAX_LEGACY_INPUT_BYTES,
+            });
+        }
+
+        let buffered = source.get_slice(&path, 0..file_size).await?;
+        let reencoded_bytes = reencode_prefix_aligned(buffered, target_prefix_len)?;
+        let reencoded_source: Arc<dyn RemoteByteSource> = Arc::new(InMemoryByteSource {
+            bytes: Bytes::from(reencoded_bytes),
+        });
+        let inner = StreamingParquetReader::try_open(reencoded_source, path).await?;
+        Ok(Self { inner })
+    }
+
+    /// Parsed metadata for the re-encoded file.
+    ///
+    /// Available without further I/O. Schema, `qh.*` key-value entries,
+    /// and `sorting_columns` are preserved from the original input.
+    pub fn metadata(&self) -> &Arc<ParquetMetaData> {
+        self.inner.metadata()
+    }
+}
+
+#[async_trait]
+impl ColumnPageStream for LegacyInputAdapter {
+    fn metadata(&self) -> &Arc<ParquetMetaData> {
+        self.inner.metadata()
+    }
+
+    async fn next_page(&mut self) -> Result<Option<Page>, ParquetReadError> {
+        self.inner.next_page().await
+    }
+}
+
+/// Decode `bytes` into a single concatenated [`RecordBatch`], then
+/// re-encode it according to `target_prefix_len`:
+/// - `target_prefix_len == 0`: emit a single row group with no prefix alignment claim. The original
+///   `qh.*` KV is carried over, except that a pre-existing nonzero `qh.rg_partition_prefix_len`
+///   claim must not leak into the output. The merge engine's `prefix_len = 0` sub-region splitting
+///   path consumes this without further plumbing.
+/// - `target_prefix_len > 0`: slice the consolidated batch at every transition of the first
+///   `target_prefix_len` sort columns (composite key, via [`RowConverter`]) and emit one row group
+///   per distinct composite value. Stamp the output's KV with `qh.rg_partition_prefix_len =
+///   target_prefix_len` so the merge engine's prefix-aware fast path takes over.
+///
+/// When `target_prefix_len > 0` and the requested alignment cannot be honored
+/// (`qh.sort_fields` is absent, or the sort-fields string declares fewer columns
+/// than requested), returns [`LegacyAdapterError::PrefixUnresolvable`]. A prefix
+/// column named in `qh.sort_fields` but missing from the arrow schema is treated
+/// as all-null rather than as an error. The caller can retry with a smaller
+/// `target_prefix_len` or fall back to `0`.
+///
+/// The zero-rows-but-`target_prefix_len > 0` case is degenerate but
+/// still stamps the KV: an empty file vacuously satisfies any prefix
+/// alignment claim.
+fn reencode_prefix_aligned(
+    bytes: Bytes,
+    target_prefix_len: u32,
+) -> Result<Vec<u8>, LegacyAdapterError> {
+    let builder = ParquetRecordBatchReaderBuilder::try_new(bytes)
+        .map_err(LegacyAdapterError::ParquetDecode)?;
+
+    let arrow_schema = builder.schema().clone();
+    let original_metadata = builder.metadata().clone();
+    let original_kv: Option<Vec<KeyValue>> = original_metadata
+        .file_metadata()
+        .key_value_metadata()
+        .cloned();
+    let original_sorting_cols: Option<Vec<SortingColumn>> =
+        carry_sorting_columns(&original_metadata);
+
+    let reader = builder.build().map_err(LegacyAdapterError::ParquetDecode)?;
+    let mut decoded_batches: Vec<RecordBatch> = Vec::new();
+    for batch_result in reader {
+        let batch = batch_result.map_err(LegacyAdapterError::ArrowDecode)?;
+        decoded_batches.push(batch);
+    }
+
+    let consolidated_batch = arrow::compute::concat_batches(&arrow_schema, &decoded_batches)
+        .map_err(LegacyAdapterError::ArrowDecode)?;
+
+    if target_prefix_len == 0 {
+        // Single-RG passthrough: re-encode the input as one row group
+        // with no prefix-alignment claim, so downstream readers take
+        // the legacy `prefix_len = 0` path.
+        //
+        // Preserving the original footer KVs unchanged would leak any
+        // pre-existing `qh.rg_partition_prefix_len = N` (nonzero)
+        // claim from the input into the output. The output is *one*
+        // row group whose rows can carry multiple prefix-column
+        // values, so advertising N > 0 would cause downstream to take
+        // the prefix-aware path and then fail PA-1 on the multi-value
+        // RG. Strip the prefix KV instead; absence of the key is the
+        // legacy convention for "no alignment claim".
+        let cleaned_kv = original_kv.map(|mut kvs| {
+            kvs.retain(|k| k.key != PARQUET_META_RG_PARTITION_PREFIX_LEN);
+            kvs
+        });
+        let props = build_writer_properties(
+            &arrow_schema,
+            original_sorting_cols.unwrap_or_default(),
+            cleaned_kv,
+            consolidated_batch.num_rows(),
+        );
+        return write_single_row_group(arrow_schema, props, consolidated_batch);
+    }
+
+    let prefix_col_indices =
+        resolve_prefix_sort_cols(original_kv.as_ref(), &arrow_schema, target_prefix_len)?;
+
+    let slices = if consolidated_batch.num_rows() == 0 {
+        Vec::new()
+    } else {
+        compute_prefix_value_slices(&consolidated_batch, &prefix_col_indices, target_prefix_len)?
+    };
+    let kv_with_prefix = inject_prefix_len_kv(original_kv, target_prefix_len);
+    let props = build_writer_properties(
+        &arrow_schema,
+        original_sorting_cols.unwrap_or_default(),
+        Some(kv_with_prefix),
+        consolidated_batch.num_rows(),
+    );
+    write_multi_row_group(arrow_schema, props, consolidated_batch, &slices)
+}
+
+/// Resolve the first `prefix_len` sort columns from `qh.sort_fields`
+/// to arrow-schema indices. Honors the `timestamp` / `timestamp_secs`
+/// alias the rest of the engine uses.
+///
+/// Returns one entry per requested prefix column: `Some(idx)` if the
+/// column is present in the schema, or `None` if the column is
+/// declared in `qh.sort_fields` but absent from the arrow schema
+/// (treated as implicitly null at every row per SS-3). Returns
+/// [`LegacyAdapterError::PrefixUnresolvable`] only when the file
+/// doesn't advertise enough sort-column *names* (missing/unparseable
+/// `qh.sort_fields`, or declares fewer columns than requested) —
+/// those are cases where we don't even know which column the prefix
+/// alignment is supposed to be on.
+fn resolve_prefix_sort_cols(
+    kv: Option<&Vec<KeyValue>>,
+    arrow_schema: &arrow::datatypes::Schema,
+    prefix_len: u32,
+) -> Result<Vec<Option<usize>>, LegacyAdapterError> {
+    debug_assert!(prefix_len > 0);
+    let sort_fields_str = kv
+        .and_then(|kvs| kvs.iter().find(|k| k.key == PARQUET_META_SORT_FIELDS))
+        .and_then(|kv| kv.value.as_deref())
+        .ok_or_else(|| LegacyAdapterError::PrefixUnresolvable {
+            target: prefix_len,
+            reason: format!("{PARQUET_META_SORT_FIELDS} KV is absent"),
+        })?;
+    let parsed =
+        parse_sort_fields(sort_fields_str).map_err(|e| LegacyAdapterError::PrefixUnresolvable {
+            target: prefix_len,
+            reason: format!("{PARQUET_META_SORT_FIELDS} is unparseable: {e}"),
+        })?;
+    let prefix_len_usize = prefix_len as usize;
+    if parsed.column.len() < prefix_len_usize {
+        return Err(LegacyAdapterError::PrefixUnresolvable {
+            target: prefix_len,
+            reason: format!(
+                "{PARQUET_META_SORT_FIELDS} declares only {} sort columns",
+                parsed.column.len(),
+            ),
+        });
+    }
+    let mut indices = Vec::with_capacity(prefix_len_usize);
+    for sf in parsed.column.iter().take(prefix_len_usize) {
+        let resolved = if is_timestamp_column_name(&sf.name)
+            && arrow_schema.index_of("timestamp_secs").is_ok()
+        {
+            "timestamp_secs"
+        } else {
+            sf.name.as_str()
+        };
+        // Missing column → implicit null per SS-3. A column that is
+        // null at every row is constant, which trivially satisfies
+        // alignment on that column. The transition computation
+        // synthesizes a NullArray in its place.
+        //
+        // Adapter-vs-reader consistency note (Codex P2 on PR #6425):
+        // we record `None` here and the caller stamps
+        // `rg_partition_prefix_len = target_prefix_len` on the output.
+        // The matching reader-side handling — `find_prefix_parquet_col_indices`
+        // returning `Vec<Option<PrefixColumn>>` and synthesizing a
+        // constant `[0x00, 0x00]` byte sequence for `None` slots —
+        // lands in the next slice (PR #6426, the hardening PR; F12
+        // from the adversarial review). The only caller of this
+        // adapter is `execute_merge_operation` introduced in
+        // PR #6423, which sits above #6426 in the stack, so there is
+        // no production caller that can produce a missing-column
+        // prefix until the reader fix is in place.
+        indices.push(arrow_schema.index_of(resolved).ok());
+    }
+    Ok(indices)
+}
+
+/// Walk the composite prefix value row-by-row over the columns at
+/// `prefix_col_indices` and produce `(start, len)` slices, one per
+/// distinct composite-value run. Uses a single [`RowConverter`] over
+/// all prefix columns so dictionary / utf8 / primitive types are
+/// handled uniformly and N-column equality is a single byte-slice
+/// comparison per row.
+///
+/// An entry of `None` in `prefix_col_indices` represents a prefix
+/// column that is named in `qh.sort_fields` but absent from the
+/// file's arrow schema. Per SS-3 those rows are treated as having
+/// null values, so this function materializes a [`NullArray`] of the
+/// batch's length in that slot. A column that's null at every row is
+/// constant and contributes no transitions to the composite key —
+/// equivalent to skipping it, but kept explicit so the resulting
+/// alignment claim matches the caller's requested `target_prefix_len`.
+///
+/// Detects SS-1 violations (unsorted input) up-front: each emitted
+/// slice's composite prefix-value bytes must be unique. If two
+/// non-adjacent slices carry the same prefix value (e.g., rows
+/// `[A,A,B,B,A,A]`), the input is not sorted by its declared sort
+/// schema, so we'd synthesize a file with two RGs sharing the prefix
+/// — a PA-3 violation the downstream merge engine would reject
+/// mid-merge. Bailing here with `InputNotSorted` keeps that bad file
+/// from ever landing on disk.
+fn compute_prefix_value_slices(
+    batch: &RecordBatch,
+    prefix_col_indices: &[Option<usize>],
+    target_prefix_len: u32,
+) -> Result<Vec<(usize, usize)>, LegacyAdapterError> {
+    let n = batch.num_rows();
+    let cols: Vec<ArrayRef> = prefix_col_indices
+        .iter()
+        .map(|idx_opt| match idx_opt {
+            Some(idx) => Arc::clone(batch.column(*idx)),
+            None => Arc::new(NullArray::new(n)) as ArrayRef,
+        })
+        .collect();
+    let sort_fields: Vec<SortField> = cols
+        .iter()
+        .map(|c| SortField::new(c.data_type().clone()))
+        .collect();
+    let converter = RowConverter::new(sort_fields).map_err(LegacyAdapterError::ArrowDecode)?;
+    let rows = converter
+        .convert_columns(&cols)
+        .map_err(LegacyAdapterError::ArrowDecode)?;
+    let n_rows = rows.num_rows();
+    if n_rows == 0 {
+        return Ok(Vec::new());
+    }
+    // Track each emitted slice's starting prefix-value bytes; any
+    // repeat signals SS-1 violation on the input.
+    let mut seen: HashMap<Vec<u8>, usize> = HashMap::new();
+    let mut slices = Vec::new();
+    let mut start = 0;
+    let record_slice = |slices: &mut Vec<(usize, usize)>,
+                        seen: &mut HashMap<Vec<u8>, usize>,
+                        slice_start: usize,
+                        slice_len: usize|
+     -> Result<(), LegacyAdapterError> {
+        let key = rows.row(slice_start).as_ref().to_vec();
+        if let Some(&first_offset) = seen.get(&key) {
+            return Err(LegacyAdapterError::InputNotSorted {
+                target: target_prefix_len,
+                first_offset,
+                second_offset: slice_start,
+            });
+        }
+        seen.insert(key, slice_start);
+        slices.push((slice_start, slice_len));
+        Ok(())
+    };
+    for i in 1..n_rows {
+        if rows.row(i) != rows.row(i - 1) {
+            record_slice(&mut slices, &mut seen, start, i - start)?;
+            start = i;
+        }
+    }
+    record_slice(&mut slices, &mut seen, start, n_rows - start)?;
+    Ok(slices)
+}
+
+/// Inject (or replace) the `qh.rg_partition_prefix_len` KV entry on
+/// the re-encoded file. Legacy files omit this key entirely; the
+/// re-encoded output advertises the synthesized prefix alignment so
+/// the merge engine's reader picks the fast path.
+fn inject_prefix_len_kv(original: Option<Vec<KeyValue>>, prefix_len: u32) -> Vec<KeyValue> {
+    let mut kvs = original.unwrap_or_default();
+    kvs.retain(|k| k.key != PARQUET_META_RG_PARTITION_PREFIX_LEN);
+    kvs.push(KeyValue::new(
+        PARQUET_META_RG_PARTITION_PREFIX_LEN.to_string(),
+        prefix_len.to_string(),
+    ));
+    kvs
+}
+
+/// Write `batch` to a multi-row-group parquet stream: one RG per
+/// `(start, len)` slice in `slices`. Slices are emitted in order, so
+/// the sort order observed by readers matches the order of the
+/// consolidated batch.
+fn write_multi_row_group(
+    arrow_schema: arrow::datatypes::SchemaRef,
+    props: WriterProperties,
+    batch: RecordBatch,
+    slices: &[(usize, usize)],
+) -> Result<Vec<u8>, LegacyAdapterError> {
+    let mut out: Vec<u8> = Vec::new();
+    let mut writer = StreamingParquetWriter::try_new(&mut out, arrow_schema, props)?;
+    for &(start, len) in slices {
+        if len == 0 {
+            continue;
+        }
+        let mut row_group = writer.start_row_group()?;
+        for col_idx in 0..batch.num_columns() {
+            let slice = batch.column(col_idx).slice(start, len);
+            row_group.write_next_column(&slice)?;
+        }
+        row_group.finish()?;
+    }
+    writer.close()?;
+    Ok(out)
+}
+
+/// Read sorting columns from row group 0 of `metadata`, if present.
+/// Legacy files written by `ParquetWriter` set sorting_columns
+/// identically across row groups; preserving from RG0 is sufficient.
+fn carry_sorting_columns(metadata: &ParquetMetaData) -> Option<Vec<SortingColumn>> {
+    if metadata.num_row_groups() == 0 {
+        return None;
+    }
+    metadata.row_group(0).sorting_columns().cloned()
+}
+
+/// Build [`WriterProperties`] for the re-encoded file. Sets
+/// `set_max_row_group_row_count(num_rows + 1)` so the row-count cap
+/// never forces a rollover; row-group boundaries come only from the
+/// explicit `start_row_group` calls in `write_single_row_group` and
+/// `write_multi_row_group`. Preserves `key_value_metadata` and
+/// `sorting_columns`; otherwise inherits the same compression /
+/// dictionary / bloom-filter defaults as production output.
+fn build_writer_properties(
+    arrow_schema: &arrow::datatypes::Schema,
+    sorting_cols: Vec<SortingColumn>,
+    kv_metadata: Option<Vec<KeyValue>>,
+    num_rows: usize,
+) -> WriterProperties {
+    let cfg = ParquetWriterConfig::default();
+    // Sort field names drive bloom-filter and dictionary configuration.
+    // We derive them from the carried sorting_columns so the output
+    // mirrors the original's encoding choices on the same columns.
+    let sort_field_names = sort_field_names_from_columns(arrow_schema, &sorting_cols);
+
+    let base_props = cfg.to_writer_properties_with_metadata(
+        arrow_schema,
+        sorting_cols,
+        kv_metadata,
+        &sort_field_names,
+    );
+
+    // Force a single row group: set the row-count cap above the total
+    // row count so the writer never rolls over. `num_rows + 1` is
+    // sufficient; saturating add guards against an unrealistic
+    // `usize::MAX`-sized input.
+    let single_rg_cap = num_rows.saturating_add(1).max(1);
+    base_props
+        .into_builder()
+        .set_max_row_group_row_count(Some(single_rg_cap))
+        .build()
+}
+
+/// Resolve `sorting_cols` (carrying parquet column indices) back to
+/// the corresponding arrow field names. Indices that fall outside the
+/// schema are skipped — this should not happen for well-formed files
+/// but we don't want a malformed legacy header to panic.
+fn sort_field_names_from_columns(
+    arrow_schema: &arrow::datatypes::Schema,
+    sorting_cols: &[SortingColumn],
+) -> Vec<String> {
+    let fields = arrow_schema.fields();
+    let mut names = Vec::with_capacity(sorting_cols.len());
+    for col in sorting_cols {
+        let idx = col.column_idx as usize;
+        if idx < fields.len() {
+            names.push(fields[idx].name().to_string());
+        }
+    }
+    names
+}
+
+/// Write `batch` into a single-row-group parquet stream using the
+/// streaming writer. Returns the encoded bytes.
+fn write_single_row_group(
+    arrow_schema: arrow::datatypes::SchemaRef,
+    props: WriterProperties,
+    batch: RecordBatch,
+) -> Result<Vec<u8>, LegacyAdapterError> {
+    let mut out: Vec<u8> = Vec::new();
+    let mut writer = StreamingParquetWriter::try_new(&mut out, arrow_schema, props)?;
+
+    // Even an empty input gets a row group so downstream tooling sees
+    // a structurally consistent file (one RG, num_rows == 0). Only a
+    // batch with zero rows AND zero columns skips the row group; the
+    // merge engine treats a zero-row-group file as drained
+    // immediately. Writer errors are propagated, not retried.
+    if batch.num_rows() > 0 || batch.num_columns() > 0 {
+        let mut row_group = writer.start_row_group()?;
+        for col_idx in 0..batch.num_columns() {
+            row_group.write_next_column(batch.column(col_idx))?;
+        }
+        row_group.finish()?;
+    }
+
+    writer.close()?;
+    Ok(out)
+}
+
+/// In-memory [`RemoteByteSource`] backing a re-encoded buffer.
+///
+/// Private to this module: it exists only to feed a
+/// [`StreamingParquetReader`] over the re-encoded bytes. We
+/// deliberately do NOT expose it crate-wide.
+struct InMemoryByteSource {
+    bytes: Bytes,
+}
+
+#[async_trait]
+impl RemoteByteSource for InMemoryByteSource {
+    async fn file_size(&self, _path: &Path) -> io::Result<u64> {
+        Ok(self.bytes.len() as u64)
+    }
+
+    async fn get_slice(&self, _path: &Path, range: Range<u64>) -> io::Result<Bytes> {
+        let start = range.start as usize;
+        let end = range.end as usize;
+        if end > self.bytes.len() || start > end {
+            return Err(io::Error::other(format!(
+                "in-memory range {start}..{end} out of bounds for {} byte buffer",
+                self.bytes.len(),
+            )));
+        }
+        Ok(self.bytes.slice(start..end))
+    }
+
+    async fn get_slice_stream(
+        &self,
+        _path: &Path,
+        range: Range<u64>,
+    ) -> io::Result<Box<dyn AsyncRead + Send + Unpin>> {
+        let start = range.start as usize;
+        let end = range.end as usize;
+        if end > self.bytes.len() || start > end {
+            return Err(io::Error::other(format!(
+                "in-memory range {start}..{end} out of bounds for {} byte buffer",
+                self.bytes.len(),
+            )));
+        }
+        let slice = self.bytes.slice(start..end);
+        Ok(Box::new(io::Cursor::new(slice.to_vec())))
+    }
+}
+
+#[cfg(test)]
+mod tests {
+    use std::sync::Mutex;
+    use std::sync::atomic::{AtomicUsize, Ordering};
+
+    use arrow::array::{
+        ArrayRef, DictionaryArray, Float64Array, Int64Array, StringArray, UInt8Array, UInt64Array,
+    };
+    use arrow::datatypes::{DataType, Field, Int32Type, Schema as ArrowSchema};
+    use parquet::arrow::ArrowWriter;
+    use parquet::file::properties::WriterProperties;
+
+    use super::*;
+    use crate::storage::ParquetWriterConfig;
+
+    // -------- Fixtures --------
+
+    fn make_metrics_batch(num_rows: usize) -> RecordBatch {
+        let dict_type = DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8));
+        let schema = Arc::new(ArrowSchema::new(vec![
+            Field::new("metric_name", dict_type.clone(), false),
+            Field::new("metric_type", DataType::UInt8, false),
+            Field::new("timestamp_secs", DataType::UInt64, false),
+            Field::new("value", DataType::Float64, false),
+            Field::new("timeseries_id", DataType::Int64, false),
+            Field::new("service", dict_type, true),
+        ]));
+
+        let metric_keys: Vec<i32> = (0..num_rows as i32).map(|i| i % 2).collect();
+        let metric_values = StringArray::from(vec!["cpu.usage", "memory.used"]);
+        let metric_name: ArrayRef = Arc::new(
+            DictionaryArray::<Int32Type>::try_new(
+                arrow::array::Int32Array::from(metric_keys),
+                Arc::new(metric_values),
+            )
+            .expect("test dict array"),
+        );
+        let metric_type: ArrayRef = Arc::new(UInt8Array::from(vec![0u8; num_rows]));
+        let timestamps: Vec<u64> = (0..num_rows as u64).map(|i| 1_700_000_000 + i).collect();
+        let timestamp_secs: ArrayRef = Arc::new(UInt64Array::from(timestamps));
+        let values: Vec<f64> = (0..num_rows).map(|i| i as f64).collect();
+        let value: ArrayRef = Arc::new(Float64Array::from(values));
+        let tsids: Vec<i64> = (0..num_rows as i64).map(|i| 1000 + i).collect();
+        let timeseries_id: ArrayRef = Arc::new(Int64Array::from(tsids));
+        // Service is nullable: every 5th row is null to exercise the
+        // null-mask preservation guarantee.
+        let svc_keys: Vec<Option<i32>> = (0..num_rows as i32)
+            .map(|i| if i % 5 == 0 { None } else { Some(i % 3) })
+            .collect();
+        let svc_values = StringArray::from(vec!["api", "db", "cache"]);
+        let service: ArrayRef = Arc::new(
+            DictionaryArray::<Int32Type>::try_new(
+                arrow::array::Int32Array::from(svc_keys),
+                Arc::new(svc_values),
+            )
+            .expect("test dict array"),
+        );
+
+        RecordBatch::try_new(
+            schema,
+            vec![
+                metric_name,
+                metric_type,
+                timestamp_secs,
+                value,
+                timeseries_id,
+                service,
+            ],
+        )
+        .expect("test batch")
+    }
+
+    fn writer_props_multi_rg(
+        arrow_schema: &ArrowSchema,
+        kvs: Vec<KeyValue>,
+        sorting_cols: Vec<SortingColumn>,
+        rows_per_rg: usize,
+    ) -> WriterProperties {
+        let cfg = ParquetWriterConfig::default();
+        let sort_field_names: Vec<String> =
+            sort_field_names_from_columns(arrow_schema, &sorting_cols);
+        let base = cfg.to_writer_properties_with_metadata(
+            arrow_schema,
+            sorting_cols,
+            Some(kvs),
+            &sort_field_names,
+        );
+        // Force multi-RG output: cap the row count per RG so we hit
+        // the rollover boundary on a moderate fixture.
+        base.into_builder()
+            .set_max_row_group_row_count(Some(rows_per_rg))
+            .build()
+    }
+
+    /// Write `batches` into a multi-row-group parquet file with
+    /// row-group rollover every `rows_per_rg` rows. Returns the bytes.
+    fn write_multi_rg_file(
+        batches: &[RecordBatch],
+        kvs: Vec<KeyValue>,
+        sorting_cols: Vec<SortingColumn>,
+        rows_per_rg: usize,
+    ) -> Bytes {
+        let arrow_schema = batches[0].schema();
+        let props = writer_props_multi_rg(&arrow_schema, kvs, sorting_cols, rows_per_rg);
+        let mut out: Vec<u8> = Vec::new();
+        {
+            let mut writer = ArrowWriter::try_new(&mut out, arrow_schema, Some(props))
+                .expect("test arrow writer");
+            for (idx, batch) in batches.iter().enumerate() {
+                writer.write(batch).expect("test write");
+                if idx + 1 < batches.len() {
+                    writer.flush().expect("test flush");
+                }
+            }
+            writer.close().expect("test close");
+        }
+        Bytes::from(out)
+    }
+
+    /// `RemoteByteSource` that records call counts so tests can assert
+    /// the adapter issues exactly one buffered GET.
+    struct CountingInMemorySource {
+        bytes: Bytes,
+        slice_calls: AtomicUsize,
+        last_slice_range: Mutex<Option<Range<u64>>>,
+    }
+
+    impl CountingInMemorySource {
+        fn new(bytes: Bytes) -> Arc<Self> {
+            Arc::new(Self {
+                bytes,
+                slice_calls: AtomicUsize::new(0),
+                last_slice_range: Mutex::new(None),
+            })
+        }
+    }
+
+    #[async_trait]
+    impl RemoteByteSource for CountingInMemorySource {
+        async fn file_size(&self, _path: &Path) -> io::Result<u64> {
+            Ok(self.bytes.len() as u64)
+        }
+
+        async fn get_slice(&self, _path: &Path, range: Range<u64>) -> io::Result<Bytes> {
+            self.slice_calls.fetch_add(1, Ordering::SeqCst);
+            *self.last_slice_range.lock().expect("test mutex") = Some(range.clone());
+            Ok(self.bytes.slice(range.start as usize..range.end as usize))
+        }
+
+        async fn get_slice_stream(
+            &self,
+            _path: &Path,
+            range: Range<u64>,
+        ) -> io::Result<Box<dyn AsyncRead + Send + Unpin>> {
+            let slice = self.bytes.slice(range.start as usize..range.end as usize);
+            Ok(Box::new(io::Cursor::new(slice.to_vec())))
+        }
+    }
+
+    /// `RemoteByteSource` whose `get_slice` always fails with a
+    /// distinctive `io::Error`. Used to verify that the adapter
+    /// surfaces the underlying I/O error rather than masking it.
+    struct AlwaysFailingSliceSource {
+        file_size: u64,
+    }
+
+    #[async_trait]
+    impl RemoteByteSource for AlwaysFailingSliceSource {
+        async fn file_size(&self, _path: &Path) -> io::Result<u64> {
+            Ok(self.file_size)
+        }
+
+        async fn get_slice(&self, _path: &Path, _range: Range<u64>) -> io::Result<Bytes> {
+            Err(io::Error::other("simulated slice failure"))
+        }
+
+        async fn get_slice_stream(
+            &self,
+            _path: &Path,
+            _range: Range<u64>,
+        ) -> io::Result<Box<dyn AsyncRead + Send + Unpin>> {
+            Err(io::Error::other("simulated slice stream failure"))
+        }
+    }
+
+    fn dummy_path() -> PathBuf {
+        PathBuf::from("legacy_test.parquet")
+    }
+
+    /// Build a multi-RG fixture whose rows are sorted by `metric_name`
+    /// (so consolidating them produces a batch with contiguous
+    /// metric_name runs, which is what the legacy adapter expects on
+    /// real legacy files). `metrics` is `(name, rows_per_metric)` in
+    /// the order they should appear; the writer rolls a new RG every
+    /// `rows_per_rg` so the multi-RG structure is exercised
+    /// independently of the metric_name partitioning.
+    fn write_sorted_multi_rg_legacy_file(
+        metrics: &[(&str, usize)],
+        sort_fields_value: &str,
+        rows_per_rg: usize,
+    ) -> Bytes {
+        let dict_type = DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8));
+        let schema = Arc::new(ArrowSchema::new(vec![
+            Field::new("metric_name", dict_type.clone(), false),
+            Field::new("metric_type", DataType::UInt8, false),
+            Field::new("timestamp_secs", DataType::UInt64, false),
+            Field::new("value", DataType::Float64, false),
+            Field::new("timeseries_id", DataType::Int64, false),
+            Field::new("service", dict_type, true),
+        ]));
+
+        let total: usize = metrics.iter().map(|(_, n)| *n).sum();
+        let metric_names_vec: Vec<&str> = metrics.iter().map(|(name, _)| *name).collect();
+        let mut metric_keys: Vec<i32> = Vec::with_capacity(total);
+        let mut tsids: Vec<i64> = Vec::with_capacity(total);
+        let mut timestamps: Vec<u64> = Vec::with_capacity(total);
+        let mut values: Vec<f64> = Vec::with_capacity(total);
+        let mut row_idx: u64 = 0;
+        for (metric_idx, (_, count)) in metrics.iter().enumerate() {
+            for run_offset in 0..*count {
+                metric_keys.push(metric_idx as i32);
+                tsids.push(1000 + row_idx as i64);
+                // -timestamp_secs/V2 in the sort schema means
+                // timestamps DESC within a metric run. Count down from
+                // the run's own offset (not the global `row_idx`) so
+                // this also holds when runs differ in length.
+                timestamps.push(1_700_000_000 + (*count - run_offset) as u64);
+                values.push(row_idx as f64);
+                row_idx += 1;
+            }
+        }
+        let metric_name: ArrayRef = Arc::new(
+            DictionaryArray::<Int32Type>::try_new(
+                arrow::array::Int32Array::from(metric_keys),
+                Arc::new(StringArray::from(metric_names_vec)),
+            )
+            .expect("metric dict"),
+        );
+        let metric_type: ArrayRef = Arc::new(UInt8Array::from(vec![0u8; total]));
+        let timestamp_secs: ArrayRef = Arc::new(UInt64Array::from(timestamps));
+        let value: ArrayRef = Arc::new(Float64Array::from(values));
+        let timeseries_id: ArrayRef = Arc::new(Int64Array::from(tsids));
+        let svc_keys: Vec<Option<i32>> = (0..total as i32).map(|i| Some(i % 3)).collect();
+        let service: ArrayRef = Arc::new(
+            DictionaryArray::<Int32Type>::try_new(
+                arrow::array::Int32Array::from(svc_keys),
+                Arc::new(StringArray::from(vec!["api", "db", "cache"])),
+            )
+            .expect("svc dict"),
+        );
+        let batch = RecordBatch::try_new(
+            schema.clone(),
+            vec![
+                metric_name,
+                metric_type,
+                timestamp_secs,
+                value,
+                timeseries_id,
+                service,
+            ],
+        )
+        .expect("sorted fixture batch");
+
+        let kvs = vec![KeyValue::new(
+            PARQUET_META_SORT_FIELDS.to_string(),
+            sort_fields_value.to_string(),
+        )];
+        let sorting_cols = default_sorting_cols(&schema);
+        write_multi_rg_file(&[batch], kvs, sorting_cols, rows_per_rg)
+    }
+
+    fn default_sorting_cols(arrow_schema: &ArrowSchema) -> Vec<SortingColumn> {
+        vec![
+            SortingColumn {
+                column_idx: arrow_schema.index_of("metric_name").expect("test schema") as i32,
+                descending: false,
+                nulls_first: false,
+            },
+            SortingColumn {
+                column_idx: arrow_schema
+                    .index_of("timestamp_secs")
+                    .expect("test schema") as i32,
+                descending: true,
+                nulls_first: false,
+            },
+        ]
+    }
+
+    /// Drain all pages from a `ColumnPageStream`.
+    async fn drain_pages_via_trait(stream: &mut dyn ColumnPageStream) -> Vec<Page> {
+        let mut pages = Vec::new();
+        while let Some(p) = stream.next_page().await.expect("page read") {
+            pages.push(p);
+        }
+        pages
+    }
+
+    /// Read a parquet file from `bytes` into a single concatenated
+    /// `RecordBatch` for byte-equality comparisons.
+    fn read_back_to_single_batch(bytes: Bytes) -> RecordBatch {
+        let builder = ParquetRecordBatchReaderBuilder::try_new(bytes).expect("read back builder");
+        let arrow_schema = builder.schema().clone();
+        let reader = builder.build().expect("read back build");
+        let batches: Vec<RecordBatch> = reader.collect::<Result<Vec<_>, _>>().expect("read back");
+        if batches.is_empty() {
+            RecordBatch::new_empty(arrow_schema)
+        } else {
+            arrow::compute::concat_batches(&arrow_schema, &batches).expect("concat")
+        }
+    }
+
+    // -------- Tests --------
+
+    #[tokio::test]
+    async fn test_empty_multi_rg_input() {
+        let arrow_schema = make_metrics_batch(1).schema();
+        let empty_batch = RecordBatch::new_empty(arrow_schema.clone());
+        // Two empty batches. `ArrowWriter` may skip zero-row writes, so
+        // the fixture can end up with no row groups at all.
+        let bytes = write_multi_rg_file(
+            &[empty_batch.clone(), empty_batch],
+            Vec::new(),
+            default_sorting_cols(&arrow_schema),
+            1,
+        );
+        let source = CountingInMemorySource::new(bytes);
+        let adapter = LegacyInputAdapter::try_open(source.clone(), dummy_path(), 0)
+            .await
+            .expect("adapter open");
+
+        let total_rows: i64 = (0..adapter.metadata().num_row_groups())
+            .map(|i| adapter.metadata().row_group(i).num_rows())
+            .sum();
+        assert_eq!(
+            total_rows, 0,
+            "row count must be preserved across re-encode"
+        );
+
+        // The adapter issues one buffered GET for the legacy input. The
+        // streaming reader's footer read then goes to the private
+        // `InMemoryByteSource` over the re-encoded buffer, so the
+        // counting source sees only that first GET.
+        assert_eq!(
+            source.slice_calls.load(Ordering::SeqCst),
+            1,
+            "exactly one buffered GET against the legacy input",
+        );
+    }
+
+    #[tokio::test]
+    async fn test_multi_rg_consolidates_to_single_rg() {
+        // 3 RGs of 100 rows each: `rows_per_rg = 100` caps each row
+        // group, and `write_multi_rg_file` flushes between batches.
+        let batch_a = make_metrics_batch(100);
+        let batch_b = make_metrics_batch(100);
+        let batch_c = make_metrics_batch(100);
+        let arrow_schema = batch_a.schema();
+        let bytes = write_multi_rg_file(
+            &[batch_a, batch_b, batch_c],
+            Vec::new(),
+            default_sorting_cols(&arrow_schema),
+            100,
+        );
+
+        // Sanity: confirm fixture really has multiple RGs.
+        let pre_builder =
+            ParquetRecordBatchReaderBuilder::try_new(bytes.clone()).expect("pre-builder");
+        assert!(
+            pre_builder.metadata().num_row_groups() >= 3,
+            "fixture must produce at least 3 row groups; got {}",
+            pre_builder.metadata().num_row_groups(),
+        );
+        let pre_total: i64 = (0..pre_builder.metadata().num_row_groups())
+            .map(|i| pre_builder.metadata().row_group(i).num_rows())
+            .sum();
+        assert_eq!(pre_total, 300);
+
+        let source = CountingInMemorySource::new(bytes);
+        let mut adapter = LegacyInputAdapter::try_open(source, dummy_path(), 0)
+            .await
+            .expect("adapter open");
+
+        assert_eq!(
+            adapter.metadata().num_row_groups(),
+            1,
+            "adapter must consolidate to a single row group",
+        );
+        assert_eq!(
+            adapter.metadata().row_group(0).num_rows(),
+            300,
+            "row count must be preserved",
+        );
+
+        let pages = drain_pages_via_trait(&mut adapter).await;
+        for p in &pages {
+            assert_eq!(p.rg_idx, 0, "all pages must be in the consolidated RG");
+        }
+    }
+
+    #[tokio::test]
+    async fn test_data_roundtrip_through_adapter() {
+        let batch_a = make_metrics_batch(50);
+        let batch_b = make_metrics_batch(50);
+        let arrow_schema = batch_a.schema();
+        let bytes = write_multi_rg_file(
+            &[batch_a, batch_b],
+            Vec::new(),
+            default_sorting_cols(&arrow_schema),
+            50,
+        );
+
+        // Oracle: read the original file directly into a single batch.
+        let oracle = read_back_to_single_batch(bytes.clone());
+        assert_eq!(oracle.num_rows(), 100);
+
+        // Adapter run: open through the adapter and verify that its
+        // metadata exposes the expected schema and row count, then
+        // drain pages to drive the streaming path through the
+        // re-encoded buffer. The in-memory re-encoded buffer is
+        // private, so this test compares only row count and schema
+        // against the oracle read from the original file; value-level
+        // equality is covered by
+        // `test_reencode_preserves_arrays_byte_equal`.
+        let source = CountingInMemorySource::new(bytes);
+        let mut adapter = LegacyInputAdapter::try_open(source, dummy_path(), 0)
+            .await
+            .expect("adapter open");
+
+        assert_eq!(
+            adapter.metadata().row_group(0).num_rows() as usize,
+            oracle.num_rows(),
+        );
+        let adapter_schema = adapter.metadata().file_metadata().schema_descr();
+        assert_eq!(adapter_schema.num_columns(), oracle.num_columns());
+        for i in 0..adapter_schema.num_columns() {
+            assert_eq!(
+                adapter_schema.column(i).name(),
+                oracle.schema().field(i).name()
+            );
+        }
+
+        // Drain pages so we exercise the full streaming path.
+        let pages = drain_pages_via_trait(&mut adapter).await;
+        assert!(!pages.is_empty(), "non-empty input must yield pages");
+    }
+
+    #[tokio::test]
+    async fn test_kv_metadata_preserved() {
+        let kvs = vec![
+            KeyValue::new(
+                "qh.sort_fields".to_string(),
+                "metric_name asc, timestamp_secs desc".to_string(),
+            ),
+            KeyValue::new("qh.window_start_secs".to_string(), "1700000000".to_string()),
+        ];
+        let batch_a = make_metrics_batch(40);
+        let batch_b = make_metrics_batch(40);
+        let arrow_schema = batch_a.schema();
+        let bytes = write_multi_rg_file(
+            &[batch_a, batch_b],
+            kvs.clone(),
+            default_sorting_cols(&arrow_schema),
+            40,
+        );
+        let source = CountingInMemorySource::new(bytes);
+        let adapter = LegacyInputAdapter::try_open(source, dummy_path(), 0)
+            .await
+            .expect("adapter open");
+
+        let actual_kv = adapter
+            .metadata()
+            .file_metadata()
+            .key_value_metadata()
+            .cloned()
+            .unwrap_or_default();
+        for expected in &kvs {
+            let found = actual_kv
+                .iter()
+                .find(|kv| kv.key == expected.key)
+                .unwrap_or_else(|| panic!("missing kv key {:?}", expected.key));
+            assert_eq!(
+                found.value, expected.value,
+                "value mismatch for key {:?}",
+                expected.key,
+            );
+        }
+    }
+
+    #[tokio::test]
+    async fn test_sorting_columns_preserved() {
+        let batch_a = make_metrics_batch(30);
+        let batch_b = make_metrics_batch(30);
+        let arrow_schema = batch_a.schema();
+        let sorting_cols = default_sorting_cols(&arrow_schema);
+        let bytes = write_multi_rg_file(&[batch_a, batch_b], Vec::new(), sorting_cols.clone(), 30);
+        let source = CountingInMemorySource::new(bytes);
+        let adapter = LegacyInputAdapter::try_open(source, dummy_path(), 0)
+            .await
+            .expect("adapter open");
+
+        let actual = adapter
+            .metadata()
+            .row_group(0)
+            .sorting_columns()
+            .cloned()
+            .expect("sorting_columns must be carried through");
+        assert_eq!(actual, sorting_cols);
+    }
+
+    #[tokio::test]
+    async fn test_dict_and_null_columns_preserved() {
+        // The fixture's `service` column is a Dictionary with nulls
+        // every 5th row. Round-trip through the adapter must preserve
+        // both the data values and the null mask.
+        let batch_a = make_metrics_batch(60);
+        let batch_b = make_metrics_batch(60);
+        let arrow_schema = batch_a.schema();
+        let bytes = write_multi_rg_file(
+            &[batch_a.clone(), batch_b.clone()],
+            Vec::new(),
+            default_sorting_cols(&arrow_schema),
+            60,
+        );
+        let oracle = read_back_to_single_batch(bytes.clone());
+
+        let source = CountingInMemorySource::new(bytes);
+        let mut adapter = LegacyInputAdapter::try_open(source, dummy_path(), 0)
+            .await
+            .expect("adapter open");
+        // Drain to drive the streaming path.
+        let pages = drain_pages_via_trait(&mut adapter).await;
+        assert!(!pages.is_empty());
+
+        // Sanity-check the fixture via the oracle: `service` must be a
+        // nullable Dictionary column. (Dictionary columns stay
+        // dictionary-encoded physically; the writer config preserves
+        // dict encoding for Dictionary(_, _) fields by default.)
+        let oracle_schema = oracle.schema();
+        let service_idx = oracle_schema
+            .index_of("service")
+            .expect("service column present");
+        let service_field = oracle_schema.field(service_idx);
+        assert!(
+            matches!(service_field.data_type(), DataType::Dictionary(_, _)),
+            "oracle service field must be Dictionary",
+        );
+        assert!(service_field.is_nullable(), "service must be nullable");
+
+        // Confirm the adapter's sum of data-page rows equals the input
+        // row count for every column, including the nullable dict.
+        let mut rows_per_col: std::collections::BTreeMap<usize, i64> =
+            std::collections::BTreeMap::new();
+        for p in pages {
+            if matches!(
+                p.header.type_,
+                parquet::format::PageType::DATA_PAGE | parquet::format::PageType::DATA_PAGE_V2,
+            ) {
+                let n = p
+                    .header
+                    .data_page_header
+                    .as_ref()
+                    .map(|h| h.num_values as i64)
+                    .or_else(|| {
+                        p.header
+                            .data_page_header_v2
+                            .as_ref()
+                            .map(|h| h.num_values as i64)
+                    })
+                    .unwrap_or(0);
+                *rows_per_col.entry(p.col_idx).or_insert(0) += n;
+            }
+        }
+        let expected_rows = oracle.num_rows() as i64;
+        for col_idx in 0..oracle.num_columns() {
+            let actual = rows_per_col.get(&col_idx).copied().unwrap_or(0);
+            assert_eq!(
+                actual, expected_rows,
+                "col {col_idx}: data-page num_values sum",
+            );
+        }
+    }
+
+    #[tokio::test]
+    async fn test_io_failure_surfaces_as_io_error() {
+        let source: Arc<dyn RemoteByteSource> = Arc::new(AlwaysFailingSliceSource {
+            // Pretend the file is non-empty so the adapter actually
+            // attempts the buffered GET (a zero-sized file would short-
+            // circuit).
+            file_size: 4096,
+        });
+
+        match LegacyInputAdapter::try_open(source, dummy_path(), 0).await {
+            Err(LegacyAdapterError::Io(err)) => {
+                assert!(
+                    err.to_string().contains("simulated"),
+                    "expected the simulated I/O error to be propagated; got {err}",
+                );
+            }
+            Err(other) => panic!(
+                "expected LegacyAdapterError::Io carrying the original io::Error; got error \
+                 variant: {other}",
+            ),
+            Ok(_) => panic!("expected adapter open to fail when get_slice errors"),
+        }
+    }
+
+    /// Cell-equal data round trip through the re-encode helper.
+    ///
+    /// `test_data_roundtrip_through_adapter` checks row count + schema
+    /// names through the streaming path; that catches dropped rows but
+    /// not value-level corruption (e.g., a hypothetical dictionary key
+    /// XOR or column-value swap during the decode/concat/re-encode
+    /// chain). This test calls `reencode_prefix_aligned` directly
+    /// against a fixture with both nullable and dictionary-encoded
+    /// columns, reads the re-encoded bytes back via the standard
+    /// reader, and asserts each column equals the oracle cell-for-cell.
+    #[test]
+    fn test_reencode_preserves_arrays_byte_equal() {
+        // Three RGs (50 rows each) so the consolidator actually has
+        // multiple input batches to concatenate. The fixture exercises
+        // dict columns and nulls in `service`.
+        let batch_a = make_metrics_batch(50);
+        let batch_b = make_metrics_batch(50);
+        let batch_c = make_metrics_batch(50);
+        let arrow_schema = batch_a.schema();
+        let bytes = write_multi_rg_file(
+            &[batch_a, batch_b, batch_c],
+            Vec::new(),
+            default_sorting_cols(&arrow_schema),
+            50,
+        );
+        let oracle = read_back_to_single_batch(bytes.clone());
+
+        let reencoded = reencode_prefix_aligned(bytes, 0).expect("reencode helper");
+        let reencoded_batch = read_back_to_single_batch(Bytes::from(reencoded));
+
+        assert_eq!(reencoded_batch.num_rows(), oracle.num_rows());
+        assert_eq!(reencoded_batch.num_columns(), oracle.num_columns());
+
+        let oracle_schema = oracle.schema();
+        for col_idx in 0..oracle.num_columns() {
+            let oracle_col = oracle.column(col_idx);
+            let reencoded_col = reencoded_batch.column(col_idx);
+            assert_eq!(
+                oracle_col.as_ref(),
+                reencoded_col.as_ref(),
+                "column '{}' (index {col_idx}) differs after re-encode",
+                oracle_schema.field(col_idx).name(),
+            );
+        }
+    }
+
+    #[tokio::test]
+    async fn test_satisfies_column_page_stream_trait() {
+        let batch_a = make_metrics_batch(80);
+        let batch_b = make_metrics_batch(80);
+        let arrow_schema = batch_a.schema();
+        let bytes = write_multi_rg_file(
+            &[batch_a, batch_b],
+            Vec::new(),
+            default_sorting_cols(&arrow_schema),
+            80,
+        );
+        let source = CountingInMemorySource::new(bytes);
+        let adapter = LegacyInputAdapter::try_open(source, dummy_path(), 0)
+            .await
+            .expect("adapter open");
+
+        // Inspect metadata via trait dispatch.
+        let num_rgs_via_trait = {
+            let stream: &dyn ColumnPageStream = &adapter;
+            stream.metadata().num_row_groups()
+        };
+        assert_eq!(num_rgs_via_trait, 1);
+
+        // Drain via trait dispatch and confirm idempotent EOF.
+        let mut adapter = adapter;
+        let pages = drain_pages_via_trait(&mut adapter).await;
+        assert!(!pages.is_empty());
+        for p in &pages {
+            assert_eq!(p.rg_idx, 0);
+        }
+        for _ in 0..3 {
+            let stream: &mut dyn ColumnPageStream = &mut adapter;
+            assert!(stream.next_page().await.expect("idempotent EOF").is_none());
+        }
+    }
+
+    /// Real legacy files carry `qh.sort_fields` and are written sorted
+    /// by that sort schema. The adapter must split the consolidated
+    /// batch into one RG per first-sort-col value and stamp the
+    /// re-encoded file with `qh.rg_partition_prefix_len = 1` so the
+    /// merge engine reads it through the prefix-aware fast path. The
+    /// streaming engine's duplicate-prefix invariant verifies on read
+    /// that each RG's metric_name is unique within the file; this test
+    /// indirectly exercises that contract.
+    #[tokio::test]
+    async fn test_legacy_input_with_sort_fields_produces_prefix_aligned_multi_rg() {
+        let metrics = [
+            ("cpu.usage", 40usize),
+            ("memory.used", 40),
+            ("net.bytes", 40),
+        ];
+        // Force multi-RG layout in the input (rows_per_rg=30, smaller
+        // than any metric run) so the fixture proves the adapter
+        // collapses arbitrary input RG boundaries into prefix-aligned
+        // output RG boundaries.
+        let bytes =
+            write_sorted_multi_rg_legacy_file(&metrics, "metric_name|-timestamp_secs/V2", 30);
+        let pre_builder =
+            ParquetRecordBatchReaderBuilder::try_new(bytes.clone()).expect("pre-builder");
+        assert!(
+            pre_builder.metadata().num_row_groups() >= 2,
+            "fixture must produce multi-RG input; got {}",
+            pre_builder.metadata().num_row_groups(),
+        );
+
+        let source = CountingInMemorySource::new(bytes);
+        let adapter = LegacyInputAdapter::try_open(source, dummy_path(), 1)
+            .await
+            .expect("adapter open");
+
+        // Three distinct metric_names → three output RGs.
+        assert_eq!(
+            adapter.metadata().num_row_groups(),
+            3,
+            "adapter must emit one RG per distinct first-sort-col value",
+        );
+        let rg_rows: Vec<i64> = (0..adapter.metadata().num_row_groups())
+            .map(|i| adapter.metadata().row_group(i).num_rows())
+            .collect();
+        assert_eq!(rg_rows, vec![40, 40, 40], "row counts per RG");
+
+        // KV must advertise prefix_len = 1.
+        let kv = adapter
+            .metadata()
+            .file_metadata()
+            .key_value_metadata()
+            .expect("kv metadata");
+        let prefix_kv = kv
+            .iter()
+            .find(|k| k.key == PARQUET_META_RG_PARTITION_PREFIX_LEN)
+            .and_then(|k| k.value.clone());
+        assert_eq!(
+            prefix_kv.as_deref(),
+            Some("1"),
+            "re-encoded file must declare rg_partition_prefix_len=1",
+        );
+
+        // F9 chunk-level verification: the count + KV checks above
+        // would still pass if `compute_prefix_value_slices` had an
+        // off-by-one in its boundary detection. PA-1 + PA-3 on chunk
+        // statistics nail down that each RG's metric_name column is
+        // actually constant and no two RGs share a value.
+        crate::merge::streaming::region_grouping::assert_unique_rg_prefix_keys(
+            adapter.metadata(),
+            "metric_name|-timestamp_secs/V2",
+            1,
+            "test_legacy_input_with_sort_fields_produces_prefix_aligned_multi_rg adapter output",
+        )
+        .expect("adapter output must satisfy PA-1 + PA-3 on metric_name");
+    }
+
+    /// Single-metric legacy file: only one prefix value, so the
+    /// re-encoded file has exactly one RG (vacuously prefix-aligned).
+    /// The `qh.rg_partition_prefix_len = 1` KV is still set: the
+    /// reader's duplicate-prefix check has nothing to validate (one
+    /// RG can never violate the invariant), and the file looks
+    /// identical to a metric-aligned new-format file.
+    #[tokio::test]
+    async fn test_legacy_input_single_metric_yields_one_rg_with_prefix_kv() {
+        let metrics = [("cpu.usage", 90usize)];
+        let bytes =
+            write_sorted_multi_rg_legacy_file(&metrics, "metric_name|-timestamp_secs/V2", 30);
+        let source = CountingInMemorySource::new(bytes);
+        let adapter = LegacyInputAdapter::try_open(source, dummy_path(), 1)
+            .await
+            .expect("adapter open");
+
+        assert_eq!(adapter.metadata().num_row_groups(), 1);
+        assert_eq!(adapter.metadata().row_group(0).num_rows(), 90);
+
+        let prefix_kv = adapter
+            .metadata()
+            .file_metadata()
+            .key_value_metadata()
+            .and_then(|kvs| {
+                kvs.iter()
+                    .find(|k| k.key == PARQUET_META_RG_PARTITION_PREFIX_LEN)
+                    .and_then(|k| k.value.clone())
+            });
+        assert_eq!(prefix_kv.as_deref(), Some("1"));
+    }
+
+    /// `target_prefix_len = 0`: the adapter consolidates into a
+    /// single row group and does NOT stamp
+    /// `qh.rg_partition_prefix_len`, regardless of what the original
+    /// file's KV says. This is the "all-legacy merge with no non-
+    /// legacy peers" path — the merge engine's `prefix_len = 0`
+    /// sub-region splitting consumes it directly.
+    #[tokio::test]
+    async fn test_target_prefix_len_zero_passes_through_as_single_rg() {
+        // Even with a parseable sort_fields KV, target = 0 must not
+        // alter the layout or stamp the prefix KV.
+        let metrics = [("cpu.usage", 50usize), ("memory.used", 50)];
+        let bytes =
+            write_sorted_multi_rg_legacy_file(&metrics, "metric_name|-timestamp_secs/V2", 30);
+
+        let source = CountingInMemorySource::new(bytes);
+        let adapter = LegacyInputAdapter::try_open(source, dummy_path(), 0)
+            .await
+            .expect("adapter open");
+
+        assert_eq!(
+            adapter.metadata().num_row_groups(),
+            1,
+            "target_prefix_len = 0 must consolidate to a single RG",
+        );
+        let prefix_kv = adapter
+            .metadata()
+            .file_metadata()
+            .key_value_metadata()
+            .and_then(|kvs| {
+                kvs.iter()
+                    .find(|k| k.key == PARQUET_META_RG_PARTITION_PREFIX_LEN)
+                    .and_then(|k| k.value.clone())
+            });
+        assert!(
+            prefix_kv.is_none(),
+            "target_prefix_len = 0 must not stamp a prefix_len value; got {prefix_kv:?}",
+        );
+    }
+
+    /// Regression for Codex P2 on PR #6425: when the *input* file
+    /// already carries a stale nonzero `qh.rg_partition_prefix_len`
+    /// (e.g., it was produced by a prefix-aware writer and is now
+    /// being re-encoded through the legacy fallback path), passing
+    /// `target_prefix_len = 0` must STRIP that KV from the re-
+    /// encoded output. Without the strip, the consolidated single-RG
+    /// file would still advertise the stale prefix claim and
+    /// downstream metadata extraction would take the prefix-aware
+    /// path against a single RG that carries multiple first-prefix
+    /// values — failing PA-1 min/max on read.
+    #[tokio::test]
+    async fn test_target_prefix_len_zero_strips_stale_prefix_kv_from_input() {
+        // Build a 2-RG input with sort_fields AND a stale prefix KV
+        // (value "1"), simulating a prefix-aware input being demoted
+        // to the legacy single-RG path.
+        let metrics = [("cpu.usage", 30usize), ("memory.used", 30)];
+        let dict_type = DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8));
+        let schema = Arc::new(ArrowSchema::new(vec![
+            Field::new("metric_name", dict_type.clone(), false),
+            Field::new("metric_type", DataType::UInt8, false),
+            Field::new("timestamp_secs", DataType::UInt64, false),
+            Field::new("value", DataType::Float64, false),
+            Field::new("timeseries_id", DataType::Int64, false),
+            Field::new("service", dict_type, true),
+        ]));
+        let total: usize = metrics.iter().map(|(_, n)| *n).sum();
+        let mut metric_keys: Vec<i32> = Vec::with_capacity(total);
+        let mut tsids: Vec<i64> = Vec::with_capacity(total);
+        let mut timestamps: Vec<u64> = Vec::with_capacity(total);
+        let mut values: Vec<f64> = Vec::with_capacity(total);
+        let mut row_idx: u64 = 0;
+        for (metric_idx, (_, count)) in metrics.iter().enumerate() {
+            for _ in 0..*count {
+                metric_keys.push(metric_idx as i32);
+                tsids.push(1000 + row_idx as i64);
+                timestamps.push(1_700_000_000 + (*count as u64) - (row_idx % *count as u64));
+                values.push(row_idx as f64);
+                row_idx += 1;
+            }
+        }
+        let metric_name: ArrayRef = Arc::new(
+            DictionaryArray::<Int32Type>::try_new(
+                arrow::array::Int32Array::from(metric_keys),
+                Arc::new(StringArray::from(
+                    metrics.iter().map(|(n, _)| *n).collect::<Vec<_>>(),
+                )),
+            )
+            .expect("metric dict"),
+        );
+        let metric_type: ArrayRef = Arc::new(UInt8Array::from(vec![0u8; total]));
+        let timestamp_secs: ArrayRef = Arc::new(UInt64Array::from(timestamps));
+        let value: ArrayRef = Arc::new(Float64Array::from(values));
+        let timeseries_id: ArrayRef = Arc::new(Int64Array::from(tsids));
+        let svc_keys: Vec<Option<i32>> = (0..total as i32).map(|i| Some(i % 3)).collect();
+        let service: ArrayRef = Arc::new(
+            DictionaryArray::<Int32Type>::try_new(
+                arrow::array::Int32Array::from(svc_keys),
+                Arc::new(StringArray::from(vec!["api", "db", "cache"])),
+            )
+            .expect("svc dict"),
+        );
+        let batch = RecordBatch::try_new(
+            schema.clone(),
+            vec![
+                metric_name,
+                metric_type,
+                timestamp_secs,
+                value,
+                timeseries_id,
+                service,
+            ],
+        )
+        .expect("batch");
+
+        let kvs = vec![
+            KeyValue::new(
+                PARQUET_META_SORT_FIELDS.to_string(),
+                "metric_name|-timestamp_secs/V2".to_string(),
+            ),
+            // The stale claim we want stripped.
+            KeyValue::new(
+                PARQUET_META_RG_PARTITION_PREFIX_LEN.to_string(),
+                "1".to_string(),
+            ),
+        ];
+        let bytes = write_multi_rg_file(&[batch], kvs, default_sorting_cols(&schema), 30);
+
+        let source = CountingInMemorySource::new(bytes);
+        let adapter = LegacyInputAdapter::try_open(source, dummy_path(), 0)
+            .await
+            .expect("adapter open with target_prefix_len = 0");
+
+        assert_eq!(
+            adapter.metadata().num_row_groups(),
+            1,
+            "target_prefix_len = 0 must consolidate to a single RG",
+        );
+        let prefix_kv = adapter
+            .metadata()
+            .file_metadata()
+            .key_value_metadata()
+            .and_then(|kvs| {
+                kvs.iter()
+                    .find(|k| k.key == PARQUET_META_RG_PARTITION_PREFIX_LEN)
+                    .and_then(|k| k.value.clone())
+            });
+        assert!(
+            prefix_kv.is_none(),
+            "stale `rg_partition_prefix_len = 1` from the input MUST be stripped when caller asks \
+             for target_prefix_len = 0; got {prefix_kv:?}",
+        );
+    }
+
+    /// `target_prefix_len > 0` on a file with no `qh.sort_fields` KV
+    /// must surface a `PrefixUnresolvable` error rather than silently
+    /// fall through. The caller decides whether to retry at a smaller
+    /// `target_prefix_len` or with `0`.
+    #[tokio::test]
+    async fn test_target_prefix_len_one_without_sort_fields_returns_unresolvable() {
+        // No sort_fields KV → cannot resolve any prefix column.
+        let batch_a = make_metrics_batch(40);
+        let arrow_schema = batch_a.schema();
+        let bytes = write_multi_rg_file(
+            &[batch_a],
+            Vec::new(),
+            default_sorting_cols(&arrow_schema),
+            40,
+        );
+
+        let source = CountingInMemorySource::new(bytes);
+        let result = LegacyInputAdapter::try_open(source, dummy_path(), 1).await;
+        let Err(err) = result else {
+            panic!("missing sort_fields must surface as PrefixUnresolvable, got Ok(...)");
+        };
+        match err {
+            LegacyAdapterError::PrefixUnresolvable { target, reason } => {
+                assert_eq!(target, 1);
+                assert!(
+                    reason.contains("sort_fields"),
+                    "reason should mention sort_fields KV; got: {reason}",
+                );
+            }
+            other => panic!("expected PrefixUnresolvable, got: {other}"),
+        }
+    }
+
+    /// `target_prefix_len > declared sort cols` must also bail with
+    /// `PrefixUnresolvable`. Confirms the caller-driven negotiation
+    /// contract: the adapter never silently advertises a smaller
+    /// alignment than asked for.
+    #[tokio::test]
+    async fn test_target_prefix_len_exceeds_declared_sort_cols_returns_unresolvable() {
+        // Two-col sort schema → ask for prefix_len = 3 → bail.
+        let metrics = [("cpu.usage", 30usize), ("memory.used", 30)];
+        let bytes =
+            write_sorted_multi_rg_legacy_file(&metrics, "metric_name|-timestamp_secs/V2", 30);
+
+        let source = CountingInMemorySource::new(bytes);
+        let result = LegacyInputAdapter::try_open(source, dummy_path(), 3).await;
+        let Err(err) = result else {
+            panic!("prefix_len exceeding declared sort cols must surface as PrefixUnresolvable");
+        };
+        match err {
+            LegacyAdapterError::PrefixUnresolvable { target, reason } => {
+                assert_eq!(target, 3);
+                assert!(
+                    reason.contains("declares only"),
+                    "reason should mention sort col count; got: {reason}",
+                );
+            }
+            other => panic!("expected PrefixUnresolvable, got: {other}"),
+        }
+    }
+
+    /// Composite prefix (`target_prefix_len = 2`): each output RG
+    /// carries a unique `(metric_name, service)` tuple. Exercises the
+    /// N > 1 path of `compute_prefix_value_slices` and confirms the
+    /// stamped KV reflects the caller's request.
+    #[tokio::test]
+    async fn test_target_prefix_len_two_splits_by_metric_and_service() {
+        // 4 (metric, service) groups × 20 rows; sorted ascending by
+        // (metric_name, service). Multi-RG input layout (rows_per_rg=25)
+        // forces consolidation before re-splitting.
+        let groups = [
+            ("cpu.usage", "api", 20usize),
+            ("cpu.usage", "db", 20),
+            ("memory.used", "api", 20),
+            ("memory.used", "cache", 20),
+        ];
+        let bytes = write_sorted_composite_legacy_file(
+            &groups,
+            "metric_name|service|-timestamp_secs/V2",
+            25,
+        );
+
+        let source = CountingInMemorySource::new(bytes);
+        let adapter = LegacyInputAdapter::try_open(source, dummy_path(), 2)
+            .await
+            .expect("adapter open with prefix_len=2");
+
+        // 4 distinct (metric_name, service) tuples → 4 output RGs.
+        assert_eq!(
+            adapter.metadata().num_row_groups(),
+            4,
+            "composite prefix must split at (metric, service) transitions",
+        );
+        let rg_rows: Vec<i64> = (0..adapter.metadata().num_row_groups())
+            .map(|i| adapter.metadata().row_group(i).num_rows())
+            .collect();
+        assert_eq!(rg_rows, vec![20, 20, 20, 20]);
+
+        let prefix_kv = adapter
+            .metadata()
+            .file_metadata()
+            .key_value_metadata()
+            .and_then(|kvs| {
+                kvs.iter()
+                    .find(|k| k.key == PARQUET_META_RG_PARTITION_PREFIX_LEN)
+                    .and_then(|k| k.value.clone())
+            });
+        assert_eq!(
+            prefix_kv.as_deref(),
+            Some("2"),
+            "stamped prefix_len must match caller's request",
+        );
+
+        // F9 chunk-level verification: a `compute_prefix_value_slices`
+        // bug splitting on only the first prefix col (or off by one)
+        // would still yield 4 RGs of [20,20,20,20] but with the wrong
+        // CONTENTS. PA-1 + PA-3 on the composite (metric, service)
+        // key verify content alignment directly.
+        crate::merge::streaming::region_grouping::assert_unique_rg_prefix_keys(
+            adapter.metadata(),
+            "metric_name|service|-timestamp_secs/V2",
+            2,
+            "test_target_prefix_len_two_splits_by_metric_and_service adapter output",
+        )
+        .expect("composite prefix output must satisfy PA-1 + PA-3");
+    }
+
+    /// SS-3: a sort column named in `qh.sort_fields` but missing from
+    /// the arrow schema is treated as implicitly null at every row.
+    /// Null at every row is constant, so the column trivially
+    /// satisfies any prefix-alignment claim involving it — the
+    /// adapter must NOT bail with `PrefixUnresolvable` in this case.
+    /// Transitions are driven purely by the columns that ARE present.
+    ///
+    /// Fixture: sort_fields declares `metric_name|env|-timestamp_secs`
+    /// but the schema doesn't contain `env`. With prefix_len = 2 the
+    /// adapter must succeed, split only at `metric_name` transitions
+    /// (the null `env` contributes no transitions), and stamp
+    /// `prefix_len = 2` on the output.
+    #[tokio::test]
+    async fn test_missing_prefix_col_treated_as_null_satisfies_alignment() {
+        let metrics = [("cpu.usage", 30usize), ("memory.used", 30)];
+        // Sort schema declares 3 cols; the fixture schema has
+        // metric_name and timestamp_secs but NO `env` column. Per
+        // SS-3 the merge engine treats `env` as null at every row.
+        let bytes =
+            write_sorted_multi_rg_legacy_file(&metrics, "metric_name|env|-timestamp_secs/V2", 30);
+
+        let source = CountingInMemorySource::new(bytes);
+        let adapter = LegacyInputAdapter::try_open(source, dummy_path(), 2)
+            .await
+            .expect("missing-col-as-null must satisfy alignment without erroring");
+
+        // Two metrics × constant null `env` → two output RGs (one per metric).
+        assert_eq!(
+            adapter.metadata().num_row_groups(),
+            2,
+            "missing prefix col (treated as null) contributes no transitions; only metric_name \
+             transitions split RGs",
+        );
+        let rg_rows: Vec<i64> = (0..adapter.metadata().num_row_groups())
+            .map(|i| adapter.metadata().row_group(i).num_rows())
+            .collect();
+        assert_eq!(rg_rows, vec![30, 30]);
+
+        let prefix_kv = adapter
+            .metadata()
+            .file_metadata()
+            .key_value_metadata()
+            .and_then(|kvs| {
+                kvs.iter()
+                    .find(|k| k.key == PARQUET_META_RG_PARTITION_PREFIX_LEN)
+                    .and_then(|k| k.value.clone())
+            });
+        assert_eq!(
+            prefix_kv.as_deref(),
+            Some("2"),
+            "stamped prefix_len must match caller's request even when one col is implicitly null",
+        );
+
+        // SS-3 consumer-side verification: the file the adapter just
+        // produced must be consumable by the merge engine's
+        // `extract_rg_composite_prefix_key` reader. With `env` absent
+        // from the parquet schema, `find_prefix_parquet_col_indices`
+        // returns `None` in that slot and the composite-key extractor
+        // skips it. PA-1 + PA-3 still hold: each RG's metric_name
+        // min == max, and skipping the constant-null `env` slot makes
+        // the RG composite keys differ only by metric_name.
+        crate::merge::streaming::region_grouping::assert_unique_rg_prefix_keys(
+            adapter.metadata(),
+            "metric_name|env|-timestamp_secs/V2",
+            2,
+            "test_missing_prefix_col_treated_as_null_satisfies_alignment adapter output",
+        )
+        .expect("SS-3 null col must satisfy PA-1 + PA-3 (null is constant across all RGs)");
+    }
+
+    /// F8 regression: an unsorted legacy input (rows
+    /// `[A,A,B,B,A,A]` on `metric_name`) violates SS-1. Walking
+    /// row-by-row to find prefix transitions would emit three slices —
+    /// `A`, `B`, `A` — and synthesize a file with two RGs sharing the
+    /// prefix value `A`, violating PA-3. The downstream streaming
+    /// merge engine would catch this later, but only once the bad
+    /// file had been built and possibly archived. The adapter must
+    /// bail upfront with `InputNotSorted` so no PA-3-violating file
+    /// ever lands on disk.
+    #[tokio::test]
+    async fn test_unsorted_legacy_input_rejected_by_adapter() {
+        // metric_name in row order: cpu.usage, memory.used, cpu.usage.
+        // That's an SS-1 violation under sort schema `metric_name ASC`.
+        let bad_metrics = [
+            ("cpu.usage", 20usize),
+            ("memory.used", 20),
+            ("cpu.usage", 20),
+        ];
+        let bytes =
+            write_sorted_multi_rg_legacy_file(&bad_metrics, "metric_name|-timestamp_secs/V2", 20);
+
+        let source = CountingInMemorySource::new(bytes);
+        let result = LegacyInputAdapter::try_open(source, dummy_path(), 1).await;
+        let Err(err) = result else {
+            panic!(
+                "unsorted legacy input must surface as InputNotSorted, got Ok(...) — the adapter \
+                 would have written a PA-3-violating file"
+            );
+        };
+        match err {
+            LegacyAdapterError::InputNotSorted {
+                target,
+                first_offset,
+                second_offset,
+            } => {
+                assert_eq!(target, 1);
+                // First `cpu.usage` run is at offset 0; second is at
+                // offset 40 (after the 20-row `cpu.usage` then 20-row
+                // `memory.used` runs).
+                assert_eq!(
+                    first_offset, 0,
+                    "first duplicate prefix offset should point at the first cpu.usage run",
+                );
+                assert_eq!(
+                    second_offset, 40,
+                    "second duplicate prefix offset should point at the second cpu.usage run",
+                );
+            }
+            other => panic!("expected InputNotSorted, got: {other}"),
+        }
+    }
+
+    /// Composite-prefix fixture: rows grouped by `(metric, service)`
+    /// in the order supplied. Used by the prefix_len=2 test to verify
+    /// transitions on the second prefix column trigger RG splits.
+    fn write_sorted_composite_legacy_file(
+        groups: &[(&str, &str, usize)],
+        sort_fields_value: &str,
+        rows_per_rg: usize,
+    ) -> Bytes {
+        let dict_type = DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8));
+        let schema = Arc::new(ArrowSchema::new(vec![
+            Field::new("metric_name", dict_type.clone(), false),
+            Field::new("metric_type", DataType::UInt8, false),
+            Field::new("timestamp_secs", DataType::UInt64, false),
+            Field::new("value", DataType::Float64, false),
+            Field::new("timeseries_id", DataType::Int64, false),
+            Field::new("service", dict_type, true),
+        ]));
+
+        // Build per-group dictionary index tables for metric_name and
+        // service. Map each distinct value to its key.
+        let mut metric_names_vec: Vec<&str> = Vec::new();
+        let mut service_values_vec: Vec<&str> = Vec::new();
+        for (metric, service, _) in groups {
+            if !metric_names_vec.contains(metric) {
+                metric_names_vec.push(metric);
+            }
+            if !service_values_vec.contains(service) {
+                service_values_vec.push(service);
+            }
+        }
+
+        let total: usize = groups.iter().map(|(_, _, n)| *n).sum();
+        let mut metric_keys: Vec<i32> = Vec::with_capacity(total);
+        let mut svc_keys: Vec<Option<i32>> = Vec::with_capacity(total);
+        let mut tsids: Vec<i64> = Vec::with_capacity(total);
+        let mut timestamps: Vec<u64> = Vec::with_capacity(total);
+        let mut values: Vec<f64> = Vec::with_capacity(total);
+        let mut row_idx: u64 = 0;
+        for (metric, service, count) in groups {
+            let metric_key = metric_names_vec
+                .iter()
+                .position(|m| m == metric)
+                .expect("metric known") as i32;
+            let svc_key = service_values_vec
+                .iter()
+                .position(|s| s == service)
+                .expect("service known") as i32;
+            for _ in 0..*count {
+                metric_keys.push(metric_key);
+                svc_keys.push(Some(svc_key));
+                tsids.push(1000 + row_idx as i64);
+                timestamps.push(1_700_000_000 + (*count as u64) - (row_idx % *count as u64));
+                values.push(row_idx as f64);
+                row_idx += 1;
+            }
+        }
+
+        let metric_name: ArrayRef = Arc::new(
+            DictionaryArray::<Int32Type>::try_new(
+                arrow::array::Int32Array::from(metric_keys),
+                Arc::new(StringArray::from(metric_names_vec)),
+            )
+            .expect("metric dict"),
+        );
+        let service: ArrayRef = Arc::new(
+            DictionaryArray::<Int32Type>::try_new(
+                arrow::array::Int32Array::from(svc_keys),
+                Arc::new(StringArray::from(service_values_vec)),
+            )
+            .expect("svc dict"),
+        );
+        let metric_type: ArrayRef = Arc::new(UInt8Array::from(vec![0u8; total]));
+        let timestamp_secs: ArrayRef = Arc::new(UInt64Array::from(timestamps));
+        let value: ArrayRef = Arc::new(Float64Array::from(values));
+        let timeseries_id: ArrayRef = Arc::new(Int64Array::from(tsids));
+
+        let batch = RecordBatch::try_new(
+            schema.clone(),
+            vec![
+                metric_name,
+                metric_type,
+                timestamp_secs,
+                value,
+                timeseries_id,
+                service,
+            ],
+        )
+        .expect("composite fixture batch");
+
+        let kvs = vec![KeyValue::new(
+            PARQUET_META_SORT_FIELDS.to_string(),
+            sort_fields_value.to_string(),
+        )];
+        let sorting_cols = default_sorting_cols(&schema);
+        write_multi_rg_file(&[batch], kvs, sorting_cols, rows_per_rg)
+    }
+}
diff --git a/quickwit/quickwit-parquet-engine/src/storage/mod.rs b/quickwit/quickwit-parquet-engine/src/storage/mod.rs
index 43fe5861bca..aa173b92ea4 100644
--- a/quickwit/quickwit-parquet-engine/src/storage/mod.rs
+++ b/quickwit/quickwit-parquet-engine/src/storage/mod.rs
@@ -16,6 +16,8 @@
 
 mod config;
 pub mod inspect;
+pub(crate) mod legacy_adapter;
+pub(crate) mod page_decoder;
 pub(crate) mod split_writer;
 pub(crate) mod streaming_reader;
 pub(crate) mod streaming_writer;
@@ -26,7 +28,13 @@ pub use inspect::{
     ColumnReport, PageReport, ParquetPageStatsReport, RowGroupReport, inspect_parquet_page_stats,
     verify_partition_prefix,
 };
+pub use legacy_adapter::{LegacyAdapterError, LegacyInputAdapter};
+pub use page_decoder::{PageDecodeError, StreamDecoder};
 pub use split_writer::ParquetSplitWriter;
+pub use streaming_reader::{
+    ColumnPageStream, Page, ParquetReadError, RemoteByteSource, StreamingParquetReader,
+    StreamingReaderConfig,
+};
 // Re-export metadata constants for use by the merge module and tests.
 pub(crate) use writer::{
     PARQUET_META_NUM_MERGE_OPS, PARQUET_META_RG_PARTITION_PREFIX_LEN, PARQUET_META_ROW_KEYS,
diff --git a/quickwit/quickwit-parquet-engine/src/storage/page_decoder.rs b/quickwit/quickwit-parquet-engine/src/storage/page_decoder.rs
new file mode 100644
index 00000000000..e8c16a93d9d
--- /dev/null
+++ b/quickwit/quickwit-parquet-engine/src/storage/page_decoder.rs
@@ -0,0 +1,2183 @@
+// Copyright 2021-Present Datadog, Inc.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+//! Page-stream → Arrow `ArrayRef` decoder, one input page at a time.
+//!
+//! Each [`StreamDecoder::decode_next_page`] call pulls one [`Page`] from
+//! the underlying [`ColumnPageStream`] and (for data pages) emits a
+//! [`DecodedPage`] carrying the Arrow array for that page's rows. Memory
+//! is bounded by:
+//!
+//! - **One in-flight page** (compressed + decompressed bytes during the current decode).
+//! - **One cached dictionary page** per (rg, col) when the column is dictionary-encoded — needed to
+//!   decode subsequent data pages that reference it. Dict pages are typically small relative to
+//!   data.
+//! - **One [`ColumnReader`] per (rg, col)** holding small internal bookkeeping (level decoders,
+//!   value decoder). The reader holds the current page during decode; we feed pages one at a time,
+//!   so it never holds more than one data page at a time.
+//!
+//! The decoder does **not** buffer a row group, a column chunk, or any
+//! materialised array beyond the one currently being emitted. PR-6b's
+//! merge engine takes the emitted [`DecodedPage`]s in storage order
+//! (row-group-major, column-major-within-rg, page-major-within-col),
+//! consults sort columns to compute the local merge order for each RG,
+//! and streams take-applied output pages directly into the writer.
+//!
+//! # How it works
+//!
+//! 1. Pull one [`Page`] from the stream. Skip `INDEX_PAGE` (never emitted by production writers;
+//!    the variant exists in the Thrift schema for completeness).
+//! 2. Look up (or initialise) per-(rg, col) state: a `PageQueue` that feeds a parquet-rs
+//!    [`ColumnReader`] one page at a time, plus a counter tracking how many rows of this column
+//!    we've decoded.
+//! 3. Convert our [`Page`] (raw compressed bytes + parsed Thrift header) to parquet-rs's
+//!    [`column::page::Page`] enum: decompress with [`create_codec`], translate `format::Encoding`
+//!    (Thrift wrapper) to `basic::Encoding`, drop optional statistics (not needed for decoding
+//!    values).
+//! 4. Push the converted page onto the queue. If it was a dictionary page, loop back to step 1:
+//!    dictionary pages don't yield rows (index pages were already skipped in step 1).
+//! 5. For a data page: ask the [`ColumnReader`] to decode exactly `header.num_values` records. The
+//!    reader pulls the queued data page (plus the cached dict if not yet consumed), decodes values
+//!    + def/rep levels into typed buffers, and returns the count.
+//! 6. Build an `ArrayRef` from `(values, def_levels, rep_levels)` per the column's parquet physical
+//!    type. Emit [`DecodedPage`].
+//!
+//! [`ColumnPageStream`]: super::streaming_reader::ColumnPageStream
+//! [`Page`]: super::streaming_reader::Page
+//! [`ColumnReader`]: parquet::column::reader::ColumnReader
+//! [`column::page::Page`]: parquet::column::page::Page
+//! [`create_codec`]: parquet::compression::create_codec
+
+#![allow(dead_code)]
+#![allow(deprecated)]
+
+use std::collections::{HashMap, VecDeque};
+use std::sync::{Arc, Mutex};
+
+use arrow::array::{
+    Array, ArrayRef, BinaryArray, BooleanArray, Float32Array, Float64Array, Int16Array, Int32Array,
+    Int64Array, LargeBinaryArray, LargeListArray, LargeStringArray, ListArray, PrimitiveArray,
+    StringArray, UInt8Array, UInt16Array, UInt32Array, UInt64Array,
+};
+use arrow::buffer::{NullBuffer, OffsetBuffer, ScalarBuffer};
+use arrow::datatypes::{
+    ArrowPrimitiveType, DataType, Field, Float32Type, Float64Type, Int8Type, Int16Type,
+    Int32Type as ArrowInt32Type, Int64Type as ArrowInt64Type, UInt8Type, UInt16Type, UInt32Type,
+    UInt64Type,
+};
+use bytes::Bytes;
+use parquet::basic::{Encoding as BasicEncoding, Type as PhysicalType};
+use parquet::column::page::Page as ColumnPage;
+use parquet::column::reader::{ColumnReader, get_column_reader};
+use parquet::compression::{Codec, CodecOptions, create_codec};
+use parquet::data_type::{
+    BoolType, ByteArray, ByteArrayType, DataType as ParquetDataType, DoubleType, FloatType,
+    Int32Type, Int64Type,
+};
+use parquet::errors::ParquetError;
+use parquet::file::metadata::ParquetMetaData;
+use parquet::format::{PageHeader, PageType};
+use parquet::schema::types::ColumnDescPtr;
+use thiserror::Error;
+
+use super::streaming_reader::{ColumnPageStream, Page, ParquetReadError};
+
+/// Errors from [`StreamDecoder`].
+///
+/// Variants that wrap an underlying error carry it via `#[from]`, so
+/// I/O failures from the page stream surface as
+/// [`PageDecodeError::PageStream`] rather than being masked as decode
+/// errors. [`ParquetReadError`] already enforces the same policy.
+#[derive(Error, Debug)]
+pub enum PageDecodeError {
+    /// The underlying [`ColumnPageStream`] returned an error.
+    #[error("page stream error: {0}")]
+    PageStream(#[from] ParquetReadError),
+
+    /// Parquet decode error from the column reader or page decompression.
+    #[error("parquet decode error: {0}")]
+    Parquet(#[from] ParquetError),
+
+    /// Arrow array-construction error.
+    #[error("arrow build error: {0}")]
+    Arrow(#[from] arrow::error::ArrowError),
+
+    /// A page header carried an `Encoding` value we don't understand
+    /// (e.g., one the parquet spec didn't define when this code was
+    /// written).
+    #[error("unsupported encoding value {encoding} on page at ({rg_idx}, {col_idx})")]
+    UnsupportedEncoding {
+        rg_idx: usize,
+        col_idx: usize,
+        encoding: i32,
+    },
+
+    /// A column's parquet physical type pairs with an Arrow type we
+    /// don't construct from raw values (e.g., decimals, FIXED_LEN_BYTE_ARRAY
+    /// outside the supported set).
+    #[error(
+        "field '{name}' has unsupported parquet physical type / arrow type pairing: \
+         physical={physical:?}, arrow={arrow:?}"
+    )]
+    UnsupportedColumnType {
+        name: String,
+        physical: PhysicalType,
+        arrow: DataType,
+    },
+}
+
+/// One decoded data page yielded by [`StreamDecoder::decode_next_page`].
+#[derive(Debug)]
+pub struct DecodedPage {
+    /// Row group this page belongs to.
+    pub rg_idx: usize,
+    /// Column chunk this page belongs to (within the row group).
+    pub col_idx: usize,
+    /// Index of this data page within its column chunk (0-based,
+    /// counting data pages only — dictionary pages do not increment).
+    pub page_idx_in_col: usize,
+    /// Cumulative row offset for `(rg_idx, col_idx)` *before* this
+    /// page. Together with `array.len()` this gives the row range
+    /// `row_start..row_start + array.len()` that this page covers,
+    /// which PR-6b's merge engine uses to slice take indices per page.
+    pub row_start: usize,
+    /// Decoded Arrow array. Length equals the number of records this
+    /// page contributes (i.e. `header.num_values` for the data page).
+    pub array: ArrayRef,
+}
+
+/// Drains a [`ColumnPageStream`] one *page* at a time and emits Arrow
+/// arrays. Caller drives via [`Self::decode_next_page`] until it returns
+/// `Ok(None)` (idempotent EOF).
+///
+/// Memory is bounded by ~one in-flight page per decoder, plus one
+/// cached dictionary page per (rg, col) for dictionary-encoded columns.
+/// Does not buffer the row group.
+///
+/// The decoder accepts the underlying stream either by mutable borrow
+/// (via [`Self::new`]) or by ownership (via [`Self::from_owned`]).
+/// Use the owned form when the decoder must persist across multiple
+/// callers — its per-(rg, col) state (`rows_decoded`, cached pages,
+/// `ColumnReader` instances) MUST live as long as the column is being
+/// consumed; recreating a decoder mid-column discards the
+/// `rows_decoded` counter (so subsequent pages report `row_start = 0`)
+/// and the cached dictionary page (so subsequent data pages can't be
+/// decoded). See the streaming merge engine for an example that holds
+/// the decoder across phases 0 → 3.
+///
+/// # Single-consumer invariant
+///
+/// [`Self::decode_next_page`] takes `&mut self`, so only one task ever
+/// observes or mutates `self.columns`. The per-(rg, col) queue is
+/// wrapped in `Arc<Mutex<_>>` (see `ColumnState::queue`) *not* to
+/// guard against concurrent consumers, but because parquet-rs's
+/// [`PageReader`] trait requires `Send` + interior mutability so the
+/// queue handle can live both here (we push) and inside the boxed
+/// `dyn PageReader` the [`ColumnReader`] consumes from (it pops). All
+/// pushes (`route_page_to_queue`) happen synchronously from
+/// `decode_next_page`; all pops happen synchronously from inside
+/// `decode_one_data_page_into_array` driven by the same call.
+/// Consequently every lock/unlock/lock sequence within the decoder
+/// operates against a queue that no other thread is touching, and
+/// captured fields (e.g. "front data page's `num_values`") remain valid
+/// across the gap.
+///
+/// [`PageReader`]: parquet::column::page::PageReader
+pub struct StreamDecoder<'a> {
+    stream: StreamSource<'a>,
+    metadata: Arc<ParquetMetaData>,
+    columns: HashMap<(usize, usize), ColumnState>,
+    eof: bool,
+}
+
+/// Holds the underlying [`ColumnPageStream`] either by mutable borrow
+/// or by ownership. The borrowed form is the original short-lived
+/// pattern used by unit tests; the owned form lets a [`StreamDecoder`]
+/// outlive its construction scope (required to preserve per-column
+/// state across multi-page / multi-consumer passes).
+enum StreamSource<'a> {
+    Borrowed(&'a mut dyn ColumnPageStream),
+    Owned(Box<dyn ColumnPageStream>),
+}
+
+impl StreamSource<'_> {
+    fn as_mut(&mut self) -> &mut dyn ColumnPageStream {
+        match self {
+            StreamSource::Borrowed(s) => *s,
+            StreamSource::Owned(b) => b.as_mut(),
+        }
+    }
+
+    fn metadata(&self) -> &Arc<ParquetMetaData> {
+        match self {
+            StreamSource::Borrowed(s) => s.metadata(),
+            StreamSource::Owned(b) => b.metadata(),
+        }
+    }
+}
+
+/// Per-(rg, col) state. Holds the [`ColumnReader`] that owns the
+/// page-decoder pipeline, plus a handle to the `PageQueue` we push
+/// converted pages into. The same `Arc<Mutex<...>>` queue lives both
+/// here (so we can push) and inside the `Box<dyn PageReader>` the
+/// `ColumnReader` consumes from (so it can pop).
+struct ColumnState {
+    queue: Arc<Mutex<VecDeque<ColumnPage>>>,
+    reader: ColumnReader,
+    rows_decoded: usize,
+    next_data_page_idx: usize,
+    field: Arc<Field>,
+}
+
+impl<'a> StreamDecoder<'a> {
+    pub fn new(stream: &'a mut dyn ColumnPageStream) -> Self {
+        let metadata = Arc::clone(stream.metadata());
+        Self {
+            stream: StreamSource::Borrowed(stream),
+            metadata,
+            columns: HashMap::new(),
+            eof: false,
+        }
+    }
+
+    /// Pull and decode the next data page in storage order. Dictionary
+    /// pages are absorbed silently (fed to the column reader for use by
+    /// subsequent data pages). Returns `Ok(None)` at EOF.
+    ///
+    /// For repeated columns (see `decode_head`), maintains a **one-page
+    /// lookahead**: after the data page is queued, one more page is pulled
+    /// from the stream and routed to its queue *before* `read_records`
+    /// runs. This makes `PageQueueReader::peek_next_page` return
+    /// accurate next-page metadata when parquet-rs's column reader
+    /// calls `at_record_boundary()` — required for V1 data pages with
+    /// repetition levels, where a list record can continue onto the
+    /// next page and parquet-rs needs the next page's metadata to
+    /// decide whether to flush partial rep-level state. Without the
+    /// lookahead, peek returns `None` at every page end and parquet-rs
+    /// treats it as the last page, which can split a list incorrectly.
+    pub async fn decode_next_page(&mut self) -> Result<Option<DecodedPage>, PageDecodeError> {
+        loop {
+            // Prefer a state whose queue already has an unconsumed data
+            // page (left over from a previous call's lookahead, or
+            // queued during the loop below). At most one state has
+            // an unconsumed queued data page at any time, since each
+            // call pre-fetches at most one page.
+            if let Some((key, num_values_in_page)) = self.next_decodable_head() {
+                return self.decode_head(key, num_values_in_page).await;
+            }
+
+            if self.eof {
+                return Ok(None);
+            }
+
+            match self.stream.as_mut().next_page().await? {
+                Some(page) => self.route_page_to_queue(page)?,
+                None => {
+                    self.eof = true;
+                    // Loop once more to flush any state that may have
+                    // a queued data page from a prior call's lookahead.
+                    if self.next_decodable_head().is_none() {
+                        return Ok(None);
+                    }
+                }
+            }
+        }
+    }
+
+    /// Find the next (rg, col) whose queue contains an unconsumed data
+    /// page, and capture that page's `num_values` count in the same
+    /// lock acquisition.
+    ///
+    /// Returning `((rg, col), num_values)` together — rather than the
+    /// key alone followed by a second `state.queue.lock()` in a
+    /// downstream method — collapses what would otherwise look like a
+    /// TOCTOU shape (lock-find-unlock, then re-lock and re-find) into
+    /// a single peek. The single-consumer invariant on `StreamDecoder`
+    /// (see type docs) already guarantees the second lookup would
+    /// observe the same front data page, but capturing both fields
+    /// under one lock makes that guarantee visible at the call site
+    /// instead of forcing the reader to reconstruct it globally.
+    fn next_decodable_head(&self) -> Option<((usize, usize), usize)> {
+        for (&key, state) in self.columns.iter() {
+            let q = state.queue.lock().expect("PageQueue mutex poisoned");
+            let num_values = q.iter().find_map(|p| match p {
+                ColumnPage::DataPage { num_values, .. }
+                | ColumnPage::DataPageV2 { num_values, .. } => Some(*num_values as usize),
+                _ => None,
+            });
+            if let Some(n) = num_values {
+                return Some((key, n));
+            }
+        }
+        None
+    }
+
+    /// Decode the front data page on `key`'s state's queue, consuming
+    /// the `num_values_in_page` records captured by
+    /// [`Self::next_decodable_head`].
+    ///
+    /// For columns with repetition (max_rep_level > 0, i.e. List<T> /
+    /// LargeList<T>), pre-fetches one more page from the stream before
+    /// driving `read_records` so `PageQueueReader::peek_next_page`
+    /// returns accurate metadata when parquet-rs's column reader
+    /// checks `at_record_boundary` for V1 record continuation. The
+    /// pre-fetch only `push_back`s onto a queue; it cannot displace
+    /// the front data page that `num_values_in_page` was captured
+    /// from, so the captured value remains valid across the await.
+    ///
+    /// For flat columns (max_rep_level == 0), the pre-fetch is skipped:
+    /// flat values have no record-spanning concern (each value = one
+    /// record), and pre-fetching would advance the underlying stream
+    /// past the current column chunk — which is unsafe if a caller
+    /// drops this decoder mid-traversal (e.g., the merge engine's
+    /// phase 0 sort-col drain followed by a separate phase 3 body-col
+    /// streaming pass that constructs a fresh decoder over the same
+    /// `ColumnPageStream`). The lookahead's only benefit is the V1
+    /// list-record-spanning correctness, which doesn't apply to flat
+    /// columns.
+    async fn decode_head(
+        &mut self,
+        key: (usize, usize),
+        num_values_in_page: usize,
+    ) -> Result<Option<DecodedPage>, PageDecodeError> {
+        // Decide whether the current column needs the lookahead.
+        let needs_lookahead = {
+            let parquet_schema = self.metadata.file_metadata().schema_descr();
+            let col_descr = parquet_schema.column(key.1);
+            col_descr.max_rep_level() > 0
+        };
+
+        if needs_lookahead && !self.eof {
+            match self.stream.as_mut().next_page().await? {
+                Some(page) => self.route_page_to_queue(page)?,
+                None => self.eof = true,
+            }
+        }
+
+        let state = self
+            .columns
+            .get_mut(&key)
+            .expect("caller's key must have a state");
+        let array = decode_one_data_page_into_array(state, num_values_in_page)?;
+        let row_start = state.rows_decoded;
+        let page_idx_in_col = state.next_data_page_idx;
+        state.rows_decoded += array.len();
+        state.next_data_page_idx += 1;
+
+        Ok(Some(DecodedPage {
+            rg_idx: key.0,
+            col_idx: key.1,
+            page_idx_in_col,
+            row_start,
+            array,
+        }))
+    }
+
+    /// Convert a raw stream `Page` to a parquet-rs `ColumnPage` and
+    /// push it onto the appropriate (rg, col) state's queue. Skips
+    /// `INDEX_PAGE` defensively (no production writer emits it).
+    fn route_page_to_queue(&mut self, page: Page) -> Result<(), PageDecodeError> {
+        if page.header.type_ == PageType::INDEX_PAGE {
+            return Ok(());
+        }
+        let key = (page.rg_idx, page.col_idx);
+        if !self.columns.contains_key(&key) {
+            let state = init_column_state(&self.metadata, key)?;
+            self.columns.insert(key, state);
+        }
+        let col_meta = self.metadata.row_group(key.0).column(key.1);
+        let physical = col_meta.column_type();
+        let compression = col_meta.compression();
+        let col_page = convert_page(&page, physical, compression, key)?;
+        let state = self.columns.get_mut(&key).expect("just inserted above");
+        state
+            .queue
+            .lock()
+            .expect("PageQueue mutex poisoned")
+            .push_back(col_page);
+        Ok(())
+    }
+
+    /// File metadata. Schema, row-group layout, and KV `qh.*` metadata
+    /// come from here.
+    pub fn metadata(&self) -> &Arc<ParquetMetaData> {
+        &self.metadata
+    }
+}
+
+impl StreamDecoder<'static> {
+    /// Build a decoder that owns its stream. Use this when the decoder
+    /// must outlive the scope that constructed the stream — e.g., when
+    /// the same decoder must be reused across multiple traversals of the
+    /// same column chunk, since reconstructing it would reset the
+    /// per-column `rows_decoded` counter and lose any cached dictionary
+    /// page.
+    pub fn from_owned(stream: Box<dyn ColumnPageStream>) -> Self {
+        let metadata = Arc::clone(stream.metadata());
+        Self {
+            stream: StreamSource::Owned(stream),
+            metadata,
+            columns: HashMap::new(),
+            eof: false,
+        }
+    }
+}
+
+// -------- Per-(rg, col) initialisation --------
+
+fn init_column_state(
+    metadata: &Arc<ParquetMetaData>,
+    key: (usize, usize),
+) -> Result<ColumnState, PageDecodeError> {
+    let (rg_idx, col_idx) = key;
+    let parquet_schema = metadata.file_metadata().schema_descr();
+    let col_descr: ColumnDescPtr = parquet_schema.column(col_idx);
+
+    // Compute the arrow Field for this column. We use
+    // `parquet_to_arrow_schema` over the full schema and pick the
+    // matching top-level field. Most columns will be flat (one leaf
+    // per top-level field), so the col_idx is also the field idx.
+    // Nested columns (List<primitive>) still have one top-level field,
+    // matching one entry in the arrow schema.
+    //
+    // We deliberately pass `None` for kv_metadata so the
+    // `ARROW:schema` hint is ignored — that hint reconstructs the
+    // writer's original Dictionary<...> types, but the parquet
+    // column reader decodes values back to their physical type (Utf8
+    // / Binary for byte-array columns), so a `Dictionary` Arrow field
+    // wouldn't match the array we produce. PR-6b's union schema
+    // normalises strings to Utf8 anyway, and the streaming writer
+    // re-applies dict encoding on output based on observed cardinality.
+    let arrow_schema = parquet::arrow::parquet_to_arrow_schema(parquet_schema, None)?;
+    let field = arrow_schema
+        .fields()
+        .get(col_idx)
+        .ok_or_else(|| {
+            PageDecodeError::Parquet(ParquetError::General(format!(
+                "column index {col_idx} out of range for arrow schema (rg {rg_idx})",
+            )))
+        })?
+        .clone();
+
+    let queue: Arc<Mutex<VecDeque<ColumnPage>>> = Arc::new(Mutex::new(VecDeque::with_capacity(2)));
+    let page_reader: Box<dyn parquet::column::page::PageReader> =
+        Box::new(PageQueueReader::new(Arc::clone(&queue)));
+    let reader = get_column_reader(col_descr, page_reader);
+
+    Ok(ColumnState {
+        queue,
+        reader,
+        rows_decoded: 0,
+        next_data_page_idx: 0,
+        field,
+    })
+}
+
+// -------- PageReader over a shared queue --------
+
+/// Implements [`parquet::column::page::PageReader`] over an
+/// `Arc<Mutex<VecDeque<ColumnPage>>>`. The owning [`StreamDecoder`]
+/// pushes converted pages into the queue; the reader pops them on
+/// demand. When the queue is empty the reader returns `Ok(None)`,
+/// which the column reader interprets as "no more pages for this
+/// column chunk *right now*" and stops mid-decode. We decode one data
+/// page per call: the column reader drains exactly `num_values`
+/// records from it in one pass, so between calls the queue holds at
+/// most the one lookahead page pre-fetched for repeated columns, and
+/// the next data page is queued before the next decode.
+struct PageQueueReader {
+    queue: Arc<Mutex<VecDeque<ColumnPage>>>,
+}
+
+impl PageQueueReader {
+    fn new(queue: Arc<Mutex<VecDeque<ColumnPage>>>) -> Self {
+        Self { queue }
+    }
+}
+
+impl Iterator for PageQueueReader {
+    type Item = parquet::errors::Result<ColumnPage>;
+
+    fn next(&mut self) -> Option<Self::Item> {
+        let popped = self
+            .queue
+            .lock()
+            .expect("PageQueue mutex poisoned")
+            .pop_front();
+        popped.map(Ok)
+    }
+}
+
+impl parquet::column::page::PageReader for PageQueueReader {
+    fn get_next_page(&mut self) -> parquet::errors::Result<Option<ColumnPage>> {
+        Ok(self
+            .queue
+            .lock()
+            .expect("PageQueue mutex poisoned")
+            .pop_front())
+    }
+
+    fn peek_next_page(
+        &mut self,
+    ) -> parquet::errors::Result<Option<parquet::column::page::PageMetadata>> {
+        // Used by the rep-level decoder (`at_record_boundary`) to know
+        // whether the next page begins a new record. We build the
+        // metadata directly from the front of our queue.
+        let guard = self.queue.lock().expect("PageQueue mutex poisoned");
+        Ok(guard.front().map(page_metadata_from_column_page))
+    }
+
+    fn skip_next_page(&mut self) -> parquet::errors::Result<()> {
+        let mut guard = self.queue.lock().expect("PageQueue mutex poisoned");
+        guard.pop_front();
+        Ok(())
+    }
+}
+
+/// Build a [`parquet::column::page::PageMetadata`] from a decoded
+/// [`ColumnPage`]. Mirrors the shape of parquet-rs's
+/// `TryFrom<&PageHeader> for PageMetadata` for the variants we use.
+fn page_metadata_from_column_page(p: &ColumnPage) -> parquet::column::page::PageMetadata {
+    match p {
+        ColumnPage::DataPage { num_values, .. } => parquet::column::page::PageMetadata {
+            num_rows: None,
+            num_levels: Some(*num_values as usize),
+            is_dict: false,
+        },
+        ColumnPage::DataPageV2 {
+            num_values,
+            num_rows,
+            ..
+        } => parquet::column::page::PageMetadata {
+            num_rows: Some(*num_rows as usize),
+            num_levels: Some(*num_values as usize),
+            is_dict: false,
+        },
+        ColumnPage::DictionaryPage { .. } => parquet::column::page::PageMetadata {
+            num_rows: None,
+            num_levels: None,
+            is_dict: true,
+        },
+    }
+}
+
+// -------- Page conversion (our format::Page → column::page::Page) --------
+
+/// Convert our [`Page`] to parquet-rs's [`ColumnPage`] enum, decompressing
+/// the body bytes where applicable.
+fn convert_page(
+    page: &Page,
+    physical: PhysicalType,
+    compression: parquet::basic::Compression,
+    key: (usize, usize),
+) -> Result<ColumnPage, PageDecodeError> {
+    let header: &PageHeader = &page.header;
+    let raw = &page.bytes;
+
+    // For DATA_PAGE_V2, def + rep levels precede the value bytes and
+    // are NOT compressed. Only the suffix beyond the levels is
+    // compressed (and only when `is_compressed` is true).
+    let mut levels_prefix_len = 0usize;
+    let mut can_decompress = true;
+    if let Some(v2) = header.data_page_header_v2.as_ref() {
+        if v2.definition_levels_byte_length < 0 || v2.repetition_levels_byte_length < 0 {
+            return Err(PageDecodeError::Parquet(ParquetError::General(format!(
+                "DataPageV2 at ({}, {}) has negative level byte lengths",
+                key.0, key.1,
+            ))));
+        }
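+        // Both lengths are non-negative here; widen before adding so a
+        // malformed header cannot overflow the i32 sum.
+        levels_prefix_len = v2.definition_levels_byte_length as usize
+            + v2.repetition_levels_byte_length as usize;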
+        can_decompress = v2.is_compressed.unwrap_or(true);
+    }
+
+    let body: Bytes = decompress_page_body(
+        raw,
+        compression,
+        header.uncompressed_page_size as usize,
+        levels_prefix_len,
+        can_decompress,
+    )?;
+
+    let _ = physical; // currently unused; kept for future page-type-specific validation
+
+    match header.type_ {
+        PageType::DICTIONARY_PAGE => {
+            let h = header.dictionary_page_header.as_ref().ok_or_else(|| {
+                PageDecodeError::Parquet(ParquetError::General(
+                    "dictionary page header missing".into(),
+                ))
+            })?;
+            Ok(ColumnPage::DictionaryPage {
+                buf: body,
+                num_values: h.num_values as u32,
+                encoding: format_encoding_to_basic(h.encoding, key)?,
+                is_sorted: h.is_sorted.unwrap_or(false),
+            })
+        }
+        PageType::DATA_PAGE => {
+            let h = header.data_page_header.as_ref().ok_or_else(|| {
+                PageDecodeError::Parquet(ParquetError::General("data page header missing".into()))
+            })?;
+            Ok(ColumnPage::DataPage {
+                buf: body,
+                num_values: h.num_values as u32,
+                encoding: format_encoding_to_basic(h.encoding, key)?,
+                def_level_encoding: format_encoding_to_basic(h.definition_level_encoding, key)?,
+                rep_level_encoding: format_encoding_to_basic(h.repetition_level_encoding, key)?,
+                statistics: None,
+            })
+        }
+        PageType::DATA_PAGE_V2 => {
+            let h = header.data_page_header_v2.as_ref().ok_or_else(|| {
+                PageDecodeError::Parquet(ParquetError::General(
+                    "data page v2 header missing".into(),
+                ))
+            })?;
+            Ok(ColumnPage::DataPageV2 {
+                buf: body,
+                num_values: h.num_values as u32,
+                encoding: format_encoding_to_basic(h.encoding, key)?,
+                num_nulls: h.num_nulls as u32,
+                num_rows: h.num_rows as u32,
+                def_levels_byte_len: h.definition_levels_byte_length as u32,
+                rep_levels_byte_len: h.repetition_levels_byte_length as u32,
+                is_compressed: h.is_compressed.unwrap_or(true),
+                statistics: None,
+            })
+        }
+        other => Err(PageDecodeError::Parquet(ParquetError::General(format!(
+            "unexpected page type {other:?} at ({}, {})",
+            key.0, key.1,
+        )))),
+    }
+}
+
+fn decompress_page_body(
+    raw: &Bytes,
+    compression: parquet::basic::Compression,
+    uncompressed_page_size: usize,
+    levels_prefix_len: usize,
+    can_decompress: bool,
+) -> Result<Bytes, PageDecodeError> {
+    if !can_decompress {
+        // DataPageV2 with is_compressed=false: body is already plain.
+        return Ok(raw.clone());
+    }
+    let codec_opt: Option<Box<dyn Codec>> = create_codec(compression, &CodecOptions::default())?;
+    let mut codec: Box<dyn Codec> = match codec_opt {
+        Some(c) => c,
+        None => {
+            // UNCOMPRESSED.
+            return Ok(raw.clone());
+        }
+    };
+
+    if levels_prefix_len > raw.len() || levels_prefix_len > uncompressed_page_size {
+        return Err(PageDecodeError::Parquet(ParquetError::General(format!(
+            "level prefix length {levels_prefix_len} exceeds page bounds",
+        ))));
+    }
+
+    let mut out: Vec<u8> = Vec::with_capacity(uncompressed_page_size);
+    out.extend_from_slice(&raw[..levels_prefix_len]);
+    let values_uncompressed = uncompressed_page_size - levels_prefix_len;
+    if values_uncompressed > 0 {
+        codec.decompress(
+            &raw[levels_prefix_len..],
+            &mut out,
+            Some(values_uncompressed),
+        )?;
+    }
+    if out.len() != uncompressed_page_size {
+        return Err(PageDecodeError::Parquet(ParquetError::General(format!(
+            "decompressed size {} does not match uncompressed_page_size {}",
+            out.len(),
+            uncompressed_page_size,
+        ))));
+    }
+    Ok(Bytes::from(out))
+}
+
+/// Translate the Thrift-wrapped `format::Encoding` (i32) to the
+/// strongly-typed `basic::Encoding` parquet-rs uses for page decoding.
+/// `parquet-rs` doesn't expose a public `From<i32>` so we mirror the
+/// match here.
+fn format_encoding_to_basic(
+    encoding: parquet::format::Encoding,
+    key: (usize, usize),
+) -> Result<BasicEncoding, PageDecodeError> {
+    let v = encoding.0;
+    match v {
+        0 => Ok(BasicEncoding::PLAIN),
+        // 1 is GROUP_VAR_INT, deprecated and never written by arrow-rs.
+        2 => Ok(BasicEncoding::PLAIN_DICTIONARY),
+        3 => Ok(BasicEncoding::RLE),
+        4 => Ok(BasicEncoding::BIT_PACKED),
+        5 => Ok(BasicEncoding::DELTA_BINARY_PACKED),
+        6 => Ok(BasicEncoding::DELTA_LENGTH_BYTE_ARRAY),
+        7 => Ok(BasicEncoding::DELTA_BYTE_ARRAY),
+        8 => Ok(BasicEncoding::RLE_DICTIONARY),
+        9 => Ok(BasicEncoding::BYTE_STREAM_SPLIT),
+        _ => Err(PageDecodeError::UnsupportedEncoding {
+            rg_idx: key.0,
+            col_idx: key.1,
+            encoding: v,
+        }),
+    }
+}
+
+// -------- Decode one data page into an Arrow ArrayRef --------
+
+const READ_BATCH: usize = 4096;
+
+fn decode_one_data_page_into_array(
+    state: &mut ColumnState,
+    num_values: usize,
+) -> Result<ArrayRef, PageDecodeError> {
+    match &mut state.reader {
+        ColumnReader::BoolColumnReader(r) => {
+            let (records, defs, _reps, values) = read_typed::<BoolType>(r, num_values)?;
+            build_bool_array(&state.field, records, &defs, &values)
+        }
+        ColumnReader::Int32ColumnReader(r) => {
+            let (records, defs, reps, values) = read_typed::<Int32Type>(r, num_values)?;
+            build_int32_array(&state.field, records, &defs, &reps, &values)
+        }
+        ColumnReader::Int64ColumnReader(r) => {
+            let (records, defs, reps, values) = read_typed::<Int64Type>(r, num_values)?;
+            build_int64_array(&state.field, records, &defs, &reps, &values)
+        }
+        ColumnReader::FloatColumnReader(r) => {
+            let (records, defs, reps, values) = read_typed::<FloatType>(r, num_values)?;
+            build_float32_array(&state.field, records, &defs, &reps, &values)
+        }
+        ColumnReader::DoubleColumnReader(r) => {
+            let (records, defs, reps, values) = read_typed::<DoubleType>(r, num_values)?;
+            build_float64_array(&state.field, records, &defs, &reps, &values)
+        }
+        ColumnReader::ByteArrayColumnReader(r) => {
+            let (records, defs, _reps, values) = read_typed::<ByteArrayType>(r, num_values)?;
+            build_byte_array(&state.field, records, &defs, &values)
+        }
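+        // Report the actual physical type for each unsupported reader so
+        // FIXED_LEN_BYTE_ARRAY columns are not mislabelled as INT96.
+        ColumnReader::Int96ColumnReader(_) => Err(PageDecodeError::UnsupportedColumnType {
+            name: state.field.name().to_string(),
+            physical: PhysicalType::INT96,
+            arrow: state.field.data_type().clone(),
+        }),
+        ColumnReader::FixedLenByteArrayColumnReader(_) => {
+            Err(PageDecodeError::UnsupportedColumnType {
+                name: state.field.name().to_string(),
+                physical: PhysicalType::FIXED_LEN_BYTE_ARRAY,
+                arrow: state.field.data_type().clone(),
+            })
+        }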
+    }
+}
+
+/// `(records_read, def_levels, rep_levels, values)`.
+type ReadOutput<T> = (usize, Vec<i16>, Vec<i16>, Vec<<T as ParquetDataType>::T>);
+
+/// Read up to `num_values` records out of one typed column reader,
+/// returning `(records_read, def_levels, rep_levels, values)`. The reader
+/// pulls pages from its `PageQueueReader`; since we push exactly one
+/// data page (plus optional dictionary) before each call and the data
+/// page advertises `num_values` records, this single call consumes the
+/// queued page in full.
+fn read_typed<T>(
+    reader: &mut parquet::column::reader::ColumnReaderImpl<T>,
+    num_values: usize,
+) -> Result<ReadOutput<T>, PageDecodeError>
+where
+    T: ParquetDataType,
+    T::T: Default + Clone,
+{
+    let mut values: Vec<T::T> = Vec::with_capacity(num_values);
+    let mut def_levels: Vec<i16> = Vec::new();
+    let mut rep_levels: Vec<i16> = Vec::new();
+    let mut total_records = 0usize;
+
+    while total_records < num_values {
+        let want = num_values - total_records;
+        let (records, _values_read, _levels_read) = reader.read_records(
+            want.min(READ_BATCH),
+            Some(&mut def_levels),
+            Some(&mut rep_levels),
+            &mut values,
+        )?;
+        if records == 0 {
+            break;
+        }
+        total_records += records;
+    }
+    Ok((total_records, def_levels, rep_levels, values))
+}
+
+// -------- Array builders, parquet physical type → arrow ArrayRef --------
+
+// ----------------------------------------------------------------------
+// Arrow array construction from (def_levels, rep_levels, raw values).
+//
+// The `build_*` family below mirrors
+// `parquet::arrow::array_reader::PrimitiveArrayReader::consume_batch` and
+// the `coerce_i32` / `coerce_i64` widening helpers in parquet-rs, plus
+// `make_byte_array_reader` / `ListArrayReader` for the byte and list
+// branches. We re-implement instead of importing because that module is
+// `#[doc(hidden)]` and gated by parquet-rs's `experimental` Cargo
+// feature, which we don't enable — parquet-rs explicitly reserves the
+// right to break that surface across versions. If `array_reader` is
+// ever stabilised, this whole section becomes a thin adapter.
+//
+// `build_primitive` factors out the def-levels → null-buffer + gather-
+// and-cast → typed-array path shared by every fixed-width numeric
+// builder, mirroring parquet-rs's coercion table.
+// ----------------------------------------------------------------------
+
+fn null_buffer_from_defs(num_records: usize, defs: &[i16], max_def: i16) -> Option<NullBuffer> {
+    if defs.is_empty() || max_def == 0 {
+        return None;
+    }
+    let presence: Vec<bool> = defs
+        .iter()
+        .take(num_records)
+        .map(|d| *d >= max_def)
+        .collect();
+    Some(NullBuffer::from(presence))
+}
+
+fn build_bool_array(
+    field: &Field,
+    records: usize,
+    defs: &[i16],
+    values: &[bool],
+) -> Result<ArrayRef, PageDecodeError> {
+    let max_def = max_def_for(field);
+    let arr = if defs.is_empty() || max_def == 0 {
+        BooleanArray::from(values.to_vec())
+    } else {
+        let mut full: Vec<Option<bool>> = Vec::with_capacity(records);
+        let mut val_idx = 0usize;
+        for d in defs.iter().take(records) {
+            if *d >= max_def {
+                full.push(Some(values[val_idx]));
+                val_idx += 1;
+            } else {
+                full.push(None);
+            }
+        }
+        BooleanArray::from(full)
+    };
+    Ok(Arc::new(arr))
+}
+
+/// Build a nullable Arrow primitive array of marker type `P` from the
+/// physical `values` buffer plus a precomputed null buffer.
+///
+/// Gathers raw values into present slots according to `defs` (using
+/// [`scalar_buffer_from_present`]) and casts each through `cast`. This
+/// is the same shape parquet-rs uses internally — see
+/// `parquet::arrow::array_reader::PrimitiveArrayReader::consume_batch`
+/// plus the `coerce_i32` / `coerce_i64` helpers — but factored out of
+/// each per-physical-type builder so the cast-and-construct lines stay
+/// readable.
+fn build_primitive<P, T, F>(
+    nulls: Option<NullBuffer>,
+    records: usize,
+    defs: &[i16],
+    max_def: i16,
+    values: &[T],
+    cast: F,
+) -> ArrayRef
+where
+    P: ArrowPrimitiveType,
+    T: Copy,
+    F: Fn(T) -> P::Native,
+{
+    let scalars =
+        scalar_buffer_from_present::<T, P::Native, _>(records, defs, max_def, values, cast);
+    Arc::new(PrimitiveArray::<P>::new(scalars, nulls))
+}
+
+fn build_int32_array(
+    field: &Field,
+    records: usize,
+    defs: &[i16],
+    reps: &[i16],
+    values: &[i32],
+) -> Result<ArrayRef, PageDecodeError> {
+    if matches!(
+        field.data_type(),
+        DataType::List(_) | DataType::LargeList(_)
+    ) {
+        return build_list_i32_array(field, defs, reps, values);
+    }
+    let max_def = max_def_for(field);
+    let nulls = null_buffer_from_defs(records, defs, max_def);
+    // Parquet stores u8/u16/u32 as INT32 physical with an unsigned
+    // logical annotation; the on-wire i32 maps to u{8,16,32} by
+    // reinterpreting bits via `as`.
+    let arr = match field.data_type() {
+        DataType::Int8 => {
+            build_primitive::<Int8Type, _, _>(nulls, records, defs, max_def, values, |v| v as i8)
+        }
+        DataType::Int16 => {
+            build_primitive::<Int16Type, _, _>(nulls, records, defs, max_def, values, |v| v as i16)
+        }
+        DataType::Int32 => {
+            build_primitive::<ArrowInt32Type, _, _>(nulls, records, defs, max_def, values, |v| v)
+        }
+        DataType::UInt8 => {
+            build_primitive::<UInt8Type, _, _>(nulls, records, defs, max_def, values, |v| v as u8)
+        }
+        DataType::UInt16 => {
+            build_primitive::<UInt16Type, _, _>(nulls, records, defs, max_def, values, |v| v as u16)
+        }
+        DataType::UInt32 => {
+            build_primitive::<UInt32Type, _, _>(nulls, records, defs, max_def, values, |v| v as u32)
+        }
+        other => {
+            return Err(PageDecodeError::UnsupportedColumnType {
+                name: field.name().to_string(),
+                physical: PhysicalType::INT32,
+                arrow: other.clone(),
+            });
+        }
+    };
+    Ok(arr)
+}
+
+fn build_int64_array(
+    field: &Field,
+    records: usize,
+    defs: &[i16],
+    reps: &[i16],
+    values: &[i64],
+) -> Result<ArrayRef, PageDecodeError> {
+    if matches!(
+        field.data_type(),
+        DataType::List(_) | DataType::LargeList(_)
+    ) {
+        return build_list_i64_array(field, defs, reps, values);
+    }
+    let max_def = max_def_for(field);
+    let nulls = null_buffer_from_defs(records, defs, max_def);
+    let arr = match field.data_type() {
+        DataType::Int64 => {
+            build_primitive::<ArrowInt64Type, _, _>(nulls, records, defs, max_def, values, |v| v)
+        }
+        DataType::UInt64 => {
+            build_primitive::<UInt64Type, _, _>(nulls, records, defs, max_def, values, |v| v as u64)
+        }
+        other => {
+            return Err(PageDecodeError::UnsupportedColumnType {
+                name: field.name().to_string(),
+                physical: PhysicalType::INT64,
+                arrow: other.clone(),
+            });
+        }
+    };
+    Ok(arr)
+}
+
+fn build_float32_array(
+    field: &Field,
+    records: usize,
+    defs: &[i16],
+    reps: &[i16],
+    values: &[f32],
+) -> Result<ArrayRef, PageDecodeError> {
+    if matches!(
+        field.data_type(),
+        DataType::List(_) | DataType::LargeList(_)
+    ) {
+        return build_list_f32_array(field, defs, reps, values);
+    }
+    let max_def = max_def_for(field);
+    let nulls = null_buffer_from_defs(records, defs, max_def);
+    Ok(build_primitive::<Float32Type, _, _>(
+        nulls,
+        records,
+        defs,
+        max_def,
+        values,
+        |v| v,
+    ))
+}
+
+fn build_float64_array(
+    field: &Field,
+    records: usize,
+    defs: &[i16],
+    reps: &[i16],
+    values: &[f64],
+) -> Result<ArrayRef, PageDecodeError> {
+    if matches!(
+        field.data_type(),
+        DataType::List(_) | DataType::LargeList(_)
+    ) {
+        return build_list_f64_array(field, defs, reps, values);
+    }
+    let max_def = max_def_for(field);
+    let nulls = null_buffer_from_defs(records, defs, max_def);
+    Ok(build_primitive::<Float64Type, _, _>(
+        nulls,
+        records,
+        defs,
+        max_def,
+        values,
+        |v| v,
+    ))
+}
+
+fn build_byte_array(
+    field: &Field,
+    records: usize,
+    defs: &[i16],
+    values: &[ByteArray],
+) -> Result<ArrayRef, PageDecodeError> {
+    let max_def = max_def_for(field);
+    match field.data_type() {
+        DataType::Utf8 => Ok(Arc::new(build_string_array(
+            records, defs, max_def, values,
+        )?)),
+        DataType::LargeUtf8 => Ok(Arc::new(build_large_string_array(
+            records, defs, max_def, values,
+        )?)),
+        DataType::Binary => Ok(Arc::new(build_binary_array(records, defs, max_def, values))),
+        DataType::LargeBinary => Ok(Arc::new(build_large_binary_array(
+            records, defs, max_def, values,
+        ))),
+        DataType::Dictionary(_, value_type) => {
+            // Materialise as the value type (Utf8 / Binary); the merge
+            // engine's union schema normalises strings to Utf8 anyway,
+            // and the output writer re-applies dict encoding based on
+            // observed cardinality. Decoding to a Dictionary array
+            // directly would require synthesising keys; not needed.
+            match value_type.as_ref() {
+                DataType::Utf8 => Ok(Arc::new(build_string_array(
+                    records, defs, max_def, values,
+                )?)),
+                DataType::LargeUtf8 => Ok(Arc::new(build_large_string_array(
+                    records, defs, max_def, values,
+                )?)),
+                DataType::Binary => {
+                    Ok(Arc::new(build_binary_array(records, defs, max_def, values)))
+                }
+                DataType::LargeBinary => Ok(Arc::new(build_large_binary_array(
+                    records, defs, max_def, values,
+                ))),
+                _ => Err(PageDecodeError::UnsupportedColumnType {
+                    name: field.name().to_string(),
+                    physical: PhysicalType::BYTE_ARRAY,
+                    arrow: field.data_type().clone(),
+                }),
+            }
+        }
+        other => Err(PageDecodeError::UnsupportedColumnType {
+            name: field.name().to_string(),
+            physical: PhysicalType::BYTE_ARRAY,
+            arrow: other.clone(),
+        }),
+    }
+}
+
+fn build_string_array(
+    records: usize,
+    defs: &[i16],
+    max_def: i16,
+    values: &[ByteArray],
+) -> Result<StringArray, PageDecodeError> {
+    let mut builder = arrow::array::StringBuilder::with_capacity(records, records * 8);
+    if defs.is_empty() || max_def == 0 {
+        for v in values {
+            let s = std::str::from_utf8(v.data())
+                .map_err(|e| PageDecodeError::Parquet(ParquetError::General(e.to_string())))?;
+            builder.append_value(s);
+        }
+    } else {
+        let mut val_idx = 0usize;
+        for d in defs.iter().take(records) {
+            if *d >= max_def {
+                let s = std::str::from_utf8(values[val_idx].data())
+                    .map_err(|e| PageDecodeError::Parquet(ParquetError::General(e.to_string())))?;
+                builder.append_value(s);
+                val_idx += 1;
+            } else {
+                builder.append_null();
+            }
+        }
+    }
+    Ok(builder.finish())
+}
+
+fn build_large_string_array(
+    records: usize,
+    defs: &[i16],
+    max_def: i16,
+    values: &[ByteArray],
+) -> Result<LargeStringArray, PageDecodeError> {
+    let mut builder = arrow::array::LargeStringBuilder::with_capacity(records, records * 8);
+    if defs.is_empty() || max_def == 0 {
+        for v in values {
+            let s = std::str::from_utf8(v.data())
+                .map_err(|e| PageDecodeError::Parquet(ParquetError::General(e.to_string())))?;
+            builder.append_value(s);
+        }
+    } else {
+        let mut val_idx = 0usize;
+        for d in defs.iter().take(records) {
+            if *d >= max_def {
+                let s = std::str::from_utf8(values[val_idx].data())
+                    .map_err(|e| PageDecodeError::Parquet(ParquetError::General(e.to_string())))?;
+                builder.append_value(s);
+                val_idx += 1;
+            } else {
+                builder.append_null();
+            }
+        }
+    }
+    Ok(builder.finish())
+}
+
+fn build_binary_array(
+    records: usize,
+    defs: &[i16],
+    max_def: i16,
+    values: &[ByteArray],
+) -> BinaryArray {
+    let mut builder = arrow::array::BinaryBuilder::with_capacity(records, records * 8);
+    if defs.is_empty() || max_def == 0 {
+        for v in values {
+            builder.append_value(v.data());
+        }
+    } else {
+        let mut val_idx = 0usize;
+        for d in defs.iter().take(records) {
+            if *d >= max_def {
+                builder.append_value(values[val_idx].data());
+                val_idx += 1;
+            } else {
+                builder.append_null();
+            }
+        }
+    }
+    builder.finish()
+}
+
+fn build_large_binary_array(
+    records: usize,
+    defs: &[i16],
+    max_def: i16,
+    values: &[ByteArray],
+) -> LargeBinaryArray {
+    let mut builder = arrow::array::LargeBinaryBuilder::with_capacity(records, records * 8);
+    if defs.is_empty() || max_def == 0 {
+        for v in values {
+            builder.append_value(v.data());
+        }
+    } else {
+        let mut val_idx = 0usize;
+        for d in defs.iter().take(records) {
+            if *d >= max_def {
+                builder.append_value(values[val_idx].data());
+                val_idx += 1;
+            } else {
+                builder.append_null();
+            }
+        }
+    }
+    builder.finish()
+}
+
+fn scalar_buffer_from_present<T, U, F>(
+    records: usize,
+    defs: &[i16],
+    max_def: i16,
+    values: &[T],
+    cast: F,
+) -> ScalarBuffer<U>
+where
+    T: Copy,
+    U: arrow::datatypes::ArrowNativeType,
+    F: Fn(T) -> U,
+{
+    if defs.is_empty() || max_def == 0 {
+        let casted: Vec<U> = values.iter().copied().map(&cast).collect();
+        return ScalarBuffer::from(casted);
+    }
+    let mut out: Vec<U> = Vec::with_capacity(records);
+    let mut val_idx = 0usize;
+    for d in defs.iter().take(records) {
+        if *d >= max_def {
+            out.push(cast(values[val_idx]));
+            val_idx += 1;
+        } else {
+            out.push(U::default());
+        }
+    }
+    ScalarBuffer::from(out)
+}
+
+/// Compute the list offsets vector from Dremel def/rep levels
+/// for the `List<non-nullable primitive>` shape used by DDSketch
+/// `keys` / `counts`. `max_def = 1`, `max_rep = 1`. Each entry with
+/// `def == 1` is a present element; `def == 0` is an empty list slot
+/// (no element). `rep == 0` starts a new row; `rep == 1` continues the
+/// current list. For example, `defs = [1, 1, 0, 1]` with
+/// `reps = [0, 1, 0, 0]` encodes the rows `[a, b]`, `[]`, `[c]` and
+/// yields offsets `[0, 2, 2, 3]`.
+fn list_offsets_from_levels(defs: &[i16], reps: &[i16]) -> Vec<i64> {
+    // Compute as `i64` so the same offset buffer can back either a
+    // `ListArray` (truncated to i32 by `wrap_inner_in_list`) or a
+    // `LargeListArray`. For our `max_def = 1, max_rep = 1` shape
+    // there's no risk of overflow in either width.
+    if defs.is_empty() {
+        return vec![0];
+    }
+    let mut offsets: Vec<i64> = Vec::with_capacity(defs.len() + 1);
+    offsets.push(0);
+    let mut current_len: i64 = 0;
+    for i in 0..defs.len() {
+        if i > 0 && reps[i] == 0 {
+            // Boundary between rows: close the previous row's list.
+            offsets.push(current_len);
+        }
+        if defs[i] == 1 {
+            current_len += 1;
+        }
+    }
+    offsets.push(current_len);
+    offsets
+}
+
+/// Build a `ListArray` / `LargeListArray` over INT32-physical values
+/// (Arrow inner type Int8/Int16/Int32 or UInt8/UInt16/UInt32). Outer +
+/// inner must be non-nullable.
+fn build_list_i32_array(
+    field: &Field,
+    defs: &[i16],
+    reps: &[i16],
+    values: &[i32],
+) -> Result<ArrayRef, PageDecodeError> {
+    let inner_field = list_inner_field(field);
+    if field.is_nullable() || inner_field.is_nullable() {
+        return Err(PageDecodeError::UnsupportedColumnType {
+            name: field.name().to_string(),
+            physical: PhysicalType::INT32,
+            arrow: field.data_type().clone(),
+        });
+    }
+    let inner_array: ArrayRef = match inner_field.data_type() {
+        DataType::Int8 => {
+            let v: Vec<i8> = values.iter().map(|&x| x as i8).collect();
+            Arc::new(arrow::array::Int8Array::from(v))
+        }
+        DataType::Int16 => {
+            let v: Vec<i16> = values.iter().map(|&x| x as i16).collect();
+            Arc::new(Int16Array::from(v))
+        }
+        DataType::Int32 => Arc::new(Int32Array::from(values.to_vec())),
+        DataType::UInt8 => {
+            let v: Vec<u8> = values.iter().map(|&x| x as u8).collect();
+            Arc::new(UInt8Array::from(v))
+        }
+        DataType::UInt16 => {
+            let v: Vec<u16> = values.iter().map(|&x| x as u16).collect();
+            Arc::new(UInt16Array::from(v))
+        }
+        DataType::UInt32 => {
+            // Bit-reinterpret cast preserves the unsigned-logical
+            // round trip — same convention as the flat path.
+            let v: Vec<u32> = values.iter().map(|&x| x as u32).collect();
+            Arc::new(UInt32Array::from(v))
+        }
+        other => {
+            return Err(PageDecodeError::UnsupportedColumnType {
+                name: field.name().to_string(),
+                physical: PhysicalType::INT32,
+                arrow: other.clone(),
+            });
+        }
+    };
+    wrap_inner_in_list(field, inner_field, inner_array, defs, reps)
+}
+
+/// Build a `ListArray` / `LargeListArray` over INT64-physical values
+/// (Arrow inner type Int64 or UInt64). Outer + inner must be
+/// non-nullable. DDSketch `counts` is the primary use.
+fn build_list_i64_array(
+    field: &Field,
+    defs: &[i16],
+    reps: &[i16],
+    values: &[i64],
+) -> Result<ArrayRef, PageDecodeError> {
+    let inner_field = list_inner_field(field);
+    if field.is_nullable() || inner_field.is_nullable() {
+        return Err(PageDecodeError::UnsupportedColumnType {
+            name: field.name().to_string(),
+            physical: PhysicalType::INT64,
+            arrow: field.data_type().clone(),
+        });
+    }
+    let inner_array: ArrayRef = match inner_field.data_type() {
+        DataType::Int64 => Arc::new(Int64Array::from(values.to_vec())),
+        DataType::UInt64 => {
+            let v: Vec<u64> = values.iter().map(|&x| x as u64).collect();
+            Arc::new(UInt64Array::from(v))
+        }
+        other => {
+            return Err(PageDecodeError::UnsupportedColumnType {
+                name: field.name().to_string(),
+                physical: PhysicalType::INT64,
+                arrow: other.clone(),
+            });
+        }
+    };
+    wrap_inner_in_list(field, inner_field, inner_array, defs, reps)
+}
+
+/// Build a `ListArray` / `LargeListArray` over `Float32` (parquet
+/// physical type `FLOAT`). Outer + inner must be non-nullable.
+fn build_list_f32_array(
+    field: &Field,
+    defs: &[i16],
+    reps: &[i16],
+    values: &[f32],
+) -> Result<ArrayRef, PageDecodeError> {
+    let inner_field = list_inner_field(field);
+    if field.is_nullable() || inner_field.is_nullable() {
+        return Err(PageDecodeError::UnsupportedColumnType {
+            name: field.name().to_string(),
+            physical: PhysicalType::FLOAT,
+            arrow: field.data_type().clone(),
+        });
+    }
+    let inner_array: ArrayRef = match inner_field.data_type() {
+        DataType::Float32 => Arc::new(Float32Array::from(values.to_vec())),
+        other => {
+            return Err(PageDecodeError::UnsupportedColumnType {
+                name: field.name().to_string(),
+                physical: PhysicalType::FLOAT,
+                arrow: other.clone(),
+            });
+        }
+    };
+    wrap_inner_in_list(field, inner_field, inner_array, defs, reps)
+}
+
+/// Build a `ListArray` / `LargeListArray` over `Float64` (parquet
+/// physical type `DOUBLE`). Outer + inner must be non-nullable.
+fn build_list_f64_array(
+    field: &Field,
+    defs: &[i16],
+    reps: &[i16],
+    values: &[f64],
+) -> Result<ArrayRef, PageDecodeError> {
+    let inner_field = list_inner_field(field);
+    if field.is_nullable() || inner_field.is_nullable() {
+        return Err(PageDecodeError::UnsupportedColumnType {
+            name: field.name().to_string(),
+            physical: PhysicalType::DOUBLE,
+            arrow: field.data_type().clone(),
+        });
+    }
+    let inner_array: ArrayRef = match inner_field.data_type() {
+        DataType::Float64 => Arc::new(Float64Array::from(values.to_vec())),
+        other => {
+            return Err(PageDecodeError::UnsupportedColumnType {
+                name: field.name().to_string(),
+                physical: PhysicalType::DOUBLE,
+                arrow: other.clone(),
+            });
+        }
+    };
+    wrap_inner_in_list(field, inner_field, inner_array, defs, reps)
+}
+
+/// Extract the inner field from a `List<T>` or `LargeList<T>`
+/// outer field. Callers must already have checked the outer type.
+fn list_inner_field(field: &Field) -> Arc<Field> {
+    match field.data_type() {
+        DataType::List(inner) | DataType::LargeList(inner) => Arc::clone(inner),
+        _ => unreachable!("caller guards on List/LargeList"),
+    }
+}
+
+/// Wrap a decoded inner array in `ListArray` (i32 offsets) or
+/// `LargeListArray` (i64 offsets) according to the outer field's
+/// `DataType`. This preserves the schema's list flavour through the
+/// page decoder — `LargeList<T>` inputs round-trip to `LargeListArray`,
+/// not `ListArray`, so downstream schema validation sees the right
+/// type. Reps + defs are interpreted under `max_def = 1, max_rep = 1`
+/// (the writer's contract for non-nullable outer + non-nullable inner
+/// lists).
+fn wrap_inner_in_list(
+    field: &Field,
+    inner_field: Arc<Field>,
+    inner_array: ArrayRef,
+    defs: &[i16],
+    reps: &[i16],
+) -> Result<ArrayRef, PageDecodeError> {
+    let i64_offsets = list_offsets_from_levels(defs, reps);
+    match field.data_type() {
+        DataType::LargeList(_) => {
+            let offsets = OffsetBuffer::new(ScalarBuffer::from(i64_offsets));
+            Ok(Arc::new(LargeListArray::new(
+                inner_field,
+                offsets,
+                inner_array,
+                None,
+            )))
+        }
+        DataType::List(_) => {
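+            // NOTE: `as i32` wraps silently if an offset exceeds `i32::MAX`;
+            // correctness relies on the decoded offsets staying below it.
+            let i32_offsets: Vec<i32> = i64_offsets.iter().map(|&o| o as i32).collect();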
+            let offsets = OffsetBuffer::new(ScalarBuffer::from(i32_offsets));
+            Ok(Arc::new(ListArray::new(
+                inner_field,
+                offsets,
+                inner_array,
+                None,
+            )))
+        }
+        _ => unreachable!("caller guards on List/LargeList"),
+    }
+}
+
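+/// Maximum definition level implied by the field's nullability:
+/// 1 if the field is nullable, 0 otherwise.
+fn max_def_for(field: &Field) -> i16 {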
+    if field.is_nullable() { 1 } else { 0 }
+}
+
+#[cfg(test)]
+mod tests {
+    use std::io;
+    use std::ops::Range;
+    use std::path::{Path, PathBuf};
+
+    use arrow::array::{ArrayRef, DictionaryArray, RecordBatch};
+    use arrow::compute::concat_batches;
+    use arrow::datatypes::{Field as ArrowField, Int32Type, Schema as ArrowSchema};
+    use async_trait::async_trait;
+    use parquet::arrow::ArrowWriter;
+    use parquet::basic::Compression;
+    use parquet::file::properties::WriterProperties;
+    use parquet::file::reader::{FileReader, SerializedFileReader};
+    use tokio::io::AsyncRead;
+
+    use super::*;
+    use crate::storage::streaming_reader::{
+        ColumnPageStream, RemoteByteSource, StreamingParquetReader,
+    };
+
+    fn make_metrics_batch(num_rows: usize) -> RecordBatch {
+        let dict_type = DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8));
+        let schema = Arc::new(ArrowSchema::new(vec![
+            ArrowField::new("metric_name", dict_type.clone(), false),
+            ArrowField::new("metric_type", DataType::UInt8, false),
+            ArrowField::new("timestamp_secs", DataType::UInt64, false),
+            ArrowField::new("value", DataType::Float64, false),
+            ArrowField::new("timeseries_id", DataType::Int64, false),
+            ArrowField::new("service", dict_type, true),
+        ]));
+
+        let metric_keys: Vec<i32> = (0..num_rows as i32).map(|i| i % 2).collect();
+        let metric_values = StringArray::from(vec!["cpu.usage", "memory.used"]);
+        let metric_name: ArrayRef = Arc::new(
+            DictionaryArray::<Int32Type>::try_new(
+                arrow::array::Int32Array::from(metric_keys),
+                Arc::new(metric_values),
+            )
+            .unwrap(),
+        );
+        let metric_type: ArrayRef = Arc::new(UInt8Array::from(vec![0u8; num_rows]));
+        let timestamps: Vec<u64> = (0..num_rows as u64).map(|i| 1_700_000_000 + i).collect();
+        let timestamp_secs: ArrayRef = Arc::new(UInt64Array::from(timestamps));
+        let values: Vec<f64> = (0..num_rows).map(|i| i as f64).collect();
+        let value: ArrayRef = Arc::new(Float64Array::from(values));
+        let tsids: Vec<i64> = (0..num_rows as i64).map(|i| 1000 + i).collect();
+        let timeseries_id: ArrayRef = Arc::new(Int64Array::from(tsids));
+        // sprinkle nulls
+        let svc_keys: Vec<Option<i32>> = (0..num_rows as i32)
+            .map(|i| if i % 5 == 0 { None } else { Some(i % 3) })
+            .collect();
+        let svc_values = StringArray::from(vec!["api", "db", "cache"]);
+        let service: ArrayRef = Arc::new(
+            DictionaryArray::<Int32Type>::try_new(
+                arrow::array::Int32Array::from(svc_keys),
+                Arc::new(svc_values),
+            )
+            .unwrap(),
+        );
+
+        RecordBatch::try_new(
+            schema,
+            vec![
+                metric_name,
+                metric_type,
+                timestamp_secs,
+                value,
+                timeseries_id,
+                service,
+            ],
+        )
+        .unwrap()
+    }
+
+    fn write_parquet(
+        batches: &[RecordBatch],
+        page_row_limit: Option<usize>,
+        rg_row_limit: Option<usize>,
+        compression: Compression,
+    ) -> Vec<u8> {
+        let schema = batches[0].schema();
+        let mut props_builder = WriterProperties::builder().set_compression(compression);
+        if let Some(n) = page_row_limit {
+            props_builder = props_builder.set_data_page_row_count_limit(n);
+        }
+        if let Some(n) = rg_row_limit {
+            props_builder = props_builder.set_max_row_group_row_count(Some(n));
+        }
+        let props = props_builder.build();
+        let mut buf: Vec<u8> = Vec::new();
+        let mut writer = ArrowWriter::try_new(&mut buf, schema, Some(props)).unwrap();
+        for b in batches {
+            writer.write(b).unwrap();
+        }
+        writer.close().unwrap();
+        buf
+    }
+
+    #[derive(Clone)]
+    struct InMemorySource {
+        bytes: Bytes,
+    }
+    impl InMemorySource {
+        fn new(bytes: Vec<u8>) -> Arc<Self> {
+            Arc::new(Self {
+                bytes: Bytes::from(bytes),
+            })
+        }
+    }
+    #[async_trait]
+    impl RemoteByteSource for InMemorySource {
+        async fn file_size(&self, _path: &Path) -> io::Result<u64> {
+            Ok(self.bytes.len() as u64)
+        }
+        async fn get_slice(&self, _path: &Path, range: Range<u64>) -> io::Result<Bytes> {
+            Ok(self.bytes.slice(range.start as usize..range.end as usize))
+        }
+        async fn get_slice_stream(
+            &self,
+            _path: &Path,
+            range: Range<u64>,
+        ) -> io::Result<Box<dyn AsyncRead + Send + Unpin>> {
+            let slice = self.bytes.slice(range.start as usize..range.end as usize);
+            Ok(Box::new(io::Cursor::new(slice.to_vec())))
+        }
+    }
+    fn dummy_path() -> PathBuf {
+        PathBuf::from("test.parquet")
+    }
+
+    fn read_canonical(bytes: &[u8]) -> RecordBatch {
+        let cursor = Bytes::copy_from_slice(bytes);
+        let builder =
+            parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder::try_new(cursor).unwrap();
+        let schema = builder.schema().clone();
+        let reader = builder.build().unwrap();
+        let batches: Vec<RecordBatch> = reader.collect::<Result<Vec<_>, _>>().unwrap();
+        if batches.is_empty() {
+            RecordBatch::new_empty(schema)
+        } else {
+            concat_batches(&schema, &batches).unwrap()
+        }
+    }
+
+    /// Drain every DecodedPage from the decoder and assemble the
+    /// resulting per-(rg, col) arrays back into a single RecordBatch by
+    /// concatenation in storage order. This is what PR-6b would do if it
+    /// wanted a full RecordBatch view; the decoder itself never
+    /// materialises one.
+    async fn drain_to_record_batch(reader: &mut StreamingParquetReader) -> RecordBatch {
+        let metadata = Arc::clone(reader.metadata());
+        let parquet_schema = metadata.file_metadata().schema_descr();
+        // None for kv_metadata: skip the ARROW:schema hint so the
+        // computed schema matches what the decoder actually produces
+        // (Utf8 instead of Dictionary<Int32, Utf8>, etc.).
+        let arrow_schema = parquet::arrow::parquet_to_arrow_schema(parquet_schema, None).unwrap();
+        let num_cols = arrow_schema.fields().len();
+
+        let mut per_col: Vec<Vec<ArrayRef>> = vec![Vec::new(); num_cols];
+
+        let mut decoder = StreamDecoder::new(reader as &mut dyn ColumnPageStream);
+        while let Some(dp) = decoder.decode_next_page().await.unwrap() {
+            per_col[dp.col_idx].push(dp.array);
+        }
+
+        let columns: Vec<ArrayRef> = per_col
+            .into_iter()
+            .map(|chunks| {
+                let refs: Vec<&dyn Array> = chunks.iter().map(|a| a.as_ref()).collect();
+                arrow::compute::concat(&refs).unwrap()
+            })
+            .collect();
+        RecordBatch::try_new(Arc::new(arrow_schema), columns).unwrap()
+    }
+
+    #[tokio::test]
+    async fn test_drain_single_rg_round_trip() {
+        let batch = make_metrics_batch(64);
+        let bytes = write_parquet(
+            std::slice::from_ref(&batch),
+            None,
+            None,
+            Compression::SNAPPY,
+        );
+        let canonical = read_canonical(&bytes);
+
+        let source = InMemorySource::new(bytes);
+        let mut reader = StreamingParquetReader::try_open(source, dummy_path())
+            .await
+            .unwrap();
+        let drained = drain_to_record_batch(&mut reader).await;
+
+        // Compare per-column. Dict columns decode to their value type
+        // (Utf8), so compare against the canonical's value-cast.
+        for (col_idx, want_field) in canonical.schema().fields().iter().enumerate() {
+            let want = canonical.column(col_idx);
+            let got = drained.column(col_idx);
+            assert_eq!(want.len(), got.len(), "col {col_idx} length mismatch",);
+            // Cast the canonical to the decoded type for comparison.
+            let want_cast = arrow::compute::cast(want, got.data_type()).unwrap();
+            assert_eq!(
+                want_cast.as_ref(),
+                got.as_ref(),
+                "col {col_idx} ({}) data mismatch",
+                want_field.name(),
+            );
+        }
+    }
+
+    #[tokio::test]
+    async fn test_drain_multi_rg_round_trip() {
+        let batch = make_metrics_batch(300);
+        let bytes = write_parquet(&[batch], None, Some(100), Compression::SNAPPY);
+        let canonical = read_canonical(&bytes);
+
+        let source = InMemorySource::new(bytes.clone());
+        let sync_reader = SerializedFileReader::new(Bytes::from(bytes)).unwrap();
+        assert!(
+            sync_reader.metadata().num_row_groups() >= 2,
+            "fixture must produce multi-RG",
+        );
+
+        let mut reader = StreamingParquetReader::try_open(source, dummy_path())
+            .await
+            .unwrap();
+        let drained = drain_to_record_batch(&mut reader).await;
+        for (col_idx, _) in canonical.schema().fields().iter().enumerate() {
+            let want = canonical.column(col_idx);
+            let got = drained.column(col_idx);
+            assert_eq!(want.len(), got.len());
+            let want_cast = arrow::compute::cast(want, got.data_type()).unwrap();
+            assert_eq!(want_cast.as_ref(), got.as_ref(), "col {col_idx} mismatch");
+        }
+    }
+
+    /// Each `decode_next_page` returns exactly one data page worth of
+    /// rows — `row_start + array.len()` advances monotonically per
+    /// (rg, col), with row_start = 0 at the start of each (rg, col).
+    #[tokio::test]
+    async fn test_decoded_page_row_indexing() {
+        let batch = make_metrics_batch(2048);
+        let bytes = write_parquet(
+            std::slice::from_ref(&batch),
+            Some(256),
+            None,
+            Compression::SNAPPY,
+        );
+
+        let source = InMemorySource::new(bytes);
+        let mut reader = StreamingParquetReader::try_open(source, dummy_path())
+            .await
+            .unwrap();
+        let mut decoder = StreamDecoder::new(&mut reader as &mut dyn ColumnPageStream);
+
+        let mut per_col_cumulative: HashMap<(usize, usize), usize> = HashMap::new();
+        let mut next_idx: HashMap<(usize, usize), usize> = HashMap::new();
+        while let Some(dp) = decoder.decode_next_page().await.unwrap() {
+            let key = (dp.rg_idx, dp.col_idx);
+            let prior = per_col_cumulative.get(&key).copied().unwrap_or(0);
+            assert_eq!(
+                dp.row_start, prior,
+                "row_start for ({}, {}) page {} should equal prior cumulative",
+                dp.rg_idx, dp.col_idx, dp.page_idx_in_col,
+            );
+            per_col_cumulative.insert(key, prior + dp.array.len());
+
+            let expected_idx = next_idx.get(&key).copied().unwrap_or(0);
+            assert_eq!(dp.page_idx_in_col, expected_idx);
+            next_idx.insert(key, expected_idx + 1);
+        }
+    }
+
+    /// `decode_next_page` is idempotent at EOF.
+    #[tokio::test]
+    async fn test_eof_idempotent() {
+        let batch = make_metrics_batch(32);
+        let bytes = write_parquet(
+            std::slice::from_ref(&batch),
+            None,
+            None,
+            Compression::SNAPPY,
+        );
+        let source = InMemorySource::new(bytes);
+        let mut reader = StreamingParquetReader::try_open(source, dummy_path())
+            .await
+            .unwrap();
+        let mut decoder = StreamDecoder::new(&mut reader as &mut dyn ColumnPageStream);
+        while decoder.decode_next_page().await.unwrap().is_some() {}
+        assert!(decoder.decode_next_page().await.unwrap().is_none());
+        assert!(decoder.decode_next_page().await.unwrap().is_none());
+    }
+
+    /// Nullable columns: `service` has nulls every 5th row. The decoded
+    /// page must surface those nulls — the null mask round-trips.
+    #[tokio::test]
+    async fn test_nullable_column_round_trip() {
+        let batch = make_metrics_batch(50);
+        let bytes = write_parquet(
+            std::slice::from_ref(&batch),
+            None,
+            None,
+            Compression::SNAPPY,
+        );
+        let canonical = read_canonical(&bytes);
+        let svc_idx = canonical.schema().index_of("service").unwrap();
+        assert!(canonical.column(svc_idx).null_count() > 0);
+
+        let source = InMemorySource::new(bytes);
+        let mut reader = StreamingParquetReader::try_open(source, dummy_path())
+            .await
+            .unwrap();
+        let drained = drain_to_record_batch(&mut reader).await;
+        assert_eq!(
+            drained.column(svc_idx).null_count(),
+            canonical.column(svc_idx).null_count(),
+        );
+    }
+
+    /// Compression codec round trips. The crate's parquet feature set
+    /// is `arrow + snap + zstd` (see `quickwit/Cargo.toml`); LZ4 is
+    /// intentionally not in scope.
+    #[tokio::test]
+    async fn test_compression_codecs() {
+        for compression in [
+            Compression::UNCOMPRESSED,
+            Compression::SNAPPY,
+            Compression::ZSTD(parquet::basic::ZstdLevel::default()),
+        ] {
+            let batch = make_metrics_batch(64);
+            let bytes = write_parquet(std::slice::from_ref(&batch), None, None, compression);
+            let canonical = read_canonical(&bytes);
+            let source = InMemorySource::new(bytes);
+            let mut reader = StreamingParquetReader::try_open(source, dummy_path())
+                .await
+                .unwrap();
+            let drained = drain_to_record_batch(&mut reader).await;
+            for col_idx in 0..canonical.num_columns() {
+                let want = canonical.column(col_idx);
+                let got = drained.column(col_idx);
+                assert_eq!(want.len(), got.len());
+                let want_cast = arrow::compute::cast(want, got.data_type()).unwrap();
+                assert_eq!(
+                    want_cast.as_ref(),
+                    got.as_ref(),
+                    "compression {compression:?} col {col_idx} diverged",
+                );
+            }
+        }
+    }
+
+    /// The decoder must not buffer the row group: across a long stream,
+    /// the number of pages held in any single column's `PageQueue` at
+    /// any instant stays ≤ 2 (at most a queued dictionary plus the
+    /// current data page). This is a structural check of the page-
+    /// bounded contract.
+    #[tokio::test]
+    async fn test_page_bounded_queue_depth() {
+        let batch = make_metrics_batch(8192);
+        let bytes = write_parquet(
+            std::slice::from_ref(&batch),
+            Some(256),
+            None,
+            Compression::SNAPPY,
+        );
+        let source = InMemorySource::new(bytes);
+        let mut reader = StreamingParquetReader::try_open(source, dummy_path())
+            .await
+            .unwrap();
+        let mut decoder = StreamDecoder::new(&mut reader as &mut dyn ColumnPageStream);
+        while decoder.decode_next_page().await.unwrap().is_some() {
+            for (_, state) in decoder.columns.iter() {
+                let depth = state.queue.lock().unwrap().len();
+                assert!(
+                    depth <= 2,
+                    "PageQueue depth {depth} exceeds page-bounded contract (≤2)",
+                );
+            }
+        }
+    }
+
+    /// `List<UInt64>` (the DDSketch `counts` shape) round-trips with
+    /// variable list lengths including the empty list and `u64::MAX`.
+    /// This exercises the Dremel level → ListArray reconstruction in
+    /// `build_list_i64_array`.
+    #[tokio::test]
+    async fn test_list_uint64_round_trip() {
+        use arrow::array::ListBuilder;
+
+        let item_field = Arc::new(ArrowField::new("item", DataType::UInt64, false));
+        let schema = Arc::new(ArrowSchema::new(vec![ArrowField::new(
+            "counts",
+            DataType::List(Arc::clone(&item_field)),
+            false,
+        )]));
+
+        let rows: Vec<Vec<u64>> = vec![
+            vec![1, 2, 3],
+            vec![],
+            vec![42],
+            vec![u64::MAX, 0, 0x8000_0000_0000_0000],
+            vec![],
+            vec![100],
+        ];
+        let mut builder = ListBuilder::new(arrow::array::UInt64Builder::new())
+            .with_field(Arc::clone(&item_field));
+        for row in &rows {
+            for &v in row {
+                builder.values().append_value(v);
+            }
+            builder.append(true);
+        }
+        let counts: ArrayRef = Arc::new(builder.finish());
+        let batch = RecordBatch::try_new(schema.clone(), vec![counts]).unwrap();
+        let bytes = write_parquet(
+            std::slice::from_ref(&batch),
+            None,
+            None,
+            Compression::SNAPPY,
+        );
+
+        let source = InMemorySource::new(bytes);
+        let mut reader = StreamingParquetReader::try_open(source, dummy_path())
+            .await
+            .unwrap();
+        let mut decoder = StreamDecoder::new(&mut reader as &mut dyn ColumnPageStream);
+        let dp = decoder
+            .decode_next_page()
+            .await
+            .unwrap()
+            .expect("at least one page");
+        let got_list = dp
+            .array
+            .as_any()
+            .downcast_ref::<ListArray>()
+            .expect("counts must decode to ListArray");
+        assert_eq!(got_list.len(), rows.len());
+        for (row_idx, want) in rows.iter().enumerate() {
+            let got = got_list.value(row_idx);
+            let got_u64: Vec<u64> = got
+                .as_any()
+                .downcast_ref::<UInt64Array>()
+                .expect("inner must be UInt64Array")
+                .values()
+                .to_vec();
+            assert_eq!(got_u64, *want, "row {row_idx} list mismatch");
+        }
+    }
+
+    /// `List<Float64>` round-trips through the decoder as a `ListArray`
+    /// with a `Float64Array` inner — NOT as a flat `Float64Array`. The
+    /// type/row shape must match what the streaming writer advertises
+    /// for `List<Float64>` columns. Regression test for the codex
+    /// review on PR-6407.
+    #[tokio::test]
+    async fn test_list_float64_round_trip() {
+        use arrow::array::ListBuilder;
+
+        let item_field = Arc::new(ArrowField::new("item", DataType::Float64, false));
+        let schema = Arc::new(ArrowSchema::new(vec![ArrowField::new(
+            "samples",
+            DataType::List(Arc::clone(&item_field)),
+            false,
+        )]));
+
+        let rows: Vec<Vec<f64>> = vec![
+            vec![1.0, 2.5, -7.5],
+            vec![],
+            vec![f64::MAX, f64::MIN, 0.0],
+            vec![42.42],
+        ];
+        let mut builder = ListBuilder::new(arrow::array::Float64Builder::new())
+            .with_field(Arc::clone(&item_field));
+        for row in &rows {
+            for &v in row {
+                builder.values().append_value(v);
+            }
+            builder.append(true);
+        }
+        let samples: ArrayRef = Arc::new(builder.finish());
+        let batch = RecordBatch::try_new(schema.clone(), vec![samples]).unwrap();
+        let bytes = write_parquet(
+            std::slice::from_ref(&batch),
+            None,
+            None,
+            Compression::SNAPPY,
+        );
+
+        let source = InMemorySource::new(bytes);
+        let mut reader = StreamingParquetReader::try_open(source, dummy_path())
+            .await
+            .unwrap();
+        let mut decoder = StreamDecoder::new(&mut reader as &mut dyn ColumnPageStream);
+        let dp = decoder
+            .decode_next_page()
+            .await
+            .unwrap()
+            .expect("at least one page");
+        let got_list = dp
+            .array
+            .as_any()
+            .downcast_ref::<ListArray>()
+            .expect("samples must decode to ListArray, not flat Float64Array");
+        assert_eq!(got_list.len(), rows.len());
+        for (row_idx, want) in rows.iter().enumerate() {
+            let got = got_list.value(row_idx);
+            let got_f64: Vec<f64> = got
+                .as_any()
+                .downcast_ref::<Float64Array>()
+                .expect("inner must be Float64Array")
+                .values()
+                .to_vec();
+            assert_eq!(got_f64, *want, "row {row_idx} list mismatch");
+        }
+    }
+
+    /// `List<UInt64>` records that span multiple V1 pages are decoded
+    /// without splitting. Regression test for the codex review on
+    /// PR-6407: prior to the one-page lookahead in `decode_next_page`,
+    /// `peek_next_page` returned `None` at every page boundary, which
+    /// parquet-rs treats as "last page" — it would flush partial
+    /// repetition-level state and emit incomplete records.
+    ///
+    /// We force the issue by writing a long list with a tiny
+    /// `data_page_size_limit`, so parquet-rs splits the single
+    /// list record across multiple V1 pages. With the lookahead in
+    /// place, the column reader sees `peek_next_page = Some(_)` and
+    /// continues consuming until the record completes.
+    #[tokio::test]
+    async fn test_list_record_spanning_pages_preserved() {
+        use arrow::array::{ListBuilder, UInt64Array, UInt64Builder};
+
+        let item_field = Arc::new(ArrowField::new("item", DataType::UInt64, false));
+        let schema = Arc::new(ArrowSchema::new(vec![ArrowField::new(
+            "counts",
+            DataType::List(Arc::clone(&item_field)),
+            false,
+        )]));
+
+        // Two records: a 50-element list (forced to span several pages
+        // by the 20-byte page size limit) and a short 3-element list.
+        let row_long: Vec<u64> = (0..50u64).collect();
+        let row_short: Vec<u64> = vec![1, 2, 3];
+        let mut builder =
+            ListBuilder::new(UInt64Builder::new()).with_field(Arc::clone(&item_field));
+        for v in &row_long {
+            builder.values().append_value(*v);
+        }
+        builder.append(true);
+        for v in &row_short {
+            builder.values().append_value(*v);
+        }
+        builder.append(true);
+        let counts: ArrayRef = Arc::new(builder.finish());
+        let batch = RecordBatch::try_new(schema.clone(), vec![counts]).unwrap();
+
+        // 20-byte page-size limit forces V1 pages to split the
+        // 50-element list across several pages.
+        let props = WriterProperties::builder()
+            .set_compression(Compression::UNCOMPRESSED)
+            .set_data_page_size_limit(20)
+            .build();
+        let mut buf: Vec<u8> = Vec::new();
+        let mut writer = ArrowWriter::try_new(&mut buf, schema, Some(props)).unwrap();
+        writer.write(&batch).unwrap();
+        writer.close().unwrap();
+
+        // Decode all pages and accumulate the lists.
+        let source = InMemorySource::new(buf);
+        let mut reader = StreamingParquetReader::try_open(source, dummy_path())
+            .await
+            .unwrap();
+        let mut decoder = StreamDecoder::new(&mut reader as &mut dyn ColumnPageStream);
+        let mut all_records: Vec<Vec<u64>> = Vec::new();
+        while let Some(dp) = decoder.decode_next_page().await.unwrap() {
+            let list = dp
+                .array
+                .as_any()
+                .downcast_ref::<ListArray>()
+                .expect("counts must decode to ListArray");
+            for i in 0..list.len() {
+                let inner = list.value(i);
+                let u64_arr = inner
+                    .as_any()
+                    .downcast_ref::<UInt64Array>()
+                    .expect("inner must be UInt64Array");
+                all_records.push(u64_arr.values().to_vec());
+            }
+        }
+
+        assert_eq!(
+            all_records.len(),
+            2,
+            "must emit exactly two records, got {}: {all_records:?}",
+            all_records.len()
+        );
+        assert_eq!(
+            all_records[0], row_long,
+            "first record (50 elements) must be preserved intact across page boundaries"
+        );
+        assert_eq!(all_records[1], row_short, "second record must be preserved",);
+    }
+
+    /// `wrap_inner_in_list` dispatches to `LargeListArray` (i64 offsets)
+    /// when the outer field is `LargeList<>`, and to `ListArray` (i32
+    /// offsets) when it's `List<>`. Regression for codex review on
+    /// PR-6407: the builders accept either outer flavour but
+    /// previously always constructed `ListArray`.
+    ///
+    /// Tested at the helper level rather than via parquet round-trip:
+    /// `init_column_state` derives fields from
+    /// `parquet_to_arrow_schema(_, None)`, which only produces
+    /// `List<>` (parquet's native schema doesn't distinguish list
+    /// offset widths). The `LargeList<>` branch is reachable only
+    /// when callers construct fields directly, so we exercise it
+    /// directly.
+    #[test]
+    fn test_wrap_inner_in_list_dispatches_on_outer_flavour() {
+        use arrow::array::UInt64Array;
+
+        let inner_field = Arc::new(ArrowField::new("item", DataType::UInt64, false));
+        let inner_array: ArrayRef = Arc::new(UInt64Array::from(vec![1u64, 2, 3, 42, 100, 200]));
+        // Three rows: [1,2,3], [42], [100,200] → defs/reps that
+        // `list_offsets_from_levels` translates to offsets [0,3,4,6].
+        let defs = vec![1, 1, 1, 1, 1, 1];
+        let reps = vec![0, 1, 1, 0, 0, 1];
+
+        // List<UInt64> path → ListArray with i32 offsets.
+        let list_field = ArrowField::new(
+            "list_field",
+            DataType::List(Arc::clone(&inner_field)),
+            false,
+        );
+        let got_list = wrap_inner_in_list(
+            &list_field,
+            Arc::clone(&inner_field),
+            Arc::clone(&inner_array),
+            &defs,
+            &reps,
+        )
+        .expect("list dispatch");
+        let list_arr = got_list
+            .as_any()
+            .downcast_ref::<ListArray>()
+            .expect("list field must produce ListArray");
+        assert_eq!(list_arr.len(), 3);
+        assert_eq!(list_arr.value(0).len(), 3);
+        assert_eq!(list_arr.value(1).len(), 1);
+        assert_eq!(list_arr.value(2).len(), 2);
+
+        // LargeList<UInt64> path → LargeListArray with i64 offsets.
+        let large_field = ArrowField::new(
+            "large_field",
+            DataType::LargeList(Arc::clone(&inner_field)),
+            false,
+        );
+        let got_large = wrap_inner_in_list(
+            &large_field,
+            Arc::clone(&inner_field),
+            Arc::clone(&inner_array),
+            &defs,
+            &reps,
+        )
+        .expect("large list dispatch");
+        let large_arr = got_large
+            .as_any()
+            .downcast_ref::<LargeListArray>()
+            .expect("LargeList field must produce LargeListArray, not ListArray");
+        assert_eq!(large_arr.len(), 3);
+        assert_eq!(large_arr.value(0).len(), 3);
+        assert_eq!(large_arr.value(1).len(), 1);
+        assert_eq!(large_arr.value(2).len(), 2);
+    }
+
+    /// I/O failures from the page stream surface as
+    /// `PageDecodeError::PageStream(ParquetReadError::Io(_))`.
+    #[tokio::test]
+    async fn test_io_failure_surfaces_as_page_stream_error() {
+        struct FailingBodySource {
+            footer: Bytes,
+            file_size: u64,
+        }
+        #[async_trait]
+        impl RemoteByteSource for FailingBodySource {
+            async fn file_size(&self, _path: &Path) -> io::Result<u64> {
+                Ok(self.file_size)
+            }
+            async fn get_slice(&self, _path: &Path, range: Range<u64>) -> io::Result<Bytes> {
+                if range.start >= self.file_size - self.footer.len() as u64 {
+                    let foot_start = self.file_size - self.footer.len() as u64;
+                    let off = (range.start - foot_start) as usize;
+                    let len = (range.end - range.start) as usize;
+                    return Ok(self.footer.slice(off..off + len));
+                }
+                Err(io::Error::other("simulated body get failure"))
+            }
+            async fn get_slice_stream(
+                &self,
+                _path: &Path,
+                _range: Range<u64>,
+            ) -> io::Result<Box<dyn AsyncRead + Send + Unpin>> {
+                Err(io::Error::other("simulated body stream failure"))
+            }
+        }
+        let batch = make_metrics_batch(16);
+        let bytes = write_parquet(
+            std::slice::from_ref(&batch),
+            None,
+            None,
+            Compression::SNAPPY,
+        );
+        let file_size = bytes.len() as u64;
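+        // `footer` holds the entire file here, so `foot_start` is 0 and
+        // `get_slice` always succeeds; the only failing path is
+        // `get_slice_stream`.
+        let source = Arc::new(FailingBodySource {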
+            footer: Bytes::from(bytes),
+            file_size,
+        });
+        let mut reader = StreamingParquetReader::try_open(source, dummy_path())
+            .await
+            .unwrap();
+        let mut decoder = StreamDecoder::new(&mut reader as &mut dyn ColumnPageStream);
+        let err = decoder.decode_next_page().await.unwrap_err();
+        match err {
+            PageDecodeError::PageStream(ParquetReadError::Io(_)) => {}
+            other => panic!("expected PageStream(Io), got {other:?}"),
+        }
+    }
+}
diff --git a/quickwit/quickwit-parquet-engine/src/storage/streaming_reader.rs b/quickwit/quickwit-parquet-engine/src/storage/streaming_reader.rs
index 50eea40e94f..5001935e13c 100644
--- a/quickwit/quickwit-parquet-engine/src/storage/streaming_reader.rs
+++ b/quickwit/quickwit-parquet-engine/src/storage/streaming_reader.rs
@@ -66,7 +66,7 @@ use tokio::io::{AsyncRead, AsyncReadExt};
 /// no business pulling in). Callers in `quickwit-indexing` provide a
 /// thin adapter that delegates to `quickwit_storage::Storage`.
 #[async_trait]
-pub(crate) trait RemoteByteSource: Send + Sync {
+pub trait RemoteByteSource: Send + Sync {
     /// Total file length in bytes.
     async fn file_size(&self, path: &Path) -> io::Result<u64>;
 
@@ -85,7 +85,7 @@ pub(crate) trait RemoteByteSource: Send + Sync {
 
 /// Configuration for [`StreamingParquetReader`].
 #[derive(Debug, Clone)]
-pub(crate) struct StreamingReaderConfig {
+pub struct StreamingReaderConfig {
     /// Bytes prefetched from the file tail to capture the footer.
     /// Default 256 KiB — sized for a 50 MB metrics split with the
     /// writer config we ship in production.
@@ -107,7 +107,7 @@ impl Default for StreamingReaderConfig {
 
 /// Errors from the streaming reader.
 #[derive(Error, Debug)]
-pub(crate) enum ParquetReadError {
+pub enum ParquetReadError {
     /// I/O error from the underlying [`RemoteByteSource`].
     #[error("io error: {0}")]
     Io(#[from] io::Error),
@@ -155,14 +155,20 @@ pub(crate) enum ParquetReadError {
 
 /// One Parquet page yielded by [`StreamingParquetReader::next_page`].
 ///
-/// Carries the Thrift-decoded `PageHeader` plus the raw compressed
-/// bytes (`bytes.len() == header.compressed_page_size`). Caller can
-/// inspect the page type (`Dictionary` / `DataPage` / `DataPageV2` /
-/// `Index`) via `header.type_`, and either copy `bytes` straight to
-/// an output writer (PR-6's direct page copy) or decompress + decode
-/// for sort-column inspection.
+/// Carries the Thrift-decoded `PageHeader`, the original Thrift-compact
+/// bytes for that header (`header_bytes`), and the raw compressed page
+/// bytes (`bytes.len() == header.compressed_page_size`). Caller can:
+///
+/// - Inspect the page type (`Dictionary` / `DataPage` / `DataPageV2` / `Index`) via `header.type_`.
+/// - Copy `bytes` straight to an output writer (PR-6's direct page copy).
+/// - Reconstruct the original column-chunk byte stream by concatenating `header_bytes ++ bytes` for
+///   every page in storage order — what PR-6a's page decoder uses to feed pages back into the
+///   standard parquet record-batch reader without re-encoding (re-encoding is deterministic for
+///   Thrift compact, but byte-exact passthrough avoids any encoder-version drift inside the
+///   compactor).
+/// - Decompress + decode `bytes` for sort-column inspection.
 #[derive(Debug)]
-pub(crate) struct Page {
+pub struct Page {
     /// Row group this page belongs to.
     pub rg_idx: usize,
     /// Column chunk this page belongs to (within the row group).
@@ -172,17 +178,62 @@ pub(crate) struct Page {
     pub page_idx_in_col: usize,
     /// Thrift-decoded page header.
     pub header: PageHeader,
+    /// Original Thrift-compact bytes for `header`, exactly as they
+    /// appeared on the wire. `header_bytes.len()` equals the number of
+    /// bytes the parser consumed to decode `header`.
+    pub header_bytes: Bytes,
     /// Raw compressed page bytes; length equals
     /// `header.compressed_page_size`. Cheap to clone (ref-counted).
     pub bytes: Bytes,
 }
 
+/// Object-safe page stream — the contract that PR-6's merge engine
+/// consumes for every input file.
+///
+/// Two implementations:
+/// - [`StreamingParquetReader`]: streams pages directly from a remote byte source (the new-format
+///   fast path).
+/// - The legacy adapter (PR-5): buffers a whole file into a `RecordBatch`, re-encodes it as a
+///   single-row-group parquet stream in memory, and presents it through this trait. Used when an
+///   input's row-group boundaries do not align with the sort prefix, i.e.
+///   `qh.rg_partition_prefix_len == 0` AND `num_row_groups > 1`.
+///
+/// # Contract
+/// - [`Self::metadata`] returns the file's parsed metadata. Callable any time after construction;
+///   does not consume the stream.
+/// - [`Self::next_page`] yields pages in storage order: row-group-major,
+///   column-major-within-row-group, page-major-within-column. Returns `Ok(None)` once the file is
+///   fully drained, and stays `Ok(None)` on subsequent calls (idempotent EOF).
+/// - I/O failures surface as [`ParquetReadError::Io`]; they are not masked as decode errors.
+#[async_trait]
+pub trait ColumnPageStream: Send {
+    /// Parsed file metadata. Schema and row-group layout come from
+    /// here; the caller does not need to issue any further I/O to
+    /// inspect them.
+    fn metadata(&self) -> &Arc<ParquetMetaData>;
+
+    /// Read the next page in storage order. Returns `Ok(None)` after
+    /// the last page; further calls continue to return `Ok(None)`.
+    async fn next_page(&mut self) -> Result<Option<Page>, ParquetReadError>;
+}
+
+#[async_trait]
+impl ColumnPageStream for StreamingParquetReader {
+    fn metadata(&self) -> &Arc<ParquetMetaData> {
+        StreamingParquetReader::metadata(self)
+    }
+
+    async fn next_page(&mut self) -> Result<Option<Page>, ParquetReadError> {
+        StreamingParquetReader::next_page(self).await
+    }
+}
+
 /// Page-level streaming Parquet reader.
 ///
 /// See module docs for the contract. Caller must consume pages in
 /// storage order via [`Self::next_page`]; the body stream is forward-
 /// only.
-pub(crate) struct StreamingParquetReader {
+pub struct StreamingParquetReader {
     source: Arc<dyn RemoteByteSource>,
     path: PathBuf,
     file_size: u64,
@@ -385,7 +436,7 @@ async fn read_one_page(
     // protocol. Header is variable-length; iterate until we have
     // enough buffered to parse, capped at `max_page_header_bytes`.
     let header_offset = state.cursor;
-    let (header, header_len) =
+    let (header, header_len, header_bytes) =
         parse_page_header_streaming(state, config.max_page_header_bytes, header_offset).await?;
 
     // Header was consumed from `pending`; `cursor` and `bytes_consumed_in_col`
@@ -423,6 +474,7 @@ async fn read_one_page(
         col_idx,
         page_idx_in_col,
         header,
+        header_bytes,
         bytes: Bytes::from(body_bytes),
     })
 }
@@ -430,12 +482,15 @@ async fn read_one_page(
 /// Read the next Thrift `PageHeader` by trying to decode from
 /// progressively-larger buffer sizes. Drains the consumed bytes from
 /// `state.pending` and advances `state.cursor` and
-/// `state.bytes_consumed_in_col`.
+/// `state.bytes_consumed_in_col`. Returns the parsed header plus the
+/// raw Thrift-compact bytes that backed it, so callers (e.g. the
+/// page-stream → record-batch decoder) can reconstruct the original
+/// column-chunk byte layout without re-encoding.
 async fn parse_page_header_streaming(
     state: &mut ReadingState,
     max_header_bytes: usize,
     file_offset_for_error: u64,
-) -> Result<(PageHeader, usize), ParquetReadError> {
+) -> Result<(PageHeader, usize, Bytes), ParquetReadError> {
     // Start small; grow geometrically up to the configured cap.
     let mut target = 256.min(max_header_bytes);
     loop {
@@ -449,10 +504,10 @@ async fn parse_page_header_streaming(
         fill_pending_best_effort(state, target).await?;
         match try_parse_page_header(&state.pending) {
             Ok((header, consumed)) => {
-                state.pending.drain(..consumed);
+                let header_bytes: Vec<u8> = state.pending.drain(..consumed).collect();
                 state.cursor += consumed as u64;
                 state.bytes_consumed_in_col += consumed as u64;
-                return Ok((header, consumed));
+                return Ok((header, consumed, Bytes::from(header_bytes)));
             }
             Err(thrift_err) => {
                 // Some thrift errors are recoverable by reading more
@@ -1398,4 +1453,70 @@ mod tests {
             }
         }
     }
+
+    // -------- ColumnPageStream trait dispatch --------
+
+    /// Drain a stream behind `&mut dyn ColumnPageStream`. Same contract
+    /// as the concrete-typed `drain_pages`, just exercising the trait
+    /// surface PR-5 and PR-6 will consume.
+    async fn drain_pages_via_trait(stream: &mut dyn ColumnPageStream) -> Vec<Page> {
+        let mut pages = Vec::new();
+        while let Some(p) = stream.next_page().await.unwrap() {
+            pages.push(p);
+        }
+        pages
+    }
+
+    /// `StreamingParquetReader` must satisfy the `ColumnPageStream`
+    /// contract when reached through `&mut dyn ColumnPageStream`.
+    /// Same page count, same storage order, same idempotent EOF as the
+    /// concrete-typed path.
+    #[tokio::test]
+    async fn test_streaming_reader_satisfies_column_page_stream_trait() {
+        let batch = make_metrics_batch(2048);
+        let bytes = write_test_file_multi_page(std::slice::from_ref(&batch), 256);
+
+        // Concrete-typed reference run.
+        let source_concrete = InMemorySource::new(bytes.clone());
+        let mut reader_concrete = StreamingParquetReader::try_open(source_concrete, dummy_path())
+            .await
+            .unwrap();
+        let pages_concrete = drain_pages(&mut reader_concrete).await;
+
+        // Trait-object run.
+        let source_trait = InMemorySource::new(bytes);
+        let reader_trait = StreamingParquetReader::try_open(source_trait, dummy_path())
+            .await
+            .unwrap();
+        // Also exercise `metadata()` through the trait surface and
+        // confirm it agrees with the concrete impl before draining.
+        let trait_metadata_num_rgs = {
+            let stream: &dyn ColumnPageStream = &reader_trait;
+            stream.metadata().num_row_groups()
+        };
+        assert_eq!(
+            trait_metadata_num_rgs,
+            reader_concrete.metadata().num_row_groups(),
+        );
+
+        let mut reader_trait = reader_trait;
+        let pages_trait = drain_pages_via_trait(&mut reader_trait).await;
+
+        // Idempotent EOF through the trait surface — second call after
+        // drain still returns Ok(None).
+        {
+            let stream: &mut dyn ColumnPageStream = &mut reader_trait;
+            assert!(stream.next_page().await.unwrap().is_none());
+        }
+
+        // Same number of pages, same (rg, col, page_idx_in_col) tuple
+        // sequence — i.e., trait dispatch preserves storage order.
+        assert_eq!(pages_concrete.len(), pages_trait.len());
+        for (a, b) in pages_concrete.iter().zip(pages_trait.iter()) {
+            assert_eq!(a.rg_idx, b.rg_idx);
+            assert_eq!(a.col_idx, b.col_idx);
+            assert_eq!(a.page_idx_in_col, b.page_idx_in_col);
+            assert_eq!(a.header.compressed_page_size, b.header.compressed_page_size);
+        }
+    }
 }
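Reviewer note: the `ColumnPageStream` contract in this file (pages in storage order, idempotent EOF, dispatch through `&mut dyn`) can be illustrated without the async machinery. What follows is a minimal synchronous sketch under stated assumptions, not the crate's implementation: `SyncPageStream`, `PageId`, `VecPageStream`, and `drain` are hypothetical names introduced only for this example, while the real trait is async and yields full `Page` values.

```rust
use std::collections::VecDeque;
use std::io;

/// Hypothetical stand-in for `Page`: only the position fields.
#[derive(Debug, Clone, PartialEq)]
pub struct PageId {
    pub rg_idx: usize,
    pub col_idx: usize,
    pub page_idx_in_col: usize,
}

/// Hypothetical synchronous analogue of the page-stream contract.
pub trait SyncPageStream {
    /// Next page in storage order; `Ok(None)` at EOF and on every call after.
    fn next_page(&mut self) -> io::Result<Option<PageId>>;
}

/// In-memory stream that serves a fixed set of pages.
pub struct VecPageStream {
    pages: VecDeque<PageId>,
}

impl VecPageStream {
    pub fn new(mut pages: Vec<PageId>) -> Self {
        // Storage order: row-group-major, then column, then page within column.
        pages.sort_by_key(|p| (p.rg_idx, p.col_idx, p.page_idx_in_col));
        Self { pages: pages.into() }
    }
}

impl SyncPageStream for VecPageStream {
    fn next_page(&mut self) -> io::Result<Option<PageId>> {
        // `pop_front` on an empty deque keeps returning `None`,
        // which gives idempotent EOF for free.
        Ok(self.pages.pop_front())
    }
}

/// Drain a stream through the trait object, as a merge engine would.
pub fn drain(stream: &mut dyn SyncPageStream) -> io::Result<Vec<PageId>> {
    let mut out = Vec::new();
    while let Some(page) = stream.next_page()? {
        out.push(page);
    }
    Ok(out)
}
```

A caller would build a `VecPageStream`, pass `&mut stream` to `drain`, and could then call `next_page` again to observe that EOF is stable. The `?` in `drain` propagates I/O errors unchanged, in the same spirit as the contract's requirement that I/O failures are not masked as decode errors.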
diff --git a/quickwit/quickwit-parquet-engine/src/storage/streaming_writer.rs b/quickwit/quickwit-parquet-engine/src/storage/streaming_writer.rs
index 550c03ce318..df903cacee2 100644
--- a/quickwit/quickwit-parquet-engine/src/storage/streaming_writer.rs
+++ b/quickwit/quickwit-parquet-engine/src/storage/streaming_writer.rs
@@ -153,6 +153,16 @@ impl<W: Write + Send> StreamingParquetWriter<W> {
         })
     }
 
+    /// Append a `KeyValue` entry to the file's footer metadata. Useful
+    /// when the value can only be computed after all row groups have
+    /// been written (e.g., per-output `qh.row_keys` / `qh.zonemap_regexes`
+    /// in the streaming merge, which derive from the rows that
+    /// physically land in this output file). Must be called before
+    /// [`Self::close`] flushes the footer.
+    pub(crate) fn append_key_value_metadata(&mut self, kv: parquet::file::metadata::KeyValue) {
+        self.file_writer.append_key_value_metadata(kv);
+    }
+
     /// Close the file and return its metadata.
     pub(crate) fn close(self) -> Result<ParquetMetaData, ParquetWriteError> {
         Ok(self.file_writer.close()?)
@@ -570,14 +580,15 @@ fn write_array_via_serialized_column_writer(
                 &materialized_ref,
             );
         }
-        // `List<T>` / `LargeList<T>` with non-nullable outer + inner.
-        // The DDSketch `keys` (`List<Int16>`) and `counts`
-        // (`List<UInt64>`) columns are this shape. We compute Dremel
-        // def/rep levels from each input array and write them through
-        // the same `SerializedColumnWriter::write_batch` call the flat
+        // `List<T>` / `LargeList<T>` with non-nullable inner. The outer
+        // may be either nullable (schema-evolution case where the column
+        // is present in only some inputs) or non-nullable (e.g.
+        // DDSketch `keys` / `counts`). We compute Dremel def/rep
+        // levels from each input array and write them through the
+        // same `SerializedColumnWriter::write_batch` call the flat
         // path uses, so memory stays bounded by one in-flight page.
         DataType::List(_) | DataType::LargeList(_) => {
-            write_non_nullable_list_via_serialized_column_writer(col_writer, field, array)?;
+            write_list_via_serialized_column_writer(col_writer, field, array)?;
         }
         // Multi-leaf nested (Struct, Map) and other unsupported types.
         // Single-leaf multi-child Structs are rejected at
@@ -597,37 +608,24 @@ fn write_array_via_serialized_column_writer(
     Ok(())
 }
 
-/// Page-bounded write for `List<T>` / `LargeList<T>` where the outer
-/// field is non-nullable and the inner field is non-nullable. Computes
-/// Dremel def/rep levels (max_def = 1, max_rep = 1) and dispatches the
-/// flat inner values through the same typed `write_batch` call the flat
-/// arms use. Pages flush as the writer's
-/// `data_page_size_limit` / `data_page_row_count_limit` thresholds are
-/// reached — same memory-bound contract as the flat path.
-fn write_non_nullable_list_via_serialized_column_writer(
-    col_writer: &mut parquet::file::writer::SerializedColumnWriter<'_>,
-    field: &arrow::datatypes::Field,
-    array: &arrow::array::ArrayRef,
-) -> Result<(), ParquetWriteError> {
-    use arrow::array::{Array, LargeListArray, ListArray};
-    use arrow::datatypes::DataType;
-
-    if field.is_nullable() {
-        return Err(ParquetWriteError::SchemaValidation(format!(
-            "field '{}' is a nullable List; only non-nullable List is supported on the streaming \
-             write path",
-            field.name(),
-        )));
-    }
-
-    // Resolve inner field + values + per-row offsets uniformly across
-    // List<T> and LargeList<T>. Offsets coerce to i64 so a single
-    // function body handles both representations.
-    let (inner_field, inner_values, offsets): (
-        &arrow::datatypes::Field,
-        &arrow::array::ArrayRef,
+/// Resolve a `ListArray` or `LargeListArray` into a unified
+/// `(inner_field, inner_values, offsets)` triple. Offsets always coerce
+/// to `i64` so the caller doesn't need to branch on `List` vs
+/// `LargeList`.
+fn resolve_list_components<'a>(
+    field: &'a arrow::datatypes::Field,
+    array: &'a arrow::array::ArrayRef,
+) -> Result<
+    (
+        &'a arrow::datatypes::Field,
+        &'a arrow::array::ArrayRef,
         Vec<i64>,
-    ) = match field.data_type() {
+    ),
+    ParquetWriteError,
+> {
+    use arrow::array::{LargeListArray, ListArray};
+    use arrow::datatypes::DataType;
+    match field.data_type() {
         DataType::List(inner_field_ref) => {
             let arr = array.as_any().downcast_ref::<ListArray>().ok_or_else(|| {
                 ParquetWriteError::SchemaValidation(format!(
@@ -636,7 +634,7 @@ fn write_non_nullable_list_via_serialized_column_writer(
                 ))
             })?;
             let offsets: Vec<i64> = arr.value_offsets().iter().map(|&o| o as i64).collect();
-            (inner_field_ref.as_ref(), arr.values(), offsets)
+            Ok((inner_field_ref.as_ref(), arr.values(), offsets))
         }
         DataType::LargeList(inner_field_ref) => {
             let arr = array
@@ -649,14 +647,42 @@ fn write_non_nullable_list_via_serialized_column_writer(
                     ))
                 })?;
             let offsets: Vec<i64> = arr.value_offsets().to_vec();
-            (inner_field_ref.as_ref(), arr.values(), offsets)
-        }
-        other => {
-            return Err(ParquetWriteError::SchemaValidation(format!(
-                "internal: write_non_nullable_list called with non-list type {other:?}",
-            )));
+            Ok((inner_field_ref.as_ref(), arr.values(), offsets))
         }
-    };
+        other => Err(ParquetWriteError::SchemaValidation(format!(
+            "internal: resolve_list_components called with non-list type {other:?}",
+        ))),
+    }
+}
+
+/// Page-bounded write for `List<T>` / `LargeList<T>` with non-nullable
+/// inner element. Handles both nullable and non-nullable outer fields:
+///
+/// - **Non-nullable outer** (e.g. DDSketch `keys` / `counts`): max_def = 1, max_rep = 1.
+///   - Empty list at row → def=0, rep=0, no value.
+///   - N-element list at row → N×(def=1, rep=[0,1,1,…]) plus N values.
+/// - **Nullable outer** (schema-evolution case where the column is missing from some inputs):
+///   max_def = 2, max_rep = 1.
+///   - Outer is null at row → def=0, rep=0, no value.
+///   - Empty list at row → def=1, rep=0, no value.
+///   - N-element list at row → N×(def=2, rep=[0,1,1,…]) plus N values.
+///
+/// Pages flush as the writer's `data_page_size_limit` /
+/// `data_page_row_count_limit` thresholds are reached — same
+/// memory-bound contract as the flat path.
+fn write_list_via_serialized_column_writer(
+    col_writer: &mut parquet::file::writer::SerializedColumnWriter<'_>,
+    field: &arrow::datatypes::Field,
+    array: &arrow::array::ArrayRef,
+) -> Result<(), ParquetWriteError> {
+    use arrow::array::Array;
+
+    let outer_nullable = field.is_nullable();
+
+    // Resolve inner field + values + per-row offsets uniformly across
+    // List<T> and LargeList<T>. Offsets coerce to i64 so a single
+    // function body handles both representations.
+    let (inner_field, inner_values, offsets) = resolve_list_components(field, array)?;
 
     if inner_field.is_nullable() {
         return Err(ParquetWriteError::SchemaValidation(format!(
@@ -666,49 +692,53 @@ fn write_non_nullable_list_via_serialized_column_writer(
         )));
     }
 
-    // Walk per-row to build Dremel levels.
-    //
-    // Path: required outer group → repeated `list` → required `element`.
-    // - max_rep_level = 1 (only `list` is repeated).
-    // - max_def_level = 1 (the repeated `list` group can occur 0 times, which is how parquet
-    //   encodes an empty list; 1 marks "element present").
-    //
-    // Per row:
-    //  - empty list: emit one slot with def = 0, rep = 0, no value
-    //  - list of N elements: emit N slots, def = 1 each, rep = 0 for the first and rep = 1 for the
-    //    rest, plus N values.
+    let empty_list_def: i16 = if outer_nullable { 1 } else { 0 };
+    let element_present_def: i16 = if outer_nullable { 2 } else { 1 };
+
     let num_rows = array.len();
-    let total_present: usize = (0..num_rows)
-        .map(|row| (offsets[row + 1] - offsets[row]).max(0) as usize)
-        .sum();
-    // Each row contributes either 1 level (empty) or list_len levels.
-    let total_levels = (0..num_rows)
-        .map(|row| {
-            let len = (offsets[row + 1] - offsets[row]).max(0) as usize;
-            if len == 0 { 1 } else { len }
-        })
-        .sum::<usize>();
-    let mut def_levels: Vec<i16> = Vec::with_capacity(total_levels);
-    let mut rep_levels: Vec<i16> = Vec::with_capacity(total_levels);
+    // A null outer at row R contributes one def=0 level and no value.
+    // The inner-values gather (in `write_list_inner_values`) skips
+    // null rows entirely via this mask.
+    let null_rows: Option<Vec<bool>> = if outer_nullable {
+        Some((0..num_rows).map(|row| array.is_null(row)).collect())
+    } else {
+        None
+    };
+
+    let mut def_levels: Vec<i16> = Vec::with_capacity(num_rows);
+    let mut rep_levels: Vec<i16> = Vec::with_capacity(num_rows);
+    let mut total_present: usize = 0;
     for row in 0..num_rows {
-        let start = offsets[row];
-        let end = offsets[row + 1];
-        let len = (end - start).max(0) as usize;
-        if len == 0 {
+        let is_null = match null_rows.as_ref() {
+            Some(n) => n[row],
+            None => false,
+        };
+        if is_null {
+            // Null outer: one def=0 level, no value.
             def_levels.push(0);
             rep_levels.push(0);
+            continue;
+        }
+        let len = (offsets[row + 1] - offsets[row]).max(0) as usize;
+        if len == 0 {
+            def_levels.push(empty_list_def);
+            rep_levels.push(0);
         } else {
             for i in 0..len {
-                def_levels.push(1);
+                def_levels.push(element_present_def);
                 rep_levels.push(if i == 0 { 0 } else { 1 });
             }
+            total_present += len;
         }
     }
 
-    // Dispatch the inner primitive through the appropriate typed
-    // writer. Indexing iterates only the present (non-empty-list) rows
-    // — start..end ranges, walked once for the whole array — so we
-    // emit exactly `total_present` values.
+    // The inner-values dispatcher walks `start..end` per row and
+    // gathers into a contiguous vec. To do that against the original
+    // inner_values (which is shared across all rows including the
+    // null ones), we hand it the ORIGINAL per-row offsets plus a
+    // `null_rows` mask. Non-null rows contribute their full range;
+    // null rows are skipped. The dispatcher emits exactly
+    // `total_present` values.
     write_list_inner_values(
         col_writer,
         field,
@@ -718,6 +748,7 @@ fn write_non_nullable_list_via_serialized_column_writer(
         total_present,
         &def_levels,
         &rep_levels,
+        null_rows.as_deref(),
     )
 }
 
@@ -736,6 +767,11 @@ fn write_list_inner_values(
     total_present: usize,
     def_levels: &[i16],
     rep_levels: &[i16],
+    // Present for nullable-outer lists: `null_rows[row]` is true when
+    // row is null on the outer (no inner values to emit). Indexed by
+    // the same row range as `offsets`. None means "no nullable outer"
+    // (all rows are present), so the per-row check is skipped.
+    null_rows: Option<&[bool]>,
 ) -> Result<(), ParquetWriteError> {
     use arrow::array::{
         Float32Array, Float64Array, Int8Array, Int16Array, Int32Array, Int64Array, UInt8Array,
@@ -746,9 +782,19 @@ fn write_list_inner_values(
 
     // Walk the per-row [start, end) ranges once and gather the
     // present-only values into a contiguous Vec for `write_batch`.
+    // Null outer rows are skipped — their inner range is not emitted.
+    let is_null_row = |row: usize| -> bool {
+        match null_rows {
+            Some(n) => n[row],
+            None => false,
+        }
+    };
     let collect_i32 = |extract: &dyn Fn(usize) -> i32| -> Vec<i32> {
         let mut out = Vec::with_capacity(total_present);
         for row in 0..(offsets.len() - 1) {
+            if is_null_row(row) {
+                continue;
+            }
             let start = offsets[row].max(0) as usize;
             let end = offsets[row + 1].max(0) as usize;
             for i in start..end {
@@ -760,6 +806,9 @@ fn write_list_inner_values(
     let collect_i64 = |extract: &dyn Fn(usize) -> i64| -> Vec<i64> {
         let mut out = Vec::with_capacity(total_present);
         for row in 0..(offsets.len() - 1) {
+            if is_null_row(row) {
+                continue;
+            }
             let start = offsets[row].max(0) as usize;
             let end = offsets[row + 1].max(0) as usize;
             for i in start..end {
@@ -851,6 +900,9 @@ fn write_list_inner_values(
                 .unwrap();
             let mut values = Vec::with_capacity(total_present);
             for row in 0..(offsets.len() - 1) {
+                if is_null_row(row) {
+                    continue;
+                }
                 let start = offsets[row].max(0) as usize;
                 let end = offsets[row + 1].max(0) as usize;
                 for i in start..end {
@@ -870,6 +922,9 @@ fn write_list_inner_values(
                 .unwrap();
             let mut values = Vec::with_capacity(total_present);
             for row in 0..(offsets.len() - 1) {
+                if is_null_row(row) {
+                    continue;
+                }
                 let start = offsets[row].max(0) as usize;
                 let end = offsets[row + 1].max(0) as usize;
                 for i in start..end {
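Reviewer note: the level assignment documented on `write_list_via_serialized_column_writer` can be checked in isolation. The sketch below assumes each row is summarized as an `Option<usize>` (`None` for a null outer list, `Some(len)` for a list of `len` elements). `ListLevels` and `list_levels` are hypothetical names for illustration only and are not part of the crate.

```rust
/// Dremel levels for a single-leaf list column (max_rep = 1).
#[derive(Debug, PartialEq)]
pub struct ListLevels {
    pub def_levels: Vec<i16>,
    pub rep_levels: Vec<i16>,
    /// Number of inner values that accompany the levels.
    pub num_values: usize,
}

/// Compute def/rep levels for a list column with a non-nullable inner element.
///
/// Assumption: a `None` row is only legal when `outer_nullable` is true;
/// this sketch panics otherwise.
pub fn list_levels(rows: &[Option<usize>], outer_nullable: bool) -> ListLevels {
    // A nullable outer adds one definition level below "empty list".
    let empty_list_def: i16 = if outer_nullable { 1 } else { 0 };
    let element_present_def: i16 = empty_list_def + 1;

    // Every row contributes at least one level, so `rows.len()` is a lower bound.
    let mut levels = ListLevels {
        def_levels: Vec::with_capacity(rows.len()),
        rep_levels: Vec::with_capacity(rows.len()),
        num_values: 0,
    };
    for row in rows {
        match *row {
            None => {
                assert!(outer_nullable, "null row in a non-nullable list column");
                // Null outer: one def=0 slot, no value.
                levels.def_levels.push(0);
                levels.rep_levels.push(0);
            }
            Some(0) => {
                // Empty list: one slot, no value.
                levels.def_levels.push(empty_list_def);
                levels.rep_levels.push(0);
            }
            Some(len) => {
                // rep=0 starts a new row; rep=1 continues the same list.
                for i in 0..len {
                    levels.def_levels.push(element_present_def);
                    levels.rep_levels.push(if i == 0 { 0 } else { 1 });
                }
                levels.num_values += len;
            }
        }
    }
    levels
}
```

For the nullable case, rows `[Some(2), None, Some(0), Some(1)]` yield five level slots but only three values, which is why the inner-value gather has to skip null rows instead of trusting their offset ranges.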
diff --git a/quickwit/quickwit-parquet-engine/src/storage/writer.rs b/quickwit/quickwit-parquet-engine/src/storage/writer.rs
index 1eeb87992f4..d6315d27f42 100644
--- a/quickwit/quickwit-parquet-engine/src/storage/writer.rs
+++ b/quickwit/quickwit-parquet-engine/src/storage/writer.rs
@@ -533,6 +533,8 @@ impl ParquetWriter {
         writer.write(&prepared.sorted_batch)?;
         let bytes = writer.into_inner()?.into_inner();
 
+        assert_indexing_prefix_alignment(&bytes, split_metadata, &self.sort_fields_string)?;
+
         debug!(bytes_written = bytes.len(), "completed write to bytes");
         Ok((bytes, (prepared.row_keys_proto, prepared.zonemap_regexes)))
     }
@@ -555,6 +557,7 @@ impl ParquetWriter {
         writer.write(&prepared.sorted_batch)?;
 
         let bytes_written = writer.into_inner()?.metadata()?.len();
+        assert_indexing_prefix_alignment_on_file(path, split_metadata, &self.sort_fields_string)?;
         debug!(bytes_written, "completed write to file");
         Ok((
             bytes_written,
@@ -563,6 +566,81 @@ impl ParquetWriter {
     }
 }
 
+/// Verify the just-written parquet bytes have no two row groups
+/// sharing the same composite prefix key. No-op when split metadata
+/// is absent or `rg_partition_prefix_len == 0` (no alignment claim).
+///
+/// Mirrors the read-path check in
+/// `merge::streaming::region_grouping::extract_regions_from_metadata`
+/// and the merge-output check in `streaming::output::finalize_output`
+/// — see `assert_unique_rg_prefix_keys` for the rationale.
+fn assert_indexing_prefix_alignment(
+    bytes: &[u8],
+    split_metadata: Option<&ParquetSplitMetadata>,
+    fallback_sort_fields: &str,
+) -> Result<(), ParquetWriteError> {
+    let Some(meta) = split_metadata else {
+        return Ok(());
+    };
+    if meta.rg_partition_prefix_len == 0 {
+        return Ok(());
+    }
+    let sort_fields = if meta.sort_fields.is_empty() {
+        fallback_sort_fields
+    } else {
+        meta.sort_fields.as_str()
+    };
+    let bytes_owned = bytes::Bytes::copy_from_slice(bytes);
+    let reader = parquet::file::reader::SerializedFileReader::new(bytes_owned).map_err(|e| {
+        ParquetWriteError::SchemaValidation(format!(
+            "post-write prefix alignment re-parse failed: {e}"
+        ))
+    })?;
+    use parquet::file::reader::FileReader;
+    crate::merge::streaming::region_grouping::assert_unique_rg_prefix_keys(
+        reader.metadata(),
+        sort_fields,
+        meta.rg_partition_prefix_len,
+        "indexing write_to_bytes",
+    )
+    .map_err(|e| ParquetWriteError::SchemaValidation(e.to_string()))
+}
+
+/// File-backed counterpart to `assert_indexing_prefix_alignment` —
+/// re-opens the just-written file and verifies the same invariant.
+fn assert_indexing_prefix_alignment_on_file(
+    path: &Path,
+    split_metadata: Option<&ParquetSplitMetadata>,
+    fallback_sort_fields: &str,
+) -> Result<(), ParquetWriteError> {
+    let Some(meta) = split_metadata else {
+        return Ok(());
+    };
+    if meta.rg_partition_prefix_len == 0 {
+        return Ok(());
+    }
+    let sort_fields = if meta.sort_fields.is_empty() {
+        fallback_sort_fields
+    } else {
+        meta.sort_fields.as_str()
+    };
+    let file = File::open(path)?;
+    let reader = parquet::file::reader::SerializedFileReader::new(file).map_err(|e| {
+        ParquetWriteError::SchemaValidation(format!(
+            "post-write prefix alignment re-open failed for {}: {e}",
+            path.display()
+        ))
+    })?;
+    use parquet::file::reader::FileReader;
+    crate::merge::streaming::region_grouping::assert_unique_rg_prefix_keys(
+        reader.metadata(),
+        sort_fields,
+        meta.rg_partition_prefix_len,
+        &format!("indexing write_to_file {}", path.display()),
+    )
+    .map_err(|e| ParquetWriteError::SchemaValidation(e.to_string()))
+}
+
 /// Parse a sort fields string and resolve column names to physical `ParquetField`s.
 ///
 /// Columns not present in the current schema (e.g., `timeseries_id`) are silently
@@ -1499,4 +1577,103 @@ mod tests {
 
         std::fs::remove_file(&path).ok();
     }
+
+    /// Indexing-path strong invariant: when the writer is asked to
+    /// emit a file with `rg_partition_prefix_len > 0`, no two row
+    /// groups may share the same composite prefix key. Forcing a
+    /// small row_group_size on a batch where every row has the same
+    /// metric_name produces multi-RG output with all RGs sharing the
+    /// prefix value — and `write_to_bytes` must reject the result
+    /// rather than land a corrupt file. Mirrors the merge-read-side
+    /// check in `extract_regions_from_metadata`.
+    #[test]
+    fn test_write_to_bytes_rejects_duplicate_rg_prefix_when_claimed_aligned() {
+        use crate::split::{ParquetSplitId, TimeRange};
+
+        let config = ParquetWriterConfig::default().with_row_group_size(3);
+        let writer = ParquetWriter::new(config, &TableConfig::default()).unwrap();
+
+        // 9 rows of cpu.usage at 3 rows per RG → 3 RGs, all with the
+        // same metric_name. With rg_partition_prefix_len = 1 claimed,
+        // every RG shares the same prefix key, so the write must fail.
+        let batch = create_test_batch_with_tags(9, &["service", "env"]);
+        let split_metadata = ParquetSplitMetadata::metrics_builder()
+            .split_id(ParquetSplitId::new("dup-prefix-rejection-test"))
+            .index_uid("test-index:00000000000000000000000000")
+            .time_range(TimeRange::new(100, 200))
+            .sort_fields("metric_name|service|env|timeseries_id|timestamp_secs/V2")
+            .rg_partition_prefix_len(1)
+            .build();
+
+        let err = writer
+            .write_to_bytes(&batch, Some(&split_metadata))
+            .expect_err("must reject duplicate prefix RGs at indexing write");
+        let msg = err.to_string();
+        assert!(
+            msg.contains("shares a prefix key with an earlier row group"),
+            "expected duplicate-prefix error, got: {msg}",
+        );
+        assert!(
+            msg.contains("indexing"),
+            "error should mention the indexing context, got: {msg}",
+        );
+    }
+
+    /// File-backed counterpart: `write_to_file_with_metadata` must
+    /// fire the same check. The output file exists on disk by the
+    /// time we re-parse, but the writer surfaces the violation rather
+    /// than reporting success.
+    #[test]
+    fn test_write_to_file_rejects_duplicate_rg_prefix_when_claimed_aligned() {
+        use crate::split::{ParquetSplitId, TimeRange};
+
+        let config = ParquetWriterConfig::default().with_row_group_size(3);
+        let writer = ParquetWriter::new(config, &TableConfig::default()).unwrap();
+
+        let batch = create_test_batch_with_tags(9, &["service", "env"]);
+        let split_metadata = ParquetSplitMetadata::metrics_builder()
+            .split_id(ParquetSplitId::new("dup-prefix-rejection-file-test"))
+            .index_uid("test-index:00000000000000000000000000")
+            .time_range(TimeRange::new(100, 200))
+            .sort_fields("metric_name|service|env|timeseries_id|timestamp_secs/V2")
+            .rg_partition_prefix_len(1)
+            .build();
+
+        let temp_dir = std::env::temp_dir();
+        let path = temp_dir.join("test_dup_prefix_rejection.parquet");
+        let err = writer
+            .write_to_file_with_metadata(&batch, &path, Some(&split_metadata))
+            .expect_err("must reject duplicate prefix RGs at indexing write_to_file");
+        let msg = err.to_string();
+        // Clean up before asserting so a failed assertion does not leak the file.
+        std::fs::remove_file(&path).ok();
+        assert!(
+            msg.contains("shares a prefix key with an earlier row group"),
+            "expected duplicate-prefix error, got: {msg}",
+        );
+    }
+
+    /// Companion to the duplicate-rejection tests: a single-RG file
+    /// vacuously satisfies any prefix_len claim, so the writer must
+    /// accept it. Assumes 20 rows fit in one default-sized row group.
+    #[test]
+    fn test_write_to_bytes_accepts_single_rg_with_prefix_len_one() {
+        use crate::split::{ParquetSplitId, TimeRange};
+
+        let config = ParquetWriterConfig::default();
+        let writer = ParquetWriter::new(config, &TableConfig::default()).unwrap();
+
+        let batch = create_test_batch_with_tags(20, &["service", "env"]);
+        let split_metadata = ParquetSplitMetadata::metrics_builder()
+            .split_id(ParquetSplitId::new("single-rg-test"))
+            .index_uid("test-index:00000000000000000000000000")
+            .time_range(TimeRange::new(100, 200))
+            .sort_fields("metric_name|service|env|timeseries_id|timestamp_secs/V2")
+            .rg_partition_prefix_len(1)
+            .build();
+
+        writer
+            .write_to_bytes(&batch, Some(&split_metadata))
+            .expect("single-RG must pass the prefix-alignment check");
+    }
 }