refactor(dataset-raw): simplify RawTableWriter logic#2089

Open
sistemd wants to merge 12 commits into main from sistemd/raw-dataset-writer-refactor
Conversation

Contributor

@sistemd sistemd commented Apr 6, 2026

Motivation

While syncing the Solana chain, I encountered this panic when resuming a stopped stream:

worker-1  | 2026-03-27T00:44:08.252459Z ERROR worker::service: job failed with fatal error node_id=dump job_id=1 error=Failed to materialize raw dataset error_source=["Partition task failed", "task 236 panicked with message \"called `Option::unwrap()` on a `None` value\""]

with the panicking unwrap being this one.

A failing scenario

The panic can only happen with providers that stream non-contiguous blocks (e.g. skipped slots in Solana). Here's a scenario in which it occurs (slot numbers are made up for the purposes of the example):

  • Sync runs on range [0..=10]
  • Slot 0 is skipped (no block is streamed for it)
  • Slots [1..=10] are streamed and the stream ends
  • Sync runs on range [0..=10] again, only slot 0 is missing, range is narrowed to [0..=0]
  • First streamed slot is slot 1, it is outside of the current segment (range [0..=0])
  • Segment needs to be closed, panic happens

Cause

The RawTableWriter implementation opens files eagerly for each segment: once upon creation, and then once for each new range or after a flush (exceeded size, time limit, etc.). In those scenarios, the current_file field is set to Some but nothing has been written into the file yet, and the current_range field is still None (because it is only updated after a write).

This works when all streamed blocks are contiguous because, eventually, the block stream will reach the current segment's range and something will be written into the file (current_range will be set to Some). However, for providers that stream non-contiguous blocks, that is not necessarily true (as described in the scenario above).
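The failure mode above can be sketched in a few lines. This is a deliberately simplified model, not the actual RawTableWriter: the file handle is reduced to a String and the range to a tuple, and only the `current_file`/`current_range` field names follow the PR description.

```rust
// Simplified sketch of the eager-open state machine described above.
struct RawTableWriter {
    current_file: Option<String>,      // set eagerly when a segment opens
    current_range: Option<(u64, u64)>, // set only after the first write
}

impl RawTableWriter {
    fn open_segment(&mut self, name: &str) {
        // Eager open: the file exists before anything is written into it,
        // while `current_range` stays None.
        self.current_file = Some(name.to_string());
    }

    fn close_segment(&mut self) -> (u64, u64) {
        // If the stream never reached this segment's range, nothing was
        // written and this unwrap panics with
        // "called `Option::unwrap()` on a `None` value".
        self.current_range.unwrap()
    }
}

fn main() {
    let mut writer = RawTableWriter { current_file: None, current_range: None };
    writer.open_segment("segment-0000");
    // Simulate the scenario: the first streamed slot (1) lies outside the
    // narrowed segment range [0..=0], so the segment must close unwritten.
    let result = std::panic::catch_unwind(std::panic::AssertUnwindSafe(|| {
        writer.close_segment()
    }));
    assert!(result.is_err()); // closing the empty segment panics
    println!("close panicked as described: {}", result.is_err());
}
```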

Solution

The main problem with the current approach is that segments are opened eagerly, which leaves a time window in which an empty segment exists. Therefore, the solution presented in this PR is to open segments lazily, whenever there is something to write into them. In addition, the PR introduces a RawTableSegment abstraction that encapsulates the segment file and block range in a single field, instead of keeping them as two separate fields on the RawTableWriter.
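A minimal sketch of the lazy-open idea, again with illustrative types rather than the real Parquet-backed ones: the segment is created together with its first write, so the file and a valid range always exist as a unit and an empty segment can never be observed.

```rust
// Sketch: the segment bundles file and range into one value, mirroring the
// `RawTableSegment` abstraction described above (names and types simplified).
struct RawTableSegment {
    file: String,      // stand-in for the real segment file handle
    range: (u64, u64), // always valid: initialized from the first write
}

struct RawTableWriter {
    segment: Option<RawTableSegment>, // one field instead of two
}

impl RawTableWriter {
    fn write(&mut self, block: u64) {
        // Lazy open: the segment comes into existence together with its
        // first row, so there is no window where an empty file exists.
        let seg = self.segment.get_or_insert_with(|| RawTableSegment {
            file: format!("segment-{block:04}"),
            range: (block, block),
        });
        seg.range.1 = block; // extend the range as rows land
    }

    fn close(&mut self) -> Option<(u64, u64)> {
        // Closing with nothing written is now a no-op, not a panic.
        self.segment.take().map(|s| s.range)
    }
}

fn main() {
    let mut w = RawTableWriter { segment: None };
    assert_eq!(w.close(), None); // nothing written: safe close
    w.write(1);
    w.write(2);
    assert_eq!(w.close(), Some((1, 2)));
    println!("lazy segment lifecycle ok");
}
```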

Collaborator

@leoyvens leoyvens left a comment


The changes to RawTableWriter seem like an improvement, even if we're missing an integration-level test of this situation being fixed.

The change to the higher level loop is something we need to further discuss.

?missing_dataset_ranges,
"terminating infinite materialize loop"
);
break;
Collaborator


This break is very suspicious. This loop is meant to be infinite, breaking would entirely stop the extraction procedure. Are you entirely sure you need this?

Contributor Author


I needed something to prevent busy loops in the ongoing Solana sync without manual intervention on my part. But I agree, it is probably too much of a hack. What do you think about this (at least until a better solution lands, perhaps the one you suggested): d78db51

Comment on lines +147 to +156
async fn write(&mut self, table_rows: &TableRows) -> Result<(), ParquetError> {
assert_eq!(&self.range.network, &table_rows.range.network);
self.file.write(&table_rows.rows).await?;

let incoming_range = &table_rows.range;
self.range.numbers = self.range.start()..=incoming_range.end();
self.range.hash = incoming_range.hash;
self.range.timestamp = incoming_range.timestamp;

Ok(())
Collaborator


This is a good callout by the AI. Reconstructing the BlockRange is the more robust pattern, if a new field is added to BlockRange that will generate a compiler error, preventing the new field from silently introducing a bug here.
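The reconstruction pattern the reviewer suggests can be illustrated as below. The BlockRange fields here are inferred from the snippet above (network, numbers, hash, timestamp) and the concrete types are assumptions; the point is the struct literal with no `..` rest pattern, which turns a newly added field into a compile error at this site instead of a silently stale value.

```rust
use std::ops::RangeInclusive;

// Field set inferred from the reviewed snippet; types are illustrative.
#[derive(Clone, Debug, PartialEq)]
struct BlockRange {
    network: String,
    numbers: RangeInclusive<u64>,
    hash: [u8; 32],
    timestamp: u64,
}

fn extend_range(current: &BlockRange, incoming: &BlockRange) -> BlockRange {
    assert_eq!(current.network, incoming.network);
    // Reconstruct rather than mutate field by field. Because there is no
    // `..` rest pattern, adding a field to `BlockRange` fails to compile
    // here, forcing a conscious decision about how to merge it.
    BlockRange {
        network: current.network.clone(),
        numbers: *current.numbers.start()..=*incoming.numbers.end(),
        hash: incoming.hash,
        timestamp: incoming.timestamp,
    }
}

fn main() {
    let current = BlockRange {
        network: "solana".into(), numbers: 0..=5, hash: [0; 32], timestamp: 100,
    };
    let incoming = BlockRange {
        network: "solana".into(), numbers: 6..=10, hash: [1; 32], timestamp: 200,
    };
    let merged = extend_range(&current, &incoming);
    assert_eq!(merged.numbers, 0..=10);
    assert_eq!(merged.timestamp, 200);
    println!("merged range: {:?}", merged.numbers);
}
```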

Collaborator

leoyvens commented Apr 6, 2026

This sequence you describe is very helpful to understand the problem:

Slot 0 is skipped (no block is streamed for it)
Slots [1..=10] are streamed and the stream ends
Sync runs on range [0..=10] again, only slot 0 is missing, range is narrowed to [0..=0]

It jumps out that we would run 0 again, when we should know it has been skipped. Could we change fn missing_ranges to detect known-skipped numbers?
Given that function has the segment list, it could use prev_hash relationships to know which block numbers were skipped between two segments. We probably shouldn't rely on this being perfect but as a best effort could help avoid busy loops.

cc @Theodus who worked on supporting missing block numbers.
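The prev_hash idea sketched above could look something like this. All names here are hypothetical (this is not the actual fn missing_ranges signature), and hashes are reduced to u64 for brevity: if one segment's first block links directly to the previous segment's last block, the block numbers between them were skipped on chain and need not be re-requested.

```rust
use std::ops::RangeInclusive;

// Hypothetical segment metadata; real segments would carry block hashes.
struct Segment {
    first_number: u64,
    last_number: u64,
    first_prev_hash: u64, // hash of the block this segment builds on
    last_hash: u64,
}

// Best-effort detection of known-skipped numbers between adjacent segments.
fn known_skipped(segments: &[Segment]) -> Vec<RangeInclusive<u64>> {
    segments
        .windows(2)
        .filter_map(|pair| {
            let (a, b) = (&pair[0], &pair[1]);
            // A numeric gap whose endpoints are hash-linked means every
            // number in between was skipped (e.g. Solana skipped slots).
            (b.first_number > a.last_number + 1 && b.first_prev_hash == a.last_hash)
                .then(|| a.last_number + 1..=b.first_number - 1)
        })
        .collect()
}

fn main() {
    let segments = [
        Segment { first_number: 1, last_number: 10, first_prev_hash: 0, last_hash: 7 },
        Segment { first_number: 12, last_number: 20, first_prev_hash: 7, last_hash: 9 },
    ];
    // Slot 11 sits in a hash-linked gap, so it is known to be skipped.
    assert_eq!(known_skipped(&segments), vec![11..=11]);
    println!("known skipped: {:?}", known_skipped(&segments));
}
```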

@sistemd sistemd force-pushed the sistemd/raw-dataset-writer-refactor branch 2 times, most recently from af1977f to f8a57c5 on April 7, 2026 at 21:51
@sistemd sistemd requested a review from leoyvens on April 7, 2026 at 21:57
@sistemd sistemd force-pushed the sistemd/raw-dataset-writer-refactor branch from bca979d to 7090a30 on April 9, 2026 at 08:49
sistemd added 12 commits April 9, 2026 11:35
Adds a `RawTableSegment` struct to encapsulate data related to a segment
that is being written to.

Changes the `RawTableWriter` to open segments lazily, whenever they are
about to be written to. This ensures that no empty segments (i.e. files)
will end up on disk.

Refactors the `RawTableWriter` write and close logic to have less branching, which should make it easier to follow.
Adds a break condition to the materialize loop to prevent it from running forever. The issue only manifests with providers that stream non-contiguous blocks (e.g. skipped slots in Solana).
Adds an integration test which ensures that the raw dataset writer
correctly handles skipped segments (segments where no blocks were
streamed for the entire range). Based on the Solana extractor, since it
is the only one that can skip blocks (i.e. slots).
Instead of breaking out of the materialize loop (which is supposed to be
infinite), keep track of unreachable ranges and filter them out before
running each materialize iteration.

This allows the loop to keep tracking the latest block while preventing it from going infinite due to skipped (non-existent) ranges.
@sistemd sistemd force-pushed the sistemd/raw-dataset-writer-refactor branch from 7090a30 to 9f6a487 on April 9, 2026 at 09:36
// Find the union of missing table block ranges. Then subtract ranges that
// were missing in the previous iteration as well. These are segments that
// can never be materialized because they do not exist (e.g. skipped slot
// ranges in Solana).
Collaborator


So the assumption is: A range that is passed in to materialize_ranges but remains missing after it runs must be missing from the chain. Seems reasonable. But then I think we should move the logic to after running materialize_ranges, this would be more direct.
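The subtraction the code comment describes can be reduced to a set operation. This is an illustrative sketch with plain u64 block numbers in HashSets, not the actual range types; the helper name is made up.

```rust
use std::collections::HashSet;

// Sketch: ranges that were passed to a materialize pass and are still
// missing afterwards are assumed absent from the chain (e.g. skipped slot
// ranges in Solana) and are filtered out of the next iteration, which
// prevents the busy loop discussed in this thread.
fn unreachable_after_pass(
    missing_before: &HashSet<u64>,
    missing_after: &HashSet<u64>,
) -> HashSet<u64> {
    missing_before.intersection(missing_after).copied().collect()
}

fn main() {
    // Blocks 0, 5, 6 were requested; after the pass, 0 is still missing
    // (unreachable), while 9 is newly missing and should still be retried.
    let before: HashSet<u64> = [0, 5, 6].into_iter().collect();
    let after: HashSet<u64> = [0, 9].into_iter().collect();
    let unreachable = unreachable_after_pass(&before, &after);
    assert!(unreachable.contains(&0));
    assert_eq!(unreachable.len(), 1);
    println!("unreachable: {:?}", unreachable);
}
```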
