
[SS-63][Oneshot Sources] Prevent timely workers from dropping oneshot ingestions without controller commands #35006

Merged
patrickwwbutler merged 3 commits into MaterializeInc:main from
patrickwwbutler:patrick/oneshot-replica-support
Feb 26, 2026

Conversation

@patrickwwbutler
Contributor

@patrickwwbutler patrickwwbutler commented Feb 13, 2026

This fixes a dataflow coordination issue in which the first timely worker begins to process an ingestion, fails, and receives an error that cancels the operation almost immediately. In the dataflow's render_completion_operator, on receiving an error result, the worker exits without exhausting its input. The now-removed code would then drop the ingestion, including that worker's shutdown buttons, effectively stopping scheduling of the dataflow. The other timely workers would then observe the disconnected results channel indefinitely.

By removing this early ingestion drop logic from the worker and allowing the StorageController to drop it from all workers once the ingestion is fully completed, we ensure that the results channel is not disconnected early.
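The ownership discipline the fix establishes can be sketched in a few lines of standalone Rust. This is a hypothetical illustration, not Materialize code: the Command enum, Worker struct, and the string standing in for a shutdown button are all invented. The point is only that the worker retains the dataflow's shutdown token until the controller explicitly cancels, rather than dropping it on local completion or error.

```rust
use std::collections::HashMap;

// Invented names for illustration; the real protocol lives in the
// StorageController and the storage worker command loop.
#[derive(Debug)]
enum Command {
    RunOneshotIngestion(u64),
    CancelOneshotIngestion(u64),
}

struct Worker {
    // Stand-in for the per-dataflow shutdown button: as long as the
    // entry exists, the dataflow keeps being scheduled.
    active: HashMap<u64, String>,
}

impl Worker {
    fn handle(&mut self, cmd: Command) {
        match cmd {
            Command::RunOneshotIngestion(id) => {
                self.active.insert(id, format!("shutdown-button-{id}"));
            }
            // Only an explicit controller command drops the token;
            // a local error or early completion leaves it in place.
            Command::CancelOneshotIngestion(id) => {
                self.active.remove(&id);
            }
        }
    }
}

fn main() {
    let mut worker = Worker { active: HashMap::new() };
    worker.handle(Command::RunOneshotIngestion(42));
    // A local failure no longer tears down the dataflow...
    assert!(worker.active.contains_key(&42));
    // ...only the controller's cancel does.
    worker.handle(Command::CancelOneshotIngestion(42));
    assert!(worker.active.is_empty());
}
```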

Verification

This now passes the copy_from_s3_minio.td test with multiple replicas and larger cluster sizes.

@github-actions

github-actions bot commented Feb 13, 2026

Pre-merge checklist

  • The PR title is descriptive and will make sense in the git log.
  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).

@patrickwwbutler patrickwwbutler requested a review from a team February 13, 2026 16:19
@patrickwwbutler
Contributor Author

Ran a nightly build using trigger-ci containing tests with multiple replicas and larger cluster sizes.

@patrickwwbutler patrickwwbutler force-pushed the patrick/oneshot-replica-support branch from ddcfaba to 9b38777 Compare February 19, 2026 18:29
@patrickwwbutler patrickwwbutler requested review from a team, ggevay and teskje as code owners February 19, 2026 19:27
@patrickwwbutler patrickwwbutler force-pushed the patrick/oneshot-replica-support branch 2 times, most recently from 9b38777 to c5136e3 Compare February 19, 2026 19:41
Contributor

@martykulma martykulma left a comment

It looks like there's an additional cleanup needed in process_oneshot_ingestion, but looks great otherwise!

Member

@antiguru antiguru left a comment

I am not sure this is the right fix. Looking at #31136, which introduced the cancel protocol for one-shot ingestions, it has a design deficit in that its reconciliation will be racy, so I'm doubtful that this fix will cover all cases. Commands that render a dataflow need to be executed on all workers; there is no way to avoid this. Even if a worker receives a cancel of a one-shot ingestion before it rendered it, it'll need to render, because otherwise the Timely infrastructure will stop working.

Member

@antiguru antiguru left a comment

(Requesting changes to avoid merging as-is.)

@patrickwwbutler
Contributor Author

patrickwwbutler commented Feb 20, 2026

I am not sure this is the right fix. Looking at #31136 which introduced the cancel protocol for one-shot ingestions, it has a design deficit that its reconciliation will be racy, so I'm doubtful that this fix will cover all cases.

@antiguru Could you clarify this design deficit for me? It appears there was some back and forth, but ultimately it was decided to go for option 1 from Jan's comment.

It seems to be around clusters dropping with pending oneshots, but I'm not sure what the code we're deleting here would have to do with that? I think that the adapter sends a cancel command to the storage controller, which in turn sends cancel commands to the workers. Is the issue that the workers in the cluster may be dropped before receiving that cancel command? If so, I don't think this will actually affect that reconciliation process, as the workers that would have been dropped by this deleted code have already finished. Do you think this change would actually cause a regression, or just not fully address cleanup issues?

Commands that render a dataflow need to be executed on all workers, there is no way to avoid this. Even if a worker receives a cancel of a one-shot ingestion before it rendered it, it'll need to render because otherwise the Timely infrastructure will stop working.

My observation was that the StorageController would not actually send the CancelOneshotIngestion command until at least some number of the timely workers had rendered and completed (I believe related to some measure of "progress" updates?), which is actually what was causing this to hang in the first place (only one worker was finished and the rest were waiting around forever for progress from the finished worker).

What this change does is simply prevent individual workers from dropping their dataflows too early, and instead wait to be told to do so once the controller has received a sufficient result.

@antiguru
Member

I typed up something and then realized this only applies during reconciliation. This is not the case here, right? I.e., we don't have envd reconnect, correct?

What I'd like to understand is why removing the code in your PR fixes the problem. Looking at the implementation, if we receive a oneshot ingestion instruction, we always render a render_completion_operator at the end, which holds on to the tx endpoint of a channel to send along a batch. Maybe the line maybe_payload = Some(result.map_err(|e| e.to_string())?); silently swallows the storage error?

@antiguru antiguru self-requested a review February 20, 2026 14:43
@patrickwwbutler
Contributor Author

I typed up something and then realized this only applies during reconciliation. This is not the case here, right? I.e., we don't have envd reconnect, correct?

Yes, this is not a reconciliation-related change; it applies when a cluster with multiple workers hits an error early while processing the data and exits quickly. It's a bit unclear, but the understanding @DAlperin and I have is that by dropping the ingestion from the first worker (worker_index=0) when it errors out quickly, we trigger the dataflow's shutdown button, but the other workers have not finished rendering, let alone processing the dataflow, so they never receive progress messages from the first worker.

Basically we believe the early trigger on the shutdown button is causing the issue.

@antiguru
Member

antiguru commented Feb 20, 2026

we trigger the dataflow's shutdown button

The shutdown button should wait for the other workers to catch up. If it doesn't this might be a bug. (If using the Mz-internal one!)

@patrickwwbutler
Contributor Author

we trigger the dataflow's shutdown button

The shutdown button should wait for the other workers to catch up. If it doesn't this might be a bug. (If using the Mz-internal one!)

Possibly it is not waiting because the other workers haven't finished rendering in this case? I believe they are still waiting for the InternalStorageCommand that tells them to actually create the ingestion at this point, because the command is sent out by worker 0, which then begins rendering the dataflow significantly faster and errors out almost immediately.

@patrickwwbutler patrickwwbutler force-pushed the patrick/oneshot-replica-support branch from c5136e3 to 1f8b8ff Compare February 20, 2026 16:33
@martykulma
Contributor

Alternatively, this could publish StorageCommand::CancelOneshotIngestion via internal_cmd_tx for each id in to_remove.

@patrickwwbutler patrickwwbutler force-pushed the patrick/oneshot-replica-support branch from 1f8b8ff to ec50e78 Compare February 23, 2026 17:00
@patrickwwbutler
Contributor Author

Alternatively, this could publish StorageCommand::CancelOneshotIngestion via internal_cmd_tx for each id in to_remove.

It doesn't seem like that should be necessary; is there a good reason not to just let the controller send the cancellation commands? I feel like simplifying the logic as much as possible here is ideal.

@petrosagg
Contributor

I don't think the problem is in how we use the buttons. The buttons work correctly regardless of render/drop timing, in that when eventually all workers press their buttons the dataflow shuts down.

I looked a bit at the oneshot ingestion code, and it seems to me that the operator rendered by render_completion_operator is the only thing that can cause a worker to observe a disconnected results channel. But that operator waits for its input to be exhausted (i.e., reach the empty frontier) before sending a value down the channel and then exiting, dropping its tx side. Observing an empty frontier on its input means that all other workers have also rendered their oneshot dataflow and made progress, which goes against the theory that this is caused by one worker failing quickly before the other workers have rendered their part.

I think we need a better grasp of what is causing the hang, happy to hop on a call to discuss this

@DAlperin
Member

Might be worth throwing some logs on progress messages/probes around to observe the blocking behavior more explicitly

@petrosagg
Contributor

I looked at this with a fresh set of eyes today and found what's going on.

But that operator waits for its input to be exhausted (i.e reach the empty frontier) before sending a value down the channel and then exiting, dropping its tx side.

This turns out not to be true! Do you see it?

https://github.com/MaterializeInc/materialize/blob/main/src/storage-operators/src/oneshot_source.rs#L641-L654

There is a try (?) operator in the assignment to maybe_payload, which means that the completion operator does not actually wait for its input to reach the empty frontier in the case of an error, and immediately disconnects the channel! This then leads to the worker dropping the shutdown button, which effectively stops scheduling the dataflow. The only way out from here is if all other workers also drop their buttons, but that won't happen unless the oneshot ingestion is cancelled by the user or the other workers complete, which cannot happen since the worker that observed the error has stopped being scheduled.
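The pitfall reduces to a minimal, standalone Rust sketch. This is not the Materialize code; forward_all and the mpsc channel setup are invented for illustration. It shows how a `?` inside a loop short-circuits the whole function on the first Err, dropping the sender before the remaining input is drained, so the receiver observes a disconnected channel early.

```rust
use std::sync::mpsc;

// Forward values to the channel; the `?` on `item?` returns on the
// first error, dropping `tx` before the loop finishes -- the same
// shape as the early exit in the completion operator.
fn forward_all(inputs: Vec<Result<i32, String>>, tx: mpsc::Sender<i32>) -> Result<(), String> {
    for item in inputs {
        let value = item?;
        tx.send(value).map_err(|e| e.to_string())?;
    }
    Ok(())
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let result = forward_all(vec![Ok(1), Err("boom".into()), Ok(2)], tx);
    assert!(result.is_err());
    // Only the value before the error made it through; the channel is
    // disconnected even though input remained.
    let received: Vec<i32> = rx.iter().collect();
    assert_eq!(received, vec![1]);
}
```

The diff below avoids this by accumulating the error in maybe_payload and continuing to drain the input, only reporting the result once the input frontier is exhausted.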

So another way to fix the problem would be to change the completion operator according to this diff:

     builder.build(move |_| async move {
         let result = async move {
-            let mut maybe_payload: Option<ProtoBatch> = None;
+            let mut maybe_payload: Result<Option<ProtoBatch>, String> = Ok(None);

             while let Some(event) = results_input.next().await {
                 if let AsyncEvent::Data(_cap, results) = event {
@@ -645,15 +645,15 @@ pub fn render_completion_operator<G, F>(
                         .expect("only 1 event on the result stream");

                     // TODO(cf2): Lift this restriction.
-                    if maybe_payload.is_some() {
+                    if !matches!(maybe_payload, Ok(None)) {
                         panic!("expected only one batch!");
                     }

-                    maybe_payload = Some(result.map_err(|e| e.to_string())?);
+                    maybe_payload = result.map(Some).map_err(|e| e.to_string());
                 }
             }

-            Ok(maybe_payload)
+            maybe_payload
         }
         .await;

That said, I think the PR as-is is the preferred route. We don't want clusters making independent decisions about which dataflows get cancelled; that is the controller's job. This way we are also free to write any operator logic we want, including early exits, without worrying about deadlocks.

In this light, I think we should merge the PR, but we should update the PR description to reflect this analysis so that it gets recorded in the git log.

@patrickwwbutler patrickwwbutler force-pushed the patrick/oneshot-replica-support branch from ec50e78 to 32bf981 Compare February 25, 2026 21:16
@patrickwwbutler patrickwwbutler changed the title [SS-63][Oneshot Sources] Fix race condition in Oneshot Ingestion related to multiple Timely Workers [SS-63][Oneshot Sources] Prevent timely workers from dropping ingestions without controller commands Feb 25, 2026
@patrickwwbutler patrickwwbutler dismissed antiguru’s stale review February 26, 2026 15:10

Concerns were addressed after discussion with Petros, and Moritz confirmed that he's okay with merging now.

@patrickwwbutler patrickwwbutler merged commit a52646c into MaterializeInc:main Feb 26, 2026
135 checks passed
@patrickwwbutler patrickwwbutler changed the title [SS-63][Oneshot Sources] Prevent timely workers from dropping ingestions without controller commands [SS-63][Oneshot Sources] Prevent timely workers from dropping oneshot ingestions without controller commands Feb 26, 2026