Skip to content

fix(workflows): in-place updates, deletion-forgery fix, webhook-secret & channel immutability#1340

Open
AaronGoldsmith wants to merge 4 commits into
block:mainfrom
AaronGoldsmith:bugfix/workflow-update-delete
Open

fix(workflows): in-place updates, deletion-forgery fix, webhook-secret & channel immutability#1340
AaronGoldsmith wants to merge 4 commits into
block:mainfrom
AaronGoldsmith:bugfix/workflow-update-delete

Conversation

@AaronGoldsmith

@AaronGoldsmith AaronGoldsmith commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

Summary

Workflows that were edited or deleted kept running concurrently. The root cause: the relay called create_workflow, which generated a random DB primary key on every publish, so each edit inserted a new active row (the old one kept firing) and client-side deletions never matched a row.

This PR makes the client-supplied UUID (the Nostr parameterized-replaceable-event d tag) the stable primary key, so edits update in place and deletions resolve correctly. It also fixes three security/stability issues found in review:

  1. In-place updateshandle_workflow_def resolves the existing row by UUID and calls update_workflow instead of unconditionally inserting. No more duplicate concurrent cron jobs.
  2. Deletion forgery (security) — NIP-09 a-tag deletion previously only checked the actor matched the coordinate pubkey, then deleted by UUID without confirming the DB row's owner. A forged 30620:<attacker>:<victim_uuid> could delete another user's workflow. Now the row is loaded and owner_pubkey is verified before deleting.
  3. Webhook secret preservation — editing a webhook workflow regenerated its _webhook_secret, breaking existing callers. The existing secret is now extracted from the stored row and re-injected on update.
  4. Channel immutability & info-leak prevention — channel membership is checked before any DB lookup (so unauthorized users can't probe for valid UUIDs via differing errors), and record.channel_id != Some(channel_id) rejects moving a workflow between channels, including the legacy NULL-channel case.
  5. NIP-33 stale-write protection on the command path — workflow-def events are routed to the command executor and never reached replace_parameterized_event, so an older event delivered after a newer one (reconnect/retry) could overwrite the row with stale YAML. The lock + (created_at, id) dominance check + soft-delete is now extracted into a shared buzz_db::nip33_stale_write_guard, called by both replace_parameterized_event and persist_command_event so the two paths can't drift. Dominated events become idempotent duplicates and skip the mutation.

Changes

  • buzz-db/src/workflow.rscreate_workflow now takes id: Uuid instead of generating one internally.
  • buzz-db/src/lib.rs — extracted nip33_stale_write_guard(&mut conn, …); replace_parameterized_event now calls it (behavior unchanged) so the command path can share it.
  • buzz-relay/src/handlers/command_executor.rshandle_workflow_def parses the d-tag UUID, looks up any existing row (after the membership gate), enforces owner + channel-immutability, preserves the webhook secret, and routes to update_workflow vs create_workflow. persist_command_event applies the shared NIP-33 stale-write guard for parameterized-replaceable command kinds.
  • buzz-relay/src/handlers/side_effects.rshandle_a_tag_deletion loads the workflow and asserts owner_pubkey before deleting; missing rows are logged and treated as a no-op.
  • buzz-test-client/tests/e2e_workflows.rstest_workflow_update_and_delete expanded to assert in-place update, webhook-secret stability, channel-change rejection, and deletion-forgery rejection (asserting the row persists).

Verification

cargo fmt --check                       # clean
cargo clippy (relay, db, test-client)   # no warnings
just test-unit                          # all passed

# Integration (against a relay built from THIS branch):
BUZZ_BIND_ADDR=0.0.0.0:3001 RELAY_URL=ws://localhost:3001 \
  BUZZ_HEALTH_PORT=8081 BUZZ_METRICS_PORT=9103 cargo run -p buzz-relay
RELAY_URL=ws://localhost:3001 \
  cargo test -p buzz-test-client --test e2e_workflows \
  test_workflow_update_and_delete -- --ignored
# test test_workflow_update_and_delete ... ok

Manual before/after: a 1-minute cron edited to v2/v3 previously left multiple duplicate jobs firing and could not be deleted; after the fix only the latest definition fires and a NIP-09 deletion stops it immediately.

Reviewer note — pre-existing duplication left out of scope

While extracting nip33_stale_write_guard, I deduped the two NIP-33 replacement copies the reviewer flagged (replace_parameterized_event + the new command path). A third copy of the FNV-1a advisory-lock-key snippet still lives in Db::replace_addressable_event (relay-signed NIP-29 group metadata, kinds 39000–39002). It predates this PR (present on main) and was intentionally left untouched here:

  • It keys on channel_id (the relay is the author, so pubkey isn't the natural key), so it isn't a drop-in for the same helper.
  • Folding it in would widen this PR's blast radius beyond the workflow fix.

Flagging it as a good standalone follow-up (factor the lock-key into a shared fnv_coordinate_lock_key and have all three call it).

…ent UUID

Signed-off-by: Aaron Goldsmith <aargoldsmith@gmail.com>
…nd enforce channel immutability on updates

Signed-off-by: Aaron Goldsmith <aargoldsmith@gmail.com>
…hannel_id update scenario

Signed-off-by: Aaron Goldsmith <aargoldsmith@gmail.com>

@AaronGoldsmith AaronGoldsmith left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Self-review: flagging the non-obvious / higher-blast-radius bits so they're easy to scrutinize.

match state.db.get_workflow(wf_id).await {
Ok(wf) => {
if wf.owner_pubkey != owner_bytes {
return Err(anyhow::anyhow!("forbidden: deletion owner mismatch"));

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security (deletion forgery). NIP-09 a-tag deletion previously only checked the actor matched the coordinate pubkey, then deleted by UUID. A forged 30620:<attacker>:<victim_uuid> could delete another user's workflow. We now load the row and assert owner_pubkey before deleting. This extra lookup is the ownership gate, not a redundant query.

}

// Check if workflow already exists to perform update or create checks
let existing = state.db.get_workflow(workflow_id).await;

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔍 Ordering is load-bearing (info-leak). The is_member_cached gate above (line 580) runs before this lookup on purpose, so unauthorized users can't probe for valid workflow UUIDs via differing error responses. Please don't hoist this lookup earlier for efficiency — it silently reintroduces the leak.

"forbidden: cannot update a workflow owned by another user".into(),
));
}
if record.channel_id != Some(channel_id) {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⛓️ Channel immutability + legacy NULL. record.channel_id != Some(channel_id) is deliberate: it rejects both some→some channel moves and the legacy NULL→some case. A naive unwrap/compare would let legacy global workflows be silently re-scoped.

let secret = webhook_secret::generate_webhook_secret();
webhook_secret::inject_secret(&mut definition_json, &secret);
Some(secret)
let existing_secret = existing_record

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔑 Webhook secret preservation. On update we re-inject the existing secret from the stored row (returning None so it isn't rotated/re-surfaced). Previously every edit regenerated _webhook_secret, breaking existing webhook callers.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9e34ed4b9e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +657 to +661
state
.db
.update_workflow(workflow_id, &workflow_name, &definition_json_final, &hash)
.await
.map_err(|e| IngestError::Internal(format!("error: db update_workflow: {e}")))?;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Ignore stale workflow definition replays

When an older kind:30620 event for the same workflow is delivered after a newer one (for example after a reconnect/retry, while still inside the relay's timestamp window), this unconditional update_workflow overwrites the workflows table with the stale YAML. The command path uses persist_command_event rather than the NIP-33 replace_parameterized_event stale-write check, so clients will still see the newer event as the latest while the workflow engine reads and runs the older definition from the DB. Please compare the incoming event's (created_at, id) against the current live coordinate before applying the update, or route workflow defs through the same replacement logic.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Addressed in efa13ed.

The command path now goes through the same NIP-33 replacement logic as the normal store path. Rather than duplicate the check, I extracted the lock + (created_at, id) dominance test + soft-delete into a shared buzz_db::nip33_stale_write_guard(&mut conn, …) and call it from both Db::replace_parameterized_event and persist_command_event, so the two paths cannot drift.

  • A dominated (older/retried) event now returns PersistResult::Duplicate, so the handler skips update_workflow entirely — the stale YAML never reaches the DB, and clients no longer see a newer event than the engine runs.
  • Among command kinds, only KIND_WORKFLOW_DEF (30620) is parameterized-replaceable (DM/approval kinds are 41xxx/46xxx), so behavior is unchanged for the rest.
  • Covered by test_workflow_update_and_delete (passing against a branch-built relay). Note: Nostr created_at is second-granularity, so same-second edits are resolved by the id tie-break — documented inline where the test sleeps 1s to stay deterministic.

👍 Useful catch — thanks.

Workflow def events (kind 30620) are routed to the command executor and
never reach Db::replace_parameterized_event, so persist_command_event
inserted them with plain ON CONFLICT DO NOTHING and no (created_at, id)
replacement check. An older event delivered after a newer one (reconnect
/retry within the timestamp window) was inserted and applied, overwriting
the workflows row with stale YAML while clients still saw the newer event
as latest.

Extract the lock + dominance check + soft-delete into a single shared
buzz_db::nip33_stale_write_guard(&mut conn, ...) and call it from both
Db::replace_parameterized_event and persist_command_event, so the command
and normal store paths apply identical NIP-33 replacement rules and cannot
drift. A dominated event is reported as an idempotent duplicate, so the
handler skips its mutation.

Only KIND_WORKFLOW_DEF among command kinds is parameterized-replaceable
(DM/approval kinds are 41xxx/46xxx), so behavior is unchanged for the rest.

Document in the e2e test why a 1s sleep is required: created_at is
second-granularity and same-second edits are resolved by the id tie-break.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Aaron Goldsmith <aargoldsmith@gmail.com>
@AaronGoldsmith AaronGoldsmith force-pushed the bugfix/workflow-update-delete branch from 8889232 to efa13ed Compare June 29, 2026 02:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant