fix(workflows): in-place updates, deletion-forgery fix, webhook-secret & channel immutability#1340
Conversation
…ent UUID Signed-off-by: Aaron Goldsmith <aargoldsmith@gmail.com>
…nd enforce channel immutability on updates Signed-off-by: Aaron Goldsmith <aargoldsmith@gmail.com>
…hannel_id update scenario Signed-off-by: Aaron Goldsmith <aargoldsmith@gmail.com>
AaronGoldsmith
left a comment
There was a problem hiding this comment.
Self-review: flagging the non-obvious / higher-blast-radius bits so they're easy to scrutinize.
| match state.db.get_workflow(wf_id).await { | ||
| Ok(wf) => { | ||
| if wf.owner_pubkey != owner_bytes { | ||
| return Err(anyhow::anyhow!("forbidden: deletion owner mismatch")); |
There was a problem hiding this comment.
🔒 Security (deletion forgery). NIP-09 a-tag deletion previously only checked the actor matched the coordinate pubkey, then deleted by UUID. A forged 30620:<attacker>:<victim_uuid> could delete another user's workflow. We now load the row and assert owner_pubkey before deleting. This extra lookup is the ownership gate, not a redundant query.
| } | ||
|
|
||
| // Check if workflow already exists to perform update or create checks | ||
| let existing = state.db.get_workflow(workflow_id).await; |
There was a problem hiding this comment.
🔍 Ordering is load-bearing (info-leak). The is_member_cached gate above (line 580) runs before this lookup on purpose, so unauthorized users can't probe for valid workflow UUIDs via differing error responses. Please don't hoist this lookup earlier for efficiency — it silently reintroduces the leak.
| "forbidden: cannot update a workflow owned by another user".into(), | ||
| )); | ||
| } | ||
| if record.channel_id != Some(channel_id) { |
There was a problem hiding this comment.
⛓️ Channel immutability + legacy NULL. record.channel_id != Some(channel_id) is deliberate: it rejects both some→some channel moves and the legacy NULL→some case. A naive unwrap/compare would let legacy global workflows be silently re-scoped.
| let secret = webhook_secret::generate_webhook_secret(); | ||
| webhook_secret::inject_secret(&mut definition_json, &secret); | ||
| Some(secret) | ||
| let existing_secret = existing_record |
There was a problem hiding this comment.
🔑 Webhook secret preservation. On update we re-inject the existing secret from the stored row (returning None so it isn't rotated/re-surfaced). Previously every edit regenerated _webhook_secret, breaking existing webhook callers.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9e34ed4b9e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| state | ||
| .db | ||
| .update_workflow(workflow_id, &workflow_name, &definition_json_final, &hash) | ||
| .await | ||
| .map_err(|e| IngestError::Internal(format!("error: db update_workflow: {e}")))?; |
There was a problem hiding this comment.
Ignore stale workflow definition replays
When an older kind:30620 event for the same workflow is delivered after a newer one (for example after a reconnect/retry, while still inside the relay's timestamp window), this unconditional update_workflow overwrites the workflows table with the stale YAML. The command path uses persist_command_event rather than the NIP-33 replace_parameterized_event stale-write check, so clients will still see the newer event as the latest while the workflow engine reads and runs the older definition from the DB. Please compare the incoming event's (created_at, id) against the current live coordinate before applying the update, or route workflow defs through the same replacement logic.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
🤖 Addressed in efa13ed.
The command path now goes through the same NIP-33 replacement logic as the normal store path. Rather than duplicate the check, I extracted the lock + (created_at, id) dominance test + soft-delete into a shared buzz_db::nip33_stale_write_guard(&mut conn, …) and call it from both Db::replace_parameterized_event and persist_command_event, so the two paths cannot drift.
- A dominated (older/retried) event now returns
PersistResult::Duplicate, so the handler skipsupdate_workflowentirely — the stale YAML never reaches the DB, and clients no longer see a newer event than the engine runs. - Among command kinds, only
KIND_WORKFLOW_DEF(30620) is parameterized-replaceable (DM/approval kinds are 41xxx/46xxx), so behavior is unchanged for the rest. - Covered by
test_workflow_update_and_delete(passing against a branch-built relay). Note: Nostrcreated_atis second-granularity, so same-second edits are resolved by the id tie-break — documented inline where the test sleeps 1s to stay deterministic.
👍 Useful catch — thanks.
Workflow def events (kind 30620) are routed to the command executor and never reach Db::replace_parameterized_event, so persist_command_event inserted them with plain ON CONFLICT DO NOTHING and no (created_at, id) replacement check. An older event delivered after a newer one (reconnect /retry within the timestamp window) was inserted and applied, overwriting the workflows row with stale YAML while clients still saw the newer event as latest. Extract the lock + dominance check + soft-delete into a single shared buzz_db::nip33_stale_write_guard(&mut conn, ...) and call it from both Db::replace_parameterized_event and persist_command_event, so the command and normal store paths apply identical NIP-33 replacement rules and cannot drift. A dominated event is reported as an idempotent duplicate, so the handler skips its mutation. Only KIND_WORKFLOW_DEF among command kinds is parameterized-replaceable (DM/approval kinds are 41xxx/46xxx), so behavior is unchanged for the rest. Document in the e2e test why a 1s sleep is required: created_at is second-granularity and same-second edits are resolved by the id tie-break. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Aaron Goldsmith <aargoldsmith@gmail.com>
8889232 to
efa13ed
Compare
Summary
Workflows that were edited or deleted kept running concurrently. The root cause: the relay called
create_workflow, which generated a random DB primary key on every publish, so each edit inserted a new active row (the old one kept firing) and client-side deletions never matched a row.This PR makes the client-supplied UUID (the Nostr parameterized-replaceable-event
dtag) the stable primary key, so edits update in place and deletions resolve correctly. It also fixes three security/stability issues found in review:handle_workflow_defresolves the existing row by UUID and callsupdate_workflowinstead of unconditionally inserting. No more duplicate concurrent cron jobs.a-tag deletion previously only checked the actor matched the coordinate pubkey, then deleted by UUID without confirming the DB row's owner. A forged30620:<attacker>:<victim_uuid>could delete another user's workflow. Now the row is loaded andowner_pubkeyis verified before deleting._webhook_secret, breaking existing callers. The existing secret is now extracted from the stored row and re-injected on update.record.channel_id != Some(channel_id)rejects moving a workflow between channels, including the legacyNULL-channel case.replace_parameterized_event, so an older event delivered after a newer one (reconnect/retry) could overwrite the row with stale YAML. The lock +(created_at, id)dominance check + soft-delete is now extracted into a sharedbuzz_db::nip33_stale_write_guard, called by bothreplace_parameterized_eventandpersist_command_eventso the two paths can't drift. Dominated events become idempotent duplicates and skip the mutation.Changes
buzz-db/src/workflow.rs—create_workflownow takesid: Uuidinstead of generating one internally.buzz-db/src/lib.rs— extractednip33_stale_write_guard(&mut conn, …);replace_parameterized_eventnow calls it (behavior unchanged) so the command path can share it.buzz-relay/src/handlers/command_executor.rs—handle_workflow_defparses thed-tag UUID, looks up any existing row (after the membership gate), enforces owner + channel-immutability, preserves the webhook secret, and routes toupdate_workflowvscreate_workflow.persist_command_eventapplies the shared NIP-33 stale-write guard for parameterized-replaceable command kinds.buzz-relay/src/handlers/side_effects.rs—handle_a_tag_deletionloads the workflow and assertsowner_pubkeybefore deleting; missing rows are logged and treated as a no-op.buzz-test-client/tests/e2e_workflows.rs—test_workflow_update_and_deleteexpanded to assert in-place update, webhook-secret stability, channel-change rejection, and deletion-forgery rejection (asserting the row persists).Verification
Manual before/after: a 1-minute cron edited to v2/v3 previously left multiple duplicate jobs firing and could not be deleted; after the fix only the latest definition fires and a NIP-09 deletion stops it immediately.
Reviewer note — pre-existing duplication left out of scope
While extracting
nip33_stale_write_guard, I deduped the two NIP-33 replacement copies the reviewer flagged (replace_parameterized_event+ the new command path). A third copy of the FNV-1a advisory-lock-key snippet still lives inDb::replace_addressable_event(relay-signed NIP-29 group metadata, kinds 39000–39002). It predates this PR (present onmain) and was intentionally left untouched here:channel_id(the relay is the author, so pubkey isn't the natural key), so it isn't a drop-in for the same helper.Flagging it as a good standalone follow-up (factor the lock-key into a shared
fnv_coordinate_lock_keyand have all three call it).