
feat(metadata-db): add trigger evaluation queries and next_fire_at column#2102

Closed
shiyasmohd wants to merge 1 commit into main from shiyasmohd/job-trigger-queries

Conversation

@shiyasmohd
Contributor

  • Migration adding next_fire_at column to jobs_status with a filtered index
  • get_jobs_for_trigger_evaluation — fetches jobs where next_fire_at <= now()
  • update_next_fire_at — sets/clears the next fire time for periodic triggers
  • get_attempt_count_since_last_completed — counts scheduling attempts since last completed run

@shiyasmohd shiyasmohd self-assigned this Apr 7, 2026
@shiyasmohd shiyasmohd requested a review from LNSD April 7, 2026 07:59
@LNSD
Contributor

LNSD commented Apr 7, 2026

I believe that with the current tables, we can implement a trigger mechanism. I am thinking of something like this:

Trigger Evaluation Design

Overview

The controller holds an in-memory map of active periodic job timers. Each timer ticks at the job's trigger cadence. When a timer fires, the controller reschedules the job. No persistent scheduling state is needed beyond the existing job event log.

Data Structure

A map of JobId to a timer entry. Each entry holds:

  • A tokio::time::Interval configured with the trigger's period (or a stream, for the cron case)
  • The trigger config (parsed from the descriptor, to avoid re-fetching on each tick)
  • The job's created_at (needed for interval anchor computation)
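A std-only sketch of the map described above (the names `JobId`, `TimerEntry`, and `TriggerMap` are assumptions, not code from this repo; the real entry would also hold the `tokio::time::Interval` or cron stream, omitted here to keep the sketch dependency-free):

```rust
use std::collections::HashMap;
use std::time::{Duration, SystemTime};

type JobId = u64; // hypothetical; stands in for the real job identifier

struct TimerEntry {
    period: Duration,       // trigger cadence parsed from the descriptor
    created_at: SystemTime, // anchor for interval computation
}

struct TriggerMap {
    entries: HashMap<JobId, TimerEntry>,
}

impl TriggerMap {
    fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    // Insert-or-replace: a re-schedule replaces the old entry, never duplicates.
    fn upsert(&mut self, id: JobId, entry: TimerEntry) {
        self.entries.insert(id, entry);
    }

    // Called when a job is deleted or stopped.
    fn remove(&mut self, id: JobId) {
        self.entries.remove(&id);
    }
}
```

Using `HashMap::insert` for the upsert path is what gives the "at most one timer entry per job" invariant for free.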

Boot Recovery

On startup, the controller runs a single query to find all jobs that:

  • Are in a terminal state (COMPLETED, ERROR, FATAL)
  • Have a periodic trigger in their descriptor (interval or cron)

For each, it parses the trigger config, computes the next fire time via Trigger::next_fire_time(now, created_at), and creates an Interval starting at that time with the appropriate period. If the next fire time is already in the past (the controller was down and missed ticks), the interval fires immediately on the first poll.
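One plausible shape for the interval case of `Trigger::next_fire_time`, anchored at `created_at` (a sketch under assumptions: this variant always lands in the future, skipping missed ticks in line with the "no catch-up backfill" rule, whereas a variant anchored at the last run could land in the past and fire immediately; a cron trigger would compute from the cron expression instead):

```rust
use std::time::{Duration, SystemTime};

// Next tick of a schedule anchored at `created_at` with cadence `period`.
fn next_fire_time(now: SystemTime, created_at: SystemTime, period: Duration) -> SystemTime {
    let elapsed = now
        .duration_since(created_at)
        .unwrap_or(Duration::ZERO); // clock skew: treat "before anchor" as zero
    // Whole periods already elapsed since the anchor; the next tick is one more.
    // (u32 cast is fine for a sketch; a real implementation would guard overflow.)
    let ticks = (elapsed.as_nanos() / period.as_nanos()) as u32;
    created_at + period * (ticks + 1)
}
```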

Steady-State Loop

A single async task drives all timers. On each fire:

  1. Look up the job's current status
  2. If ERROR: mark FATAL first (abandon current retry cycle), then reschedule fresh
  3. If COMPLETED or FATAL: reschedule directly (new SCHEDULED event)
  4. Compute the next fire time and let the interval tick again

The loop uses something like FuturesUnordered or a select!-based approach over the map entries, waking only when the next timer is due. Zero work between fires.
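The status dispatch in steps 2-3 can be sketched as a pure function (names are hypothetical, the `select!`/`FuturesUnordered` plumbing is omitted, and the `Skip` arm for non-terminal states is my assumption, since the map normally only holds terminal-state jobs):

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Status {
    Scheduled,
    Running,
    Completed,
    Error,
    Fatal,
}

#[derive(Debug, PartialEq)]
enum FireAction {
    MarkFatalThenReschedule, // ERROR: abandon the current retry cycle first
    Reschedule,              // COMPLETED or FATAL: emit a fresh SCHEDULED event
    Skip,                    // not terminal yet: let a later tick handle it
}

// Decision taken when a periodic job's timer fires.
fn on_timer_fire(status: Status) -> FireAction {
    match status {
        Status::Error => FireAction::MarkFatalThenReschedule,
        Status::Completed | Status::Fatal => FireAction::Reschedule,
        Status::Scheduled | Status::Running => FireAction::Skip,
    }
}
```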

Lifecycle Events

Job completes/fails (reaches terminal state):

  • If the job has a periodic trigger and isn't already in the map, compute next fire time and insert a new timer entry

Job is rescheduled (re-deploy, config change):

  • Remove the old timer entry
  • Parse the new trigger config from the updated descriptor
  • Compute next fire time and insert a new timer entry
  • This handles trigger cadence changes (e.g., interval changed from 5min to 10min)

Job is deleted or stopped:

  • Remove the timer entry from the map

Job has a one-shot trigger:

  • Never enters the map. One-shot jobs are unaffected by this system.
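The four lifecycle rules above can be condensed into one handler (a minimal sketch with assumed names; map values hold just the cadence here, where the real entry would also carry the parsed trigger config and `created_at` anchor):

```rust
use std::collections::HashMap;
use std::time::Duration;

type JobId = u64; // hypothetical; stands in for the real job identifier

// Lifecycle events that touch the timer map.
// `period: None` models a one-shot trigger, which never enters the map.
enum Lifecycle {
    ReachedTerminal { period: Option<Duration> },
    Rescheduled { period: Duration },
    DeletedOrStopped,
}

fn apply(map: &mut HashMap<JobId, Duration>, id: JobId, ev: Lifecycle) {
    match ev {
        // Periodic job hit a terminal state: insert a timer if absent.
        Lifecycle::ReachedTerminal { period: Some(p) } => {
            map.entry(id).or_insert(p);
        }
        // One-shot trigger: unaffected by this system.
        Lifecycle::ReachedTerminal { period: None } => {}
        // Re-deploy/config change: replace the old entry with the new cadence.
        Lifecycle::Rescheduled { period } => {
            map.insert(id, period);
        }
        // Deleted or stopped: drop the timer.
        Lifecycle::DeletedOrStopped => {
            map.remove(&id);
        }
    }
}
```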

What This Replaces

  • No next_fire_at column on jobs_status
  • No migration
  • No update_next_fire_at mutation
  • No polling query on every reconciliation tick
  • The get_jobs_for_trigger_evaluation query becomes a boot-only query (and can use the JSONB containment filter since it only runs once)

Invariants

  • The map is derived state -- it's fully reconstructable from the event log at any time. A crash and restart rebuilds it from scratch.
  • The map only contains periodic jobs in terminal states. Running/scheduled jobs are not in the map (they haven't fired their trigger yet).
  • Each job has at most one timer entry. Re-schedule replaces, never duplicates.

Edge Cases

  • Controller restart: boot recovery reconstructs the map. Missed ticks fire immediately (the computed next_fire_time will be in the past, so the interval fires on first poll).
  • Long downtime: same as above. Only the next fire time matters, not how many ticks were missed. No catch-up backfill.
  • Trigger config change while running: the running job completes, reaches terminal state, and the lifecycle event picks up the new descriptor to compute the next fire time. The old timer (if any) is replaced.

@shiyasmohd
Contributor Author

Implemented the in-memory trigger in #2109

@shiyasmohd shiyasmohd closed this Apr 8, 2026