
Worker Heartbeating #2818

Open
yuandrew wants to merge 1 commit into temporalio:master from yuandrew:worker-heartbeat-final

Conversation

yuandrew commented Mar 27, 2026

What was changed

Implements periodic worker heartbeat RPC that reports worker status, slot usage, poller info, host metrics, and sticky cache counters to the Temporal server. Includes HeartbeatManager, PollerTracker, and integration tests covering all heartbeat fields.

  • HeartbeatManager — Per-namespace heartbeat scheduler. Workers register/unregister; scheduler fires at the configured interval. Gracefully shuts down if server returns UNIMPLEMENTED.
  • PollerTracker — Tracks in-flight poll count and last successful poll time per worker type. Only records success when a poll returns actual work.
  • WorkflowClientOptions.workerHeartbeatInterval — New option to configure heartbeat interval. Defaults to 60s. Can be set between 1-60s, or a negative duration to disable.
  • Worker.getActiveTaskQueueTypes() — Reports WORKFLOW, ACTIVITY, and NEXUS (only when Nexus services are registered, matching Go SDK).
  • Worker.buildHeartbeat() — Assembles the full WorkerHeartbeat proto with slot info, poller info, host metrics, sticky cache counters, and timestamps.
  • Shutdown: Each worker sends ShutdownWorkerRequest with a final SHUTTING_DOWN heartbeat and active task queue types.
  • NexusWorker task counters — Fixed totalProcessedTasks/totalFailedTasks to be properly incremented.
  • TrackingSlotSupplier.getSlotSupplierKind() — Reports FixedSize vs ResourceBased in heartbeats.
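
The PollerTracker behavior described above (count in-flight polls, record a success timestamp only when a poll returns work) can be illustrated with a minimal sketch. This is not the SDK's actual class: `PollerTrackerSketch` and its method names are hypothetical, and the real tracker keeps this state per worker type.

```java
import java.time.Instant;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch; not the SDK's PollerTracker.
final class PollerTrackerSketch {
  private final AtomicInteger inFlightPolls = new AtomicInteger();
  private final AtomicReference<Instant> lastSuccessfulPoll = new AtomicReference<>();

  // Called just before a long-poll request is issued.
  void pollStarted() {
    inFlightPolls.incrementAndGet();
  }

  // Called when a poll returns. An empty response (long-poll timeout)
  // decrements the in-flight count but does not count as a success.
  void pollCompleted(boolean returnedTask, Instant now) {
    inFlightPolls.decrementAndGet();
    if (returnedTask) {
      lastSuccessfulPoll.set(now);
    }
  }

  int getInFlightPolls() {
    return inFlightPolls.get();
  }

  // Null until the first poll that returned actual work.
  Instant getLastSuccessfulPoll() {
    return lastSuccessfulPoll.get();
  }
}
```

Passing the timestamp in, rather than reading the clock inside the tracker, is a choice made here only to keep the sketch deterministic in tests.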

Why?

New feature!

Checklist

  1. Closes Worker Heartbeating #2716

  2. How was this tested:

  • WorkerHeartbeatIntegrationTest (12 tests): End-to-end against test server — basic fields, slot info, task counters, poller info, failure metrics, interval counter reset, in-flight slot tracking, sticky cache counters/misses, multiple workers, resource-based tuner, shutdown status, disabled heartbeats
  • HeartbeatManagerTest: Unit tests for scheduler lifecycle, UNIMPLEMENTED handling, exception resilience
  • PollerTrackerTest: Unit tests for poll tracking and snapshot generation
  • TrackingSlotSupplierKindTest: Slot supplier kind detection
  • WorkerHeartbeatDeploymentVersionTest: Deployment version in heartbeats
  • All existing tests pass (./gradlew :temporal-sdk:test + spotlessCheck)
  3. Any docs updates needed?

Note

Medium Risk
Adds new periodic heartbeat RPCs and augments worker shutdown requests, introducing new background scheduling and concurrency paths that could affect worker lifecycle, metrics, and server load if misconfigured.

Overview
Adds worker heartbeating to periodically report worker status and runtime stats to the server via a new recordWorkerHeartbeat RPC, configurable through WorkflowClientOptions#setWorkerHeartbeatInterval (defaults to 60s; negative disables; 1–60s enforced).
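
The interval rules stated above (60s default, negative disables, 1 to 60s enforced) could be checked roughly as follows. This is a sketch, not the PR's validation code: `HeartbeatIntervalSketch` and `resolve` are hypothetical names, and treating `null` as "unset" and rejecting zero as out of range are assumptions the PR text does not confirm.

```java
import java.time.Duration;
import java.util.Optional;

// Hypothetical sketch of the interval rules; not the SDK's implementation.
final class HeartbeatIntervalSketch {
  static final Duration DEFAULT = Duration.ofSeconds(60);
  static final Duration MIN = Duration.ofSeconds(1);
  static final Duration MAX = Duration.ofSeconds(60);

  // Returns the effective interval, or empty when heartbeating is disabled.
  static Optional<Duration> resolve(Duration configured) {
    if (configured == null) {
      return Optional.of(DEFAULT); // assumption: null means "not set"
    }
    if (configured.isNegative()) {
      return Optional.empty(); // negative duration disables heartbeats
    }
    if (configured.compareTo(MIN) < 0 || configured.compareTo(MAX) > 0) {
      // assumption: zero and sub-second values are rejected like any other out-of-range value
      throw new IllegalArgumentException(
          "workerHeartbeatInterval must be between 1s and 60s, or negative to disable");
    }
    return Optional.of(configured);
  }
}
```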

Introduces HeartbeatManager (per-namespace scheduler) and PollerTracker to track in-flight polls/last successful poll time, and wires them through workflow/activity/nexus pollers and workers to include slot usage, poller info, sticky cache hit/miss counters, host metrics, plugin list, and deployment version in the emitted WorkerHeartbeat.
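
For readers unfamiliar with the pattern, a fixed-interval scheduler that stops itself when the server does not support the RPC, but survives other failures, might look like the sketch below. It is only an illustration under assumptions: `HeartbeatSchedulerSketch` and `UnimplementedException` are hypothetical stand-ins (the real code reacts to a gRPC UNIMPLEMENTED status), and per-namespace worker registration is omitted.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; not the SDK's HeartbeatManager.
final class HeartbeatSchedulerSketch {
  // Stand-in for a gRPC UNIMPLEMENTED status returned by the server.
  static final class UnimplementedException extends RuntimeException {}

  private final ScheduledExecutorService executor =
      Executors.newSingleThreadScheduledExecutor();
  private final Runnable sendHeartbeat;

  HeartbeatSchedulerSketch(Runnable sendHeartbeat) {
    this.sendHeartbeat = sendHeartbeat;
  }

  // Fires tick() repeatedly at the configured interval.
  void start(long intervalMillis) {
    executor.scheduleAtFixedRate(this::tick, intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
  }

  // One heartbeat attempt. Package-private so it can be exercised without timers.
  void tick() {
    try {
      sendHeartbeat.run();
    } catch (UnimplementedException e) {
      // Server does not support heartbeats: stop scheduling further attempts.
      executor.shutdown();
    } catch (RuntimeException e) {
      // Any other failure is swallowed so the next scheduled tick still runs;
      // an uncaught exception would cancel a scheduleAtFixedRate task.
    }
  }

  boolean isStopped() {
    return executor.isShutdown();
  }

  void stop() {
    executor.shutdownNow();
  }
}
```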

Updates graceful shutdown to send richer ShutdownWorkerRequest (task queue, instance key, active task queue types, and final SHUTTING_DOWN heartbeat) and extends the test server plus new unit/integration tests to validate heartbeat contents, lifecycle, and disabled/unsupported behavior.

Written by Cursor Bugbot for commit bfd3fc6. This will update automatically on new commits.

Implements periodic worker heartbeat RPCs that report worker status,
slot usage, poller info, and task counters to the server.

Key components:
- HeartbeatManager: per-namespace scheduler that aggregates heartbeats
  from all workers sharing that namespace
- PollerTracker: tracks in-flight poll count and last successful poll time
- WorkflowClientOptions.workerHeartbeatInterval: configurable interval
  (default 60s, range 1-60s, negative to disable)
- TrackingSlotSupplier: extended with slot type reporting
- Worker: builds SharedNamespaceWorker heartbeat data from activity,
  workflow, and nexus worker stats
- TestWorkflowService: implements recordWorkerHeartbeat, describeWorker,
  and shutdownWorker RPCs for testing
@yuandrew yuandrew force-pushed the worker-heartbeat-final branch from 9eeea9b to bfd3fc6 Compare March 27, 2026 19:41
@yuandrew yuandrew marked this pull request as ready for review March 27, 2026 19:58
@yuandrew yuandrew requested a review from a team as a code owner March 27, 2026 19:58

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

namespace,
taskQueue,
workerInstanceKey,
getActiveTaskQueueTypes(),

Active task queue types captured before Nexus registration

Medium Severity

getActiveTaskQueueTypes() is called during the Worker constructor, but hasNexusServices is only set to true later when registerNexusServiceImplementation() is called. The resulting list is stored as a final field in WorkflowWorker and used in ShutdownWorkerRequest.addAllTaskQueueTypes(...). This means TASK_QUEUE_TYPE_NEXUS will never appear in shutdown requests, even when Nexus services are registered.
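
The issue reduces to snapshotting a value before the state it depends on has changed. The sketch below reproduces that with plain strings; `TaskQueueTypesSketch` is hypothetical, and deferring the call through a `Supplier` is just one possible remedy, not necessarily the fix this PR will adopt.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the eager-capture problem; not SDK code.
final class TaskQueueTypesSketch {
  private boolean hasNexusServices;

  // Mirrors registerNexusServiceImplementation() flipping the flag after construction.
  void registerNexusService() {
    hasNexusServices = true;
  }

  // Computes the list from current state on every call.
  List<String> getActiveTaskQueueTypes() {
    List<String> types = new ArrayList<>(Arrays.asList("WORKFLOW", "ACTIVITY"));
    if (hasNexusServices) {
      types.add("NEXUS");
    }
    return types;
  }
}
```

Usage showing the difference between an eager snapshot and a deferred lookup:

```java
TaskQueueTypesSketch worker = new TaskQueueTypesSketch();
// Eager: evaluated now, before any Nexus service is registered.
java.util.List<String> eager = worker.getActiveTaskQueueTypes();
// Deferred: evaluated when the shutdown request is actually built.
java.util.function.Supplier<java.util.List<String>> deferred = worker::getActiveTaskQueueTypes;
worker.registerNexusService();
// eager still lacks "NEXUS"; deferred.get() includes it.
```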

}
} catch (Exception e) {
totalFailedTasks.incrementAndGet();
throw e;

Failed tasks counted without incrementing total processed

Medium Severity

In ActivityWorker, NexusWorker, and LocalActivityWorker, when a task handler throws an exception, totalFailedTasks is incremented in the catch block but totalProcessedTasks is not (it's only incremented on the success path). This allows totalFailedTasks to exceed totalProcessedTasks, which is semantically incorrect. WorkflowWorker correctly handles this by always incrementing totalProcessedTasks in the finally block.
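
A minimal sketch of the counting pattern the comment attributes to WorkflowWorker: failures are counted in `catch`, and every task, successful or not, is counted in `finally`, so the failed count can never exceed the processed count. `TaskCountersSketch` and `handle` are hypothetical names used only for illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; not the SDK's worker code.
final class TaskCountersSketch {
  final AtomicInteger totalProcessedTasks = new AtomicInteger();
  final AtomicInteger totalFailedTasks = new AtomicInteger();

  void handle(Runnable task) {
    try {
      task.run();
    } catch (RuntimeException e) {
      totalFailedTasks.incrementAndGet();
      throw e; // preserve the original failure for the caller
    } finally {
      // Runs on both the success and the failure path.
      totalProcessedTasks.incrementAndGet();
    }
  }
}
```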
