Implements periodic worker heartbeat RPCs that report worker status, slot usage, poller info, and task counters to the server.

Key components:
- HeartbeatManager: per-namespace scheduler that aggregates heartbeats from all workers sharing that namespace
- PollerTracker: tracks in-flight poll count and last successful poll time
- WorkflowClientOptions.workerHeartbeatInterval: configurable interval (default 60s, range 1-60s, negative to disable)
- TrackingSlotSupplier: extended with slot type reporting
- Worker: builds SharedNamespaceWorker heartbeat data from activity, workflow, and nexus worker stats
- TestWorkflowService: implements recordWorkerHeartbeat, describeWorker, and shutdownWorker RPCs for testing
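The interval rules stated above (default 60s, range 1-60s, negative to disable) could be checked along these lines. This is only a sketch: `HeartbeatIntervalSketch` and `resolve` are hypothetical names, not SDK code, and treating `null` as "use the default" and rejecting zero are assumptions the description does not confirm.

```java
import java.time.Duration;
import java.util.Optional;

// Hypothetical helper, not part of the SDK.
class HeartbeatIntervalSketch {
  static final Duration DEFAULT_INTERVAL = Duration.ofSeconds(60);
  static final Duration MIN_INTERVAL = Duration.ofSeconds(1);
  static final Duration MAX_INTERVAL = Duration.ofSeconds(60);

  /**
   * Returns the interval to schedule heartbeats at, or empty when heartbeating is disabled.
   * Assumption: null means "not configured" and falls back to the default.
   */
  static Optional<Duration> resolve(Duration configured) {
    if (configured == null) {
      return Optional.of(DEFAULT_INTERVAL);
    }
    if (configured.isNegative()) {
      // A negative duration disables heartbeating.
      return Optional.empty();
    }
    // Assumption: zero is out of range, like any other value below 1s.
    if (configured.compareTo(MIN_INTERVAL) < 0 || configured.compareTo(MAX_INTERVAL) > 0) {
      throw new IllegalArgumentException(
          "workerHeartbeatInterval must be between 1s and 60s, or negative to disable");
    }
    return Optional.of(configured);
  }
}
```

Returning an empty `Optional` for the disabled case keeps the caller from having to compare against a sentinel duration.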
| namespace, | ||
| taskQueue, | ||
| workerInstanceKey, | ||
| getActiveTaskQueueTypes(), |
Active task queue types captured before Nexus registration
Medium Severity
getActiveTaskQueueTypes() is called during the Worker constructor, but hasNexusServices is only set to true later when registerNexusServiceImplementation() is called. The resulting list is stored as a final field in WorkflowWorker and used in ShutdownWorkerRequest.addAllTaskQueueTypes(...). This means TASK_QUEUE_TYPE_NEXUS will never appear in shutdown requests, even when Nexus services are registered.
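One possible way to avoid the stale snapshot is to hand the shutdown path a supplier that is evaluated when the request is built, rather than a list captured in the constructor. The sketch below only illustrates the timing issue: `SketchWorker`, `EagerShutdown`, and `LazyShutdown` are hypothetical stand-ins, and the nested enum replaces the real proto task queue type.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical model of the issue; none of these classes exist in the SDK.
class LazyTaskQueueTypesSketch {
  enum TaskQueueType { WORKFLOW, ACTIVITY, NEXUS }

  static class SketchWorker {
    private volatile boolean hasNexusServices = false;

    void registerNexusServiceImplementation() {
      hasNexusServices = true;
    }

    List<TaskQueueType> getActiveTaskQueueTypes() {
      List<TaskQueueType> types = new ArrayList<>();
      types.add(TaskQueueType.WORKFLOW);
      types.add(TaskQueueType.ACTIVITY);
      if (hasNexusServices) {
        types.add(TaskQueueType.NEXUS);
      }
      return types;
    }
  }

  // Mirrors the reported bug: the list is frozen at construction time.
  static class EagerShutdown {
    private final List<TaskQueueType> types;

    EagerShutdown(SketchWorker worker) {
      this.types = worker.getActiveTaskQueueTypes();
    }

    List<TaskQueueType> shutdownTypes() {
      return types;
    }
  }

  // Possible fix: defer the call until the shutdown request is assembled.
  static class LazyShutdown {
    private final Supplier<List<TaskQueueType>> typesSupplier;

    LazyShutdown(Supplier<List<TaskQueueType>> typesSupplier) {
      this.typesSupplier = typesSupplier;
    }

    List<TaskQueueType> shutdownTypes() {
      return typesSupplier.get();
    }
  }
}
```

With this shape, a Nexus registration that happens after construction is still visible at shutdown, because `worker::getActiveTaskQueueTypes` is only invoked when `shutdownTypes()` runs.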
| } | ||
| } catch (Exception e) { | ||
| totalFailedTasks.incrementAndGet(); | ||
| throw e; |
Failed tasks counted without incrementing total processed
Medium Severity
In ActivityWorker, NexusWorker, and LocalActivityWorker, when a task handler throws an exception, totalFailedTasks is incremented in the catch block but totalProcessedTasks is not (it's only incremented on the success path). This allows totalFailedTasks to exceed totalProcessedTasks, which is semantically incorrect. WorkflowWorker correctly handles this by always incrementing totalProcessedTasks in the finally block.
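The pattern attributed to WorkflowWorker above can be sketched as follows. The counter names come from the comment; `TaskCountersSketch` and `runTask` are hypothetical, and using a plain `Runnable` as the task handler is a simplification.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical wrapper; the real workers hold these counters themselves.
class TaskCountersSketch {
  final AtomicLong totalProcessedTasks = new AtomicLong();
  final AtomicLong totalFailedTasks = new AtomicLong();

  void runTask(Runnable handler) {
    try {
      handler.run();
    } catch (RuntimeException e) {
      // Count the failure, then let the exception propagate unchanged.
      totalFailedTasks.incrementAndGet();
      throw e;
    } finally {
      // Runs on both the success and the failure path, so
      // totalFailedTasks can never exceed totalProcessedTasks.
      totalProcessedTasks.incrementAndGet();
    }
  }
}
```

Moving the processed increment into `finally` keeps the invariant without duplicating the increment in each branch.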


What was changed
Implements periodic worker heartbeat RPC that reports worker status, slot usage, poller info, host metrics, and sticky cache counters to the Temporal server. Includes HeartbeatManager, PollerTracker, and integration tests covering all heartbeat fields.
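As a rough illustration of what a poller tracker of this kind might look like, the sketch below keeps an in-flight poll count and the time of the last poll that returned work. `PollerTrackerSketch` and its method names are hypothetical and are not taken from the PR; the actual PollerTracker may be structured differently.

```java
import java.time.Instant;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch, not the PollerTracker from this PR.
class PollerTrackerSketch {
  private final AtomicInteger inFlightPolls = new AtomicInteger();
  private final AtomicReference<Instant> lastSuccessfulPollTime = new AtomicReference<>();

  /** Call just before issuing a poll request. */
  void pollStarted() {
    inFlightPolls.incrementAndGet();
  }

  /** Call when a poll request returns; returnedWork is false for empty responses. */
  void pollCompleted(boolean returnedWork) {
    inFlightPolls.decrementAndGet();
    if (returnedWork) {
      // Empty polls (for example, long-poll timeouts) do not count as successes.
      lastSuccessfulPollTime.set(Instant.now());
    }
  }

  int getInFlightPolls() {
    return inFlightPolls.get();
  }

  /** Returns null until the first poll that returned work. */
  Instant getLastSuccessfulPollTime() {
    return lastSuccessfulPollTime.get();
  }
}
```

Atomics are used here on the assumption that several poller threads report into one tracker per worker type.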
- HeartbeatManager: Per-namespace heartbeat scheduler. Workers register/unregister; scheduler fires at the configured interval. Gracefully shuts down if server returns UNIMPLEMENTED.
- PollerTracker: Tracks in-flight poll count and last successful poll time per worker type. Only records success when a poll returns actual work.
- WorkflowClientOptions.workerHeartbeatInterval: New option to configure heartbeat interval. Defaults to 60s. Can be set between 1-60s, or a negative duration to disable.
- Worker.getActiveTaskQueueTypes(): Reports WORKFLOW, ACTIVITY, and NEXUS (only when Nexus services are registered, matching Go SDK).
- Worker.buildHeartbeat(): Assembles the full WorkerHeartbeat proto with slot info, poller info, host metrics, sticky cache counters, and timestamps.
- TrackingSlotSupplier.getSlotSupplierKind(): Reports FixedSize vs ResourceBased in heartbeats.

Why?
New feature!
Checklist
Closes Worker Heartbeating #2716
How was this tested:
Note
Medium Risk
Adds new periodic heartbeat RPCs and augments worker shutdown requests, introducing new background scheduling and concurrency paths that could affect worker lifecycle, metrics, and server load if misconfigured.
Overview
Adds worker heartbeating to periodically report worker status and runtime stats to the server via a new recordWorkerHeartbeat RPC, configurable through WorkflowClientOptions#setWorkerHeartbeatInterval (defaults to 60s; negative disables; 1–60s enforced).

Introduces HeartbeatManager (per-namespace scheduler) and PollerTracker to track in-flight polls/last successful poll time, and wires them through workflow/activity/nexus pollers and workers to include slot usage, poller info, sticky cache hit/miss counters, host metrics, plugin list, and deployment version in the emitted WorkerHeartbeat.

Updates graceful shutdown to send richer ShutdownWorkerRequest (task queue, instance key, active task queue types, and final SHUTTING_DOWN heartbeat) and extends the test server plus new unit/integration tests to validate heartbeat contents, lifecycle, and disabled/unsupported behavior.

Written by Cursor Bugbot for commit bfd3fc6. This will update automatically on new commits. Configure here.