feat(metrics): Add per-component CPU usage metric (poll-hook) #25317
gwenaskell wants to merge 19 commits into master
Conversation
…dary
For the concurrent transform path, replace the inline ThreadTime brackets inside the spawned async block with a CpuTimedFuture adapter that samples thread CPU time around every Future::poll and accumulates the delta into the component_cpu_usage_ns_total counter. Within a single poll, tokio cannot migrate the task or run another task on this thread, so each before/after sample pair is a clean per-thread CPU measurement; multi-poll futures accumulate correctly, which keeps the wrapper applicable if the body ever grows .await points and leaves room for future task-transform coverage. The inline path is unchanged: its body is sync and runs in the transform's own task, so direct measurement is the simplest correct option. The RFC is updated to describe the wrapper approach in Rationale, Plan Of Attack, and Future Improvements.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
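As a rough illustration of the adapter described above, here is a minimal sketch that compiles on its own. It is not the PR's code: `std::time::Instant` (wall clock) and an `Arc<AtomicU64>` stand in for the per-thread CPU clock and the metrics `Counter`, and it requires `Unpin` rather than using pin-projection.

```rust
use std::{
    future::Future,
    pin::Pin,
    sync::{
        atomic::{AtomicU64, Ordering},
        Arc,
    },
    task::{Context, Poll},
    time::Instant,
};

/// Wraps a future and charges the time spent inside each `poll` call to a
/// shared counter. `Instant` stands in for the per-thread CPU clock
/// (CLOCK_THREAD_CPUTIME_ID on Linux); `Arc<AtomicU64>` stands in for the
/// metrics `Counter` handle.
struct CpuTimedFuture<F> {
    inner: F,
    cpu_ns: Arc<AtomicU64>,
}

impl<F: Future + Unpin> Future for CpuTimedFuture<F> {
    type Output = F::Output;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        // `get_mut` works because F: Unpin; the real adapter projects instead.
        let this = self.get_mut();
        // Bracket only the inner poll: time parked in Pending is excluded, and
        // within one poll the task cannot migrate to another worker thread.
        let before = Instant::now();
        let result = Pin::new(&mut this.inner).poll(cx);
        let elapsed = before.elapsed().as_nanos() as u64;
        this.cpu_ns.fetch_add(elapsed, Ordering::Relaxed);
        result
    }
}
```

Swapping the wall clock for a per-thread CPU clock (sketched further below) turns the accumulated value into CPU time rather than elapsed time.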
> The value is cumulative CPU nanoseconds consumed by the component. Operators
> use it to compute CPU core utilization:
>
> (PromQL example elided from this excerpt)

> with each poll independently sampling the thread it ran on. This isolates
> the timing concern from the transform body and keeps it robust if the body
> ever grows `.await` points.
> - **Low overhead.** Two `clock_gettime` calls per poll (~80ns total on Linux)
>   […] far cheaper. Per-event latency can be derived from the counter and
>   `events_sent_total` if needed (`cpu_ns / events = avg cpu ns per event`).

> ### `getrusage(RUSAGE_THREAD)` instead of `clock_gettime`
>
> On Linux, `getrusage(RUSAGE_THREAD)` also provides per-thread CPU time (as
> `ru_utime` + `ru_stime`).
>
> **Not preferred because:** `clock_gettime(CLOCK_THREAD_CPUTIME_ID)` has […]
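For context, here is a minimal Linux-only sketch of the `clock_gettime(CLOCK_THREAD_CPUTIME_ID)` read the excerpt compares against `getrusage`, using the `libc` crate. It is illustrative and not necessarily how the PR's `ThreadTime` is implemented.

```rust
use std::time::Duration;

/// CPU time consumed by the calling thread (Linux). macOS and Windows need
/// their own platform-specific readers.
#[cfg(target_os = "linux")]
fn thread_cpu_time() -> Duration {
    // SAFETY: a zeroed timespec is a valid out-parameter for clock_gettime.
    let mut ts: libc::timespec = unsafe { std::mem::zeroed() };
    let rc = unsafe { libc::clock_gettime(libc::CLOCK_THREAD_CPUTIME_ID, &mut ts) };
    assert_eq!(rc, 0, "the thread CPU clock is always readable on Linux");
    Duration::new(ts.tv_sec as u64, ts.tv_nsec as u32)
}
```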
…trait
Replace the explicit CpuTimedFuture::new constructor with a CpuTimedExt
trait so the wrapper composes naturally with .in_current_span() and
similar future-extension methods:

    async move { ... }
        .cpu_timed(cpu_ns.clone())
        .in_current_span()

Mirrors the style of tracing::Instrument::in_current_span. No behavior
change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add a `cpu_ns: Counter` field to `TransformContext`, defaulting to `Counter::noop()`. The topology builder resolves the counter once, inside the transform `error_span!` so it is tagged with the right component_id / component_kind / component_type, and stores it on the context. This is the single Counter handle every transform path consumes — sync, task, and any helper tokio tasks — so label resolution and recorder lookup are paid once at construction time rather than on every poll.

For task transforms (`build_task_transform`), wrap the outer task future with `.cpu_timed(counter)` before `.boxed()`. CPU time is accumulated across every poll of the task; multi-poll futures accumulate correctly, and time the task spends parked in `Pending` is naturally excluded.

For transforms that spawn long-running helper tokio tasks at construction time, plumb the counter through and `.cpu_timed(...)` those spawns too:

- `aws_ec2_metadata`: the periodic IMDS-refresh worker.
- `throttle`'s `RateLimiterRunner`: the periodic `retain_recent` flush loop. The counter is plumbed through `RateLimiterRunner::start` as a parameter.

Without this, those helpers' CPU would be silently excluded.

The bracket scope for task transforms is slightly wider than for sync transforms — it includes input-channel polls, the Utilization / OutputUtilization wrappers, and the fanout-send loop — but channel / fanout overhead is small relative to transform work, so the metric remains comparable across kinds.

RFC and changelog updated to reflect the broader coverage.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
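As a rough illustration of the resolve-once pattern described in this commit, here is a sketch that passes the component tags explicitly. It is not the PR's code: the PR derives these tags from the surrounding tracing span, and the sketch assumes a `metrics` crate version whose `counter!` macro returns a reusable `Counter` handle.

```rust
use metrics::{counter, Counter};

/// Resolve the per-component CPU counter once at build time; every poll then
/// just calls `increment` on the returned handle, so label and recorder
/// lookup are not paid on the hot path.
fn resolve_cpu_counter(component_id: &str, component_kind: &str, component_type: &str) -> Counter {
    counter!(
        "component_cpu_usage_ns_total",
        "component_id" => component_id.to_owned(),
        "component_kind" => component_kind.to_owned(),
        "component_type" => component_type.to_owned()
    )
}
```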
Restructure the "Scope of the measurement" rationale bullet to make the upstream-isolation property explicit. Vector components only communicate via BufferReceiver / BufferSender channels (never via stream combinators chained across component boundaries), so polling a task transform's input dequeues items but never runs the upstream's code. Upstream CPU was charged to its own cpu_ns when it ran in its own task. Spell out what is and is not included in cpu_ns. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Threading cpu_ns as a separate argument pushed build_task_transform above clippy's too_many_arguments threshold. Mirror build_sync_transform by taking the whole TransformNode and destructuring at the top. The later `let mut outputs = HashMap::new()` shadows the destructured Vec — fine since the Vec is only used earlier when building the schema_definition_map. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7a66b66dd7
> tags: _component_tags
> }
> component_cpu_usage_ns_total: {
>   description: "The CPU time consumed by a component in nanoseconds. Available for sync and function transforms."
Update CPU metric docs to include task transforms
The new implementation records component_cpu_usage_ns_total for task transforms as well (see build_task_transform now wrapping the task future with .cpu_timed(...)), but this description still says the metric is only available for sync/function transforms. This creates incorrect user-facing documentation and can cause operators to miss or misinterpret task-transform CPU data.
bruceg left a comment
This looks basically sound so long as the performance impact is minimal
> merged_schema_definition: merged_definition.clone(),
> schema: self.config.schema,
> extra_context: self.extra_context.clone(),
> cpu_ns,
nit: could just instantiate the `counter!` here like all the other fields.
> self.quota,
> self.clock.clone(),
> self.flush_keys_interval,
> self.cpu_ns.clone(),
At some point it might make more sense to just pass self.
> /// Extension trait that wraps a future in [`CpuTimedFuture`] via a chained
> /// call:
> ///
> /// ```ignore
> /// async move { /* work */ }.cpu_timed(counter)
> /// ```
> ///
> /// Mirrors the style of [`tracing::Instrument::in_current_span`].
> pub(crate) trait CpuTimedExt: Future + Sized {
>     fn cpu_timed(self, counter: Counter) -> CpuTimedFuture<Self> {
>         CpuTimedFuture {
>             inner: self,
>             counter,
>         }
>     }
> }
>
> impl<F: Future> CpuTimedExt for F {}
If I'm reading right, all calls to cpu_timed follow a tokio::spawn. What about providing a wrapper for that whole tokio::spawn(future).cpu_timed(counter) sequence instead, like fn spawn_timed(…)? It will also make it more visible when a task is spawned without adding the timer accounting.
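A sketch of what the suggested helper could look like; the name `spawn_timed` comes from the review comment, not existing code, and the body assumes the PR's `CpuTimedExt` trait is in scope.

```rust
use std::future::Future;

use metrics::Counter;
use tokio::task::JoinHandle;

/// Hypothetical helper combining `tokio::spawn` with the CPU-timing wrapper,
/// so any spawn that skips timer accounting stands out in review.
/// Assumes the PR's `CpuTimedExt` trait (providing `.cpu_timed`) is imported.
pub(crate) fn spawn_timed<F>(future: F, counter: Counter) -> JoinHandle<F::Output>
where
    F: Future + Send + 'static,
    F::Output: Send + 'static,
{
    tokio::spawn(future.cpu_timed(counter))
}
```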
> let this = self.project();
> let t0 = ThreadTime::now();
It's trivial, but consider flipping these operations to let this go straight from project() to usage.
> This metric is always emitted for transforms; there is no configuration knob.
>
> ## Rationale
This RFC is missing the implementation plan, which should be the primary focus of the rationale here. Basically move most of the plan into an implementation section and just reference points in the plan. A bunch of this rationale is also explaining how it is implemented too.
One note: this PR is both an implementation and an RFC, which is unusual. We usually implement the RFC after it has been approved.
> The channel-poll / fanout-send bookkeeping our wrapper does include is
> small relative to the transform's own work, so the metric remains a
> meaningful comparator across transform kinds.
This seems to refer to past implementation.
> is negligible relative to the work `transform_all` performs.
> - **No accumulation errors.** The counter stores `u64` nanoseconds; each
>   increment is exact integer arithmetic. The single `u64 → f64` cast at scrape
>   time has bounded, non-accumulated error.
nit: the "error" is specifically precision loss.
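To make that concrete, a small standalone check (plain Rust, unrelated to the PR's code): an `f64` mantissa holds 53 bits, so the scrape-time cast stays exact until the accumulated counter exceeds 2^53 ns, roughly 104 days of CPU time.

```rust
fn main() {
    // f64 carries a 53-bit mantissa, so integers up to 2^53 round-trip exactly.
    let exact_limit: u64 = 1 << 53; // 9_007_199_254_740_992 ns, about 104 days of CPU time
    assert_eq!(exact_limit as f64 as u64, exact_limit);

    // One nanosecond past that limit rounds back down in the cast. The u64
    // counter itself never drifts; only the scrape-time f64 view loses precision.
    let just_past = exact_limit + 1;
    assert_eq!(just_past as f64 as u64, exact_limit);
}
```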
> - **Platform-specific code.** The precise implementation uses `cfg`-gated FFI
>   for Linux, macOS, and Windows. Other platforms fall back to wall-clock time,
>   giving three maintained code paths plus one fallback.
Should we instead refuse to emit the metric if we can't actually get CPU time? It's a misleading measure otherwise.
Yes, and it might be easy to do with this approach, I'll look into it
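One possible shape for that, sketched under the assumption that the thread-CPU read is made fallible (the function name is hypothetical and the fallback behavior is exactly the open question above):

```rust
use std::time::Duration;

/// Returns thread CPU time where the platform supports it, None otherwise.
/// Callers that get None skip the counter update entirely, so the metric is
/// simply absent instead of silently switching to wall-clock time.
fn try_thread_cpu_time() -> Option<Duration> {
    #[cfg(target_os = "linux")]
    {
        // SAFETY: a zeroed timespec is a valid out-parameter for clock_gettime.
        let mut ts: libc::timespec = unsafe { std::mem::zeroed() };
        let rc = unsafe { libc::clock_gettime(libc::CLOCK_THREAD_CPUTIME_ID, &mut ts) };
        (rc == 0).then(|| Duration::new(ts.tv_sec as u64, ts.tv_nsec as u32))
    }
    #[cfg(not(target_os = "linux"))]
    {
        None // macOS / Windows readers would slot in here with their own cfg blocks.
    }
}
```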
> 1. **User/system split:** Should we report user and system CPU time separately
>    (as `mode="user"` / `mode="system"` tags) like `host_cpu_seconds_total`
>    does? The Linux API supports this. It adds cardinality but helps distinguish
>    transforms that trigger syscalls (e.g., enrichment table lookups) from pure
>    computation.
FWIW for function/sync transforms, the system CPU time should be 0 or effectively 0.
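For reference, a Linux-only sketch of reading the user/system split via `getrusage(RUSAGE_THREAD)` with the `libc` crate; illustrative only, since the PR samples the combined figure via `clock_gettime`.

```rust
use std::time::Duration;

/// User and system CPU time consumed by the calling thread (Linux only).
#[cfg(target_os = "linux")]
fn thread_cpu_split() -> (Duration, Duration) {
    // SAFETY: a zeroed rusage is a valid out-parameter for getrusage.
    let mut usage: libc::rusage = unsafe { std::mem::zeroed() };
    let rc = unsafe { libc::getrusage(libc::RUSAGE_THREAD, &mut usage) };
    assert_eq!(rc, 0);
    let to_duration =
        |tv: libc::timeval| Duration::new(tv.tv_sec as u64, (tv.tv_usec as u32) * 1_000);
    (to_duration(usage.ru_utime), to_duration(usage.ru_stime))
}
```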
Summary

Alternative implementation of #25185. Same metric (`component_cpu_usage_ns_total`), same Tier 1/Tier 2 platform support; the difference is how CPU time is sampled for the concurrent transform path: it is now hooked onto the spawned task's `Future::poll` boundary via a thin `CpuTimedFuture` adapter, rather than measured inline inside the async block.

Within a single poll, tokio's cooperative scheduler guarantees the task cannot migrate to another worker thread and no other task can run on the current thread, so each `(before_poll, after_poll)` pair is a clean per-thread CPU measurement. Multi-poll futures accumulate correctly — which keeps the wrapper applicable if the spawned body ever grows `.await` points and makes the future extension to task transforms a one-line wrap. `run_inline` is unchanged: its body is sync and already runs in the transform's own task, so direct `ThreadTime` brackets remain the simplest correct option there.

See the RFC for more details.
Vector configuration
How did you test this PR?
Change Type
Is this a breaking change?
Does this PR include user facing changes?