diff --git a/.hongdown.toml b/.hongdown.toml index 37c02dcc2..399e1c73c 100644 --- a/.hongdown.toml +++ b/.hongdown.toml @@ -89,10 +89,12 @@ proper_nouns = [ "ngrok", "Object Integrity Proofs", "OpenTelemetry", + "OpenTelemetry Collector", "Piefed", "Pixelfed", "Pleroma", "Podman Compose", + "Prometheus", "RabbitMQ", "Redis", "scrypt", diff --git a/docs/.vitepress/config.mts b/docs/.vitepress/config.mts index fef4a8dc4..7e3f512b1 100644 --- a/docs/.vitepress/config.mts +++ b/docs/.vitepress/config.mts @@ -159,6 +159,7 @@ const MANUAL = { { text: "Linting", link: "/manual/lint.md" }, { text: "Logging", link: "/manual/log.md" }, { text: "OpenTelemetry", link: "/manual/opentelemetry.md" }, + { text: "Monitoring", link: "/manual/monitoring.md" }, { text: "Benchmarking", link: "/manual/benchmarking.md" }, { text: "Deployment", link: "/manual/deploy.md" }, ], diff --git a/docs/manual/deploy.md b/docs/manual/deploy.md index 24cd88166..f0c95e3c5 100644 --- a/docs/manual/deploy.md +++ b/docs/manual/deploy.md @@ -1343,6 +1343,13 @@ signals: CPU, RSS, event-loop lag, GC pauses, connection pool utilization for your KV/MQ backend. None of these are Fedify-specific, but all of them should be in place before you take real traffic. +Fedify exposes each of these federation signals as an [OpenTelemetry +metric](./opentelemetry.md#instrumented-metrics). The [*Production +monitoring* guide](./monitoring.md) turns them into a starter dashboard and +a set of alert rules, with PromQL examples, guidance on which failures should +page versus prompt investigation, and notes on keeping metric cardinality +bounded. + ActivityPub-specific operational concerns ----------------------------------------- diff --git a/docs/manual/monitoring.md b/docs/manual/monitoring.md new file mode 100644 index 000000000..a6180e381 --- /dev/null +++ b/docs/manual/monitoring.md @@ -0,0 +1,539 @@ +--- +description: >- + A production monitoring guide for Fedify applications. 
Turns Fedify's + OpenTelemetry metrics into a first federation-health dashboard and a set of + alert rules, with guidance on metric cardinality and on where Fedify's + metrics end and the runtime, database, queue backend, and host platform + begin. +--- + +Production monitoring +===================== + +*The metrics this guide relies on are available since Fedify 2.3.0.* + +Federation failures are quiet. An outbox that falls behind, a remote server +that starts rejecting your signatures, a worker that stops draining the queue: +none of these necessarily trip a plain HTTP health check, and the trust-cache +divergence they cause between your server and its peers is hard to untangle +after the fact. The +[*Observability in production*](./deploy.md#observability-in-production) +section of the *Deployment* guide names the signals that matter. This guide +connects Fedify's +[OpenTelemetry metrics](./opentelemetry.md#instrumented-metrics) to the +questions an operator actually asks during an incident, and shows how to put +them on a dashboard and behind an alert. + +The examples use [Prometheus] and the [OpenTelemetry Collector] because they +are the integration points most backends share, not because Fedify prefers +them. Everything here applies to any backend that ingests OTLP or scrapes +Prometheus; where a vendor's setup begins, this guide stops and points you at +their documentation. + +[Prometheus]: https://prometheus.io/ +[OpenTelemetry Collector]: https://opentelemetry.io/docs/collector/ + + +Before you begin +---------------- + +This guide assumes metrics are already flowing out of your application. If +they are not, set up the OpenTelemetry SDK first; the [*OpenTelemetry* +chapter](./opentelemetry.md) covers the [`MeterProvider` +configuration](./opentelemetry.md#explicit-meterprovider-configuration) and the +[full list of instrumented metrics](./opentelemetry.md#instrumented-metrics), +their attributes, and their cardinality guarantees. 
On Deno 2.4 and later, +`OTEL_DENO=1` exports metrics without any manual SDK wiring. + +Two metrics are conditional, and a first dashboard should account for both: + +`fedify.queue.depth` +: Reported only when the queue backend implements + [`MessageQueue.getDepth()`](./mq.md#queue-depth-reporting). The Redis, + PostgreSQL, MySQL, SQLite, AMQP, and in-process backends report it; the + Deno KV and Cloudflare Workers backends return no reliable platform count, + so the gauge will be absent there. Where depth is unavailable, the + enqueue-versus-completion throughput comparison shown + [below](#queue-backlog) gives you the same falling-behind signal. + +`activitypub.document.fetch` and `activitypub.document.cache` +: Emitted only when you pass a `meterProvider` explicitly to + `createFederation()`, for the reason explained in the [*OpenTelemetry* + chapter](./opentelemetry.md#explicit-meterprovider-configuration). They do + not appear on the dashboard below, but they are useful when remote document + fetches dominate your inbox latency. + + +Getting metrics into Prometheus +------------------------------- + +### An OpenTelemetry Collector pipeline + +The Collector sits between your application and your metrics backend. Fedify +records the metrics; your application's OpenTelemetry SDK pushes them to the +Collector over OTLP, and the Collector either exposes a Prometheus scrape +endpoint or forwards the data onward over OTLP. A single pipeline can do both. + +~~~~ yaml [otel-collector-config.yaml] +receivers: + otlp: + protocols: + grpc: + endpoint: 0.0.0.0:4317 + http: + endpoint: 0.0.0.0:4318 + +processors: + batch: {} + +exporters: + # Expose a /metrics endpoint for Prometheus to scrape. + prometheus: + endpoint: 0.0.0.0:9464 + # add_metric_suffixes defaults to true; see the naming note below. + + # Or forward to any OTLP-speaking backend instead of (or as well as) scraping. 
+ otlphttp: + endpoint: https://otlp.your-backend.example + +service: + pipelines: + metrics: + receivers: [otlp] + processors: [batch] + exporters: [prometheus] # add otlphttp here to do both +~~~~ + +Point the application at the Collector with the standard environment +variable, and the SDK does the rest: + +~~~~ sh +OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318 +~~~~ + +Prometheus then scrapes the Collector at `otel-collector:9464`. A managed +backend (Grafana Cloud, Honeycomb, Datadog, and others) usually accepts OTLP +directly, in which case you swap the `prometheus` exporter for `otlphttp` and +skip the scrape entirely. Either way, the rest of this guide is the same; only +the names you type into the query bar differ, which is the subject of the next +section. + +### How the metric names appear once scraped + +OpenTelemetry metric names and Prometheus metric names are not spelled the +same way. When the Collector's `prometheus` exporter (or Prometheus's own OTLP +ingestion) translates them, three things happen with the default settings: + + - Dots become underscores, in both metric names and attribute (label) names. + `activitypub.remote.host` becomes the label `activitypub_remote_host`. + - The unit is appended to the name. The `ms` unit becomes a `_milliseconds` + suffix; annotation units written in curly braces (`{request}`, `{task}`, + `{message}`) are dropped, not appended. + - Counters gain a `_total` suffix, and each histogram expands into + `_bucket`, `_sum`, and `_count` series. 
+ +So the metrics you query look like this: + +| OpenTelemetry metric | Instrument | Prometheus time series | +| --------------------------------------------- | --------------- | ----------------------------------------------------------------------------- | +| `activitypub.delivery.sent` | counter | `activitypub_delivery_sent_total` | +| `activitypub.delivery.permanent_failure` | counter | `activitypub_delivery_permanent_failure_total` | +| `activitypub.delivery.duration` | histogram | `activitypub_delivery_duration_milliseconds_{bucket,sum,count}` | +| `activitypub.inbox.processing_duration` | histogram | `activitypub_inbox_processing_duration_milliseconds_{bucket,sum,count}` | +| `activitypub.signature.verification_failure` | counter | `activitypub_signature_verification_failure_total` | +| `activitypub.signature.verification.duration` | histogram | `activitypub_signature_verification_duration_milliseconds_{bucket,sum,count}` | +| `activitypub.signature.key_fetch.duration` | histogram | `activitypub_signature_key_fetch_duration_milliseconds_{bucket,sum,count}` | +| `fedify.queue.task.enqueued` | counter | `fedify_queue_task_enqueued_total` | +| `fedify.queue.task.completed` | counter | `fedify_queue_task_completed_total` | +| `fedify.queue.task.in_flight` | up down counter | `fedify_queue_task_in_flight` | +| `fedify.queue.depth` | gauge | `fedify_queue_depth` | + +> [!NOTE] +> The exact names depend on how your pipeline is configured. Disabling unit +> and type suffixes on the Collector's `prometheus` exporter drops the `_total` +> and `_milliseconds` segments, and a non-default name-translation strategy +> (the ones that preserve UTF-8 names) can keep the dots instead of converting +> them to underscores. When a query returns nothing, check the actual series +> names against the Collector's `/metrics` output or your backend's metric +> explorer before assuming the metric is missing. The examples below assume +> the default translation. 
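As a rough illustration of the default translation described in this section, the TypeScript sketch below maps an OpenTelemetry metric name, unit, and instrument kind to the series names you would expect to query. The helper name `toPrometheusSeries`, its one-entry unit table, and the unit strings passed in the sample calls are assumptions made for this example. They are not part of Fedify or the Collector, so treat the result as a guess to confirm against the Collector's `/metrics` output.

```typescript
// Hypothetical helper: approximates the default OpenTelemetry-to-Prometheus
// name translation for the cases covered in this guide only.
type Instrument = "counter" | "updowncounter" | "gauge" | "histogram";

// Assumption: only the `ms` unit is handled; real translators know many more.
const UNIT_SUFFIXES: Record<string, string> = { ms: "milliseconds" };

function toPrometheusSeries(
  name: string,
  unit: string,
  instrument: Instrument,
): string[] {
  // Dots become underscores.
  let base = name.replace(/\./g, "_");
  // Annotation units such as "{task}" are dropped; known units are appended.
  const suffix = unit.startsWith("{") ? undefined : UNIT_SUFFIXES[unit];
  if (suffix !== undefined && !base.endsWith(`_${suffix}`)) {
    base += `_${suffix}`;
  }
  switch (instrument) {
    case "counter":
      // Counters gain a `_total` suffix.
      return [`${base}_total`];
    case "histogram":
      // Each histogram expands into three series.
      return [`${base}_bucket`, `${base}_sum`, `${base}_count`];
    default:
      // Gauges and up-down counters keep the bare name.
      return [base];
  }
}

// The unit strings below are illustrative assumptions.
console.log(toPrometheusSeries("fedify.queue.task.enqueued", "{task}", "counter"));
console.log(toPrometheusSeries("activitypub.delivery.duration", "ms", "histogram"));
console.log(toPrometheusSeries("fedify.queue.depth", "{message}", "gauge"));
```

The sketch does not model disabled suffixes or UTF-8 name preservation, which are the cases the note above warns about.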
+ + +A first federation dashboard +---------------------------- + +Six panels are enough for a first pass at federation health. Each one answers +a question you would otherwise have to reconstruct from traces or logs after +something has already gone wrong. + +### Queue backlog + +*Are outgoing and incoming activities draining as fast as they arrive?* + +Where the backend reports depth, plot `fedify_queue_depth` for the `queued` +state, broken out by role. The `queued` state is the total of waiting +messages, so query it alone rather than summing `queued`, `ready`, and +`delayed`, which would count the same backlog more than once: + +~~~~ promql +max by (fedify_queue_role) (fedify_queue_depth{fedify_queue_depth_state="queued"}) +~~~~ + +Use `max` here, not `sum`. When several observers report the same queue, +whether that is multiple replicas behind a shared Redis or PostgreSQL backend +or several `Federation` instances sharing one `MeterProvider`, each one reads +the backend's full depth rather than a private shard. Summing multiplies the +backlog by the number of observers and makes every depth alert page early; +`max` reads the true depth. Sum only when each instance owns a separate queue +backend. + +Pair it with how many tasks each process is actively working, which is a +gauge-like UpDownCounter and is reported per process, so sum it across replicas: + +~~~~ promql +sum by (fedify_queue_role) (fedify_queue_task_in_flight) +~~~~ + +When the backend reports no depth (Deno KV, Cloudflare Workers), or as a +second opinion when it does, watch the throughput balance instead. Enqueue +rate running consistently above completion rate is the definition of falling +behind: + +~~~~ promql +sum by (fedify_queue_role) (rate(fedify_queue_task_enqueued_total[5m])) + - sum by (fedify_queue_role) (rate(fedify_queue_task_completed_total[5m])) +~~~~ + +A backlog that empties during quiet periods is healthy. 
One that never +returns to zero overnight means you are permanently behind and need more +worker capacity or a faster backend, not a higher alert threshold. + +### Inbox processing latency + +*How long does it take to finish the side effects of an incoming activity?* + +`activitypub.inbox.processing_duration` measures the listener's own work. Read +it as a high percentile rather than an average. When an inbox `queue` is +configured, that work runs in the queue worker after Fedify has already +answered the remote with `202 Accepted`, so a slow tail here means slow side +effects, not remote servers waiting on you. The latency a remote actually +experiences lives on `fedify.http.server.request.duration` for the inbox +endpoints; only with inline (no-queue) listeners do the two coincide. + +~~~~ promql +histogram_quantile( + 0.95, + sum by (le) (rate(activitypub_inbox_processing_duration_milliseconds_bucket[5m])) +) +~~~~ + +Spikes here usually trace back to one of two causes: a queue backlog upstream, +or a slow dependency inside the listener (a database write, a remote key fetch +during signature verification). The signature-latency panel below helps +separate the second case from the first. 
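To make the percentile reading concrete, here is a simplified TypeScript sketch of the kind of estimate `histogram_quantile` produces from cumulative buckets: it finds the bucket that contains the target rank and interpolates linearly inside it. The function name `estimateQuantile`, the bucket boundaries, and the counts are made up for illustration. The sketch assumes a lower bound of zero for the first bucket and leaves out cases Prometheus handles, such as native histograms and non-monotonic bucket counts.

```typescript
// Simplified sketch of quantile estimation over cumulative ("le") buckets.
interface Bucket {
  le: number; // upper bound in milliseconds; Infinity for the +Inf bucket
  count: number; // cumulative number of observations <= le
}

function estimateQuantile(q: number, buckets: Bucket[]): number {
  if (q < 0 || q > 1) throw new RangeError("q must be between 0 and 1");
  const sorted = buckets.slice().sort((a, b) => a.le - b.le);
  if (sorted.length === 0) return NaN;
  const total = sorted[sorted.length - 1].count;
  if (total === 0) return NaN;

  // The rank of the observation we are looking for.
  const rank = q * total;
  const i = sorted.findIndex((b) => b.count >= rank);
  const bucket = sorted[i];

  // A rank inside the +Inf bucket can only be reported as the highest
  // finite upper bound.
  if (bucket.le === Infinity) return i > 0 ? sorted[i - 1].le : NaN;

  const lowerBound = i > 0 ? sorted[i - 1].le : 0;
  const lowerCount = i > 0 ? sorted[i - 1].count : 0;
  const inBucket = bucket.count - lowerCount;
  if (inBucket === 0) return bucket.le;

  // Linear interpolation within the bucket.
  return lowerBound + (bucket.le - lowerBound) * ((rank - lowerCount) / inBucket);
}

// Made-up inbox processing durations: most activities finish quickly,
// a small tail does not.
const sample: Bucket[] = [
  { le: 100, count: 60 },
  { le: 500, count: 90 },
  { le: 1000, count: 98 },
  { le: Infinity, count: 100 },
];

console.log(estimateQuantile(0.5, sample)); // lands in the first bucket
console.log(estimateQuantile(0.95, sample)); // lands in the 500-1000 ms bucket
```

With these made-up numbers the median sits under 100 ms while the 95th percentile sits above 500 ms, which is the gap an average would hide. Because the estimate is interpolated, its accuracy depends on how closely the bucket boundaries bracket the latencies you care about.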
+ +### Outbound delivery attempts + +*How much delivery work is happening, and how much of it succeeds?* + +`activitypub.delivery.sent` counts every per-recipient attempt and carries an +`activitypub_delivery_success` label, so one expression gives you both volume +and the success split: + +~~~~ promql +sum by (activitypub_delivery_success) (rate(activitypub_delivery_sent_total[5m])) +~~~~ + +### Outbound delivery failure rate + +*What fraction of delivery attempts are failing right now?* + +The failed-attempt fraction is the per-attempt complement of the success rate +that the *Deployment* guide calls out as a core federation signal: + +~~~~ promql +sum(rate(activitypub_delivery_sent_total{activitypub_delivery_success="false"}[5m])) + / sum(rate(activitypub_delivery_sent_total[5m])) +~~~~ + +Keep this distinct from permanent failures. A failed attempt is usually +transient and will be retried; the next panel counts only the deliveries a +remote rejected with a permanent-failure status. A failure fraction that +climbs from a few percent toward a fifth or more, across many remote hosts at +once, points at your own outbound path (DNS, egress, a misconfigured proxy) +rather than at any single peer. + +### Permanent delivery failures + +*Which deliveries did a remote reject with a permanent-failure status?* + +`activitypub.delivery.permanent_failure` increments once per recipient that a +remote rejected with a permanent-failure status, with that status code +attached: + +~~~~ promql +sum by (http_response_status_code) ( + rate(activitypub_delivery_permanent_failure_total[5m]) +) +~~~~ + +The `404` and `410` rows are the fediverse's normal background churn (see the +[alerting section](#spikes-in-remote-404-and-410-responses) for why they rarely +deserve a page). Other codes are worth a closer look: a sustained band of +permanent failures on an unusual status often means one large instance has +changed how it rejects you. 
+ +This counter only sees deliveries a remote rejected with a permanent-failure +status code (`404` and `410` by default, plus anything you add to +`~FederationOptions.permanentFailureStatusCodes`). Deliveries Fedify abandons +after its outbox retry policy exhausts on transport errors or transient `5xx` +responses land on `activitypub.outbox.activity` with +`activitypub.processing.result="abandoned"` instead. Add that series to see +every dropped delivery, not just the status-coded ones: + +~~~~ promql +sum(rate(activitypub_outbox_activity_total{activitypub_processing_result="abandoned"}[5m])) +~~~~ + +### Signature verification latency + +*How long does verifying an inbound signature take, and where does the time +go?* + +`activitypub.signature.verification.duration` covers the whole verification +path, including any remote key fetch, and splits cleanly by signature kind: + +~~~~ promql +histogram_quantile( + 0.95, + sum by (le, activitypub_signature_kind) + (rate(activitypub_signature_verification_duration_milliseconds_bucket[5m])) +) +~~~~ + +If the total looks slow, compare it against +`activitypub_signature_key_fetch_duration_milliseconds_bucket`, which isolates +the key-lookup portion. When key fetches dominate, the problem is a slow or +flaky remote key host or a cold key cache, not your verification code. + + +Alerting +-------- + +The thresholds below are starting points, not defaults. The right number for +a queue backlog or a latency percentile depends on your traffic shape, your +worker count, and how much delay your users tolerate, and the only way to find +it is to watch the dashboard for a week or two first. Treat every figure here +as a placeholder to replace once you know what normal looks like on your +server. + +Examples are written as Prometheus alerting rules. The expressions translate +directly to any backend with a comparable rule language. 
+ +### Growing queue backlog + +A queue that is falling behind is the earliest warning that worker capacity +cannot keep up. Alert on the throughput deficit rather than an absolute depth, +because the deficit works on every backend and does not need retuning when +traffic grows: + +~~~~ yaml +- alert: FedifyQueueFallingBehind + expr: | + sum by (fedify_queue_role) (rate(fedify_queue_task_enqueued_total[10m])) + - ( + sum by (fedify_queue_role) (rate(fedify_queue_task_completed_total[10m])) + or sum by (fedify_queue_role) (rate(fedify_queue_task_enqueued_total[10m])) * 0 + ) + > 0 + for: 30m + annotations: + summary: "Fedify {{ $labels.fedify_queue_role }} queue is not draining" +~~~~ + +The `or … * 0` term is not decoration. When a role's workers stall outright, +its `fedify_queue_task_completed_total` series can stop existing, and a plain +`enqueued > completed` comparison would then match nothing and stay silent in +exactly the case you most want to catch. Substituting a zero-valued series +keeps the role in the result so the deficit still fires. The `for: 30m` clause +does the rest of the work: short bursts where enqueues briefly outpace +completions are normal under load, and you only want to hear about a deficit +that persists long enough to mean the queue will not recover on its own. Where +the backend reports depth, an absolute +`fedify_queue_depth{fedify_queue_depth_state="queued"}` ceiling makes a useful +second alert once you know your steady-state depth. 
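The effect of the `or … * 0` fallback can be sketched outside PromQL. The TypeScript below models each side of the subtraction as a map from queue role to rate and imitates only the label-matching behavior that matters here: arithmetic keeps roles present on both sides, and `or` fills in roles missing from its left side. The role names and rates are made up for illustration, and the helper names are hypothetical.

```typescript
// queue role -> rate in tasks per second
type Series = Map<string, number>;

// Binary arithmetic keeps only the label sets present on both sides.
function subtract(left: Series, right: Series): Series {
  const out: Series = new Map();
  left.forEach((value, role) => {
    const other = right.get(role);
    if (other !== undefined) out.set(role, value - other);
  });
  return out;
}

// `a or b`: everything in a, plus entries of b whose labels are absent from a.
function or(a: Series, b: Series): Series {
  const out: Series = new Map(a);
  b.forEach((value, role) => {
    if (!out.has(role)) out.set(role, value);
  });
  return out;
}

// `s * 0`: the same label sets, every value replaced by zero.
function zeroed(s: Series): Series {
  const out: Series = new Map();
  s.forEach((_value, role) => out.set(role, 0));
  return out;
}

// Made-up rates: the "outbox" workers have stalled, so that role has no
// completion series at all.
const enqueued: Series = new Map([["inbox", 4], ["outbox", 6]]);
const completed: Series = new Map([["inbox", 4]]);

const naive = subtract(enqueued, completed);
const guarded = subtract(enqueued, or(completed, zeroed(enqueued)));

console.log(naive.has("outbox")); // the stalled role drops out of the result
console.log(guarded.get("outbox")); // the stalled role keeps its full deficit
```

In the naive form the stalled role vanishes from the result, so a `> 0` comparison has nothing to match. In the guarded form the same role carries its whole enqueue rate as the deficit.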
+ +### Outbound delivery failure spike + +A failure fraction that stays high across many peers indicates a problem on +your side of the network: + +~~~~ yaml +- alert: FedifyOutboundDeliveryFailing + expr: | + sum(rate(activitypub_delivery_sent_total{activitypub_delivery_success="false"}[5m])) + / sum(rate(activitypub_delivery_sent_total[5m])) + > 0.2 + for: 10m + annotations: + summary: "Over 20% of outbound delivery attempts are failing" +~~~~ + +### Sustained inbox latency + +A single slow request is noise; a high percentile that stays elevated means +side-effect processing is backing up, usually behind a slow database write or +a remote key fetch during verification. Behind an inbox queue this latency is +decoupled from what remote servers wait on, so pair it with a +`fedify.http.server.request.duration` alert on the inbox endpoints to catch +remote-facing slowness too: + +~~~~ yaml +- alert: FedifyInboxLatencyHigh + expr: | + histogram_quantile(0.95, + sum by (le) (rate(activitypub_inbox_processing_duration_milliseconds_bucket[5m])) + ) > 2000 + for: 15m + annotations: + summary: "Inbox processing p95 above 2s for 15 minutes" +~~~~ + +### Spikes in remote 404 and 410 responses + +`404 Not Found` and `410 Gone` from remote inboxes are ordinary fediverse +behavior: accounts get deleted, instances shut down, paths change. Fedify's +default `~FederationOptions.permanentFailureStatusCodes` already stops retrying +them, so a steady trickle needs no human at all. A *spike* is worth knowing +about, because it usually means a large instance you federate with has gone +away or restructured its URLs, and you may want to prune orphaned follower +records. 
Route this to a ticket or a chat channel, not to a pager: + +~~~~ yaml +- alert: FedifyRemoteGoneSpike + expr: | + sum(increase(activitypub_delivery_permanent_failure_total{ + http_response_status_code=~"404|410" + }[1h])) > 50 + labels: + severity: ticket + annotations: + summary: "Elevated 404/410 from remote inboxes; check for a departed instance" +~~~~ + +The one-hour lookback is deliberate. When a large instance disappears, Fedify +records a short burst of `404`/`410` permanent failures and then stops retrying +them, so a narrow window paired with a long `for` clause would let the burst +age out before the alert ever became eligible to fire. Counting over a full +hour with no `for` catches the burst, then clears itself once it ages out. The +`severity: ticket` label keeps it off the pager: nothing here is broken on your +server, and this is an invitation to investigate, not an incident. + +### Signature verification failures + +A failed signature verification means Fedify rejected an inbound activity. A +handful from one misbehaving remote is expected. A broad, sudden rise across +many peers usually has a cause on your end: clock drift pushing signatures +outside `~FederationOptions.signatureTimeWindow` (see [*Handling inbound +failures*](./deploy.md#handling-inbound-failures) in the *Deployment* guide), or +an actor key that was rotated without keeping the old key served during the +transition. Break the alert down by reason so the two cases stay separable: + +~~~~ yaml +- alert: FedifySignatureVerificationFailures + expr: | + sum by (activitypub_verification_failure_reason) ( + increase(activitypub_signature_verification_failure_total[5m]) + ) > 10 + for: 15m + annotations: + summary: "Sustained signature verification failures ({{ $labels.activitypub_verification_failure_reason }})" +~~~~ + +A `keyFetchError` reason points outward, at a remote key host you could not +reach. 
A signature mismatch that suddenly affects everyone points inward, at +your clock or your keys, and is the one to escalate. + + +Keeping metric cardinality bounded +---------------------------------- + +High metric cardinality is a real hazard in federation code, because the raw +material (actor IDs, object IDs, inbox URLs, remote URLs) is unbounded and +attacker-influenced. Fedify's metrics are designed to stay bounded: they never +attach a raw URL, actor ID, object ID, or inbox URL as a label, and the +attributes they do attach come from small fixed enumerations. The relevant +work for a dashboard or alert author is mostly to not undo that. + +`activitypub_remote_host` is the one label whose *set of values* grows with the +fediverse. Fedify normalizes each value to a hostname plus any non-default +port, with no path or query string, so a single remote cannot create more than +one series. The number of remote hosts you talk to, though, is as large as +your federation graph. Aggregate this label away by default, and break it out +only when you are investigating a specific problem: + +~~~~ promql +# For a dashboard: total, host-independent. +sum(rate(activitypub_delivery_permanent_failure_total[5m])) + +# For an investigation: the ten worst hosts, bounded by topk. +topk(10, sum by (activitypub_remote_host) ( + rate(activitypub_delivery_permanent_failure_total{http_response_status_code=~"404|410"}[1h]) +)) +~~~~ + +`activitypub_activity_type` is bounded in practice to the ActivityStreams +vocabulary, but the value originates in remote-supplied documents. If you ever +see its series count climb (an instance probing you with unusual or extension +types, for example), aggregate it away in the affected panels or drop it with a +`metric_relabel_configs` rule at scrape time. + +The same discipline applies to anything you build on top of these metrics. 
+Recording rules, relabeling, and derived metrics should never reintroduce an +identifier or URL that Fedify deliberately kept out. When you need the full +URL, actor ID, or key ID to debug a specific event, it is on the corresponding +[span](./opentelemetry.md#instrumented-spans), where sampling keeps the +cardinality cost contained, not on the metric. + + +Where Fedify's metrics stop +--------------------------- + +Fedify instruments federation: delivery, inbox and outbox processing, +signatures, key and document lookups, collections, WebFinger, and its own queue +workers. It does not, and should not, measure the layers beneath it. A +complete production view needs those layers too, from sources Fedify has no part +in: + +Process and runtime +: CPU, resident memory, heap usage, event-loop lag, and garbage-collection + pauses. These come from runtime instrumentation: + `@opentelemetry/instrumentation-runtime-node` on Node.js, the built-in + exporter on Deno (`OTEL_DENO=1`), and the equivalent for Bun. + +Database and cache backend +: Connection-pool saturation, PostgreSQL query latency, Redis command + latency. A pool exhausted behind your KV store or message queue looks, + from Fedify's side, exactly like a slow queue; you need the backend's own + metrics (from `postgres_exporter`, `redis_exporter`, or the driver's + instrumentation) to tell the two apart. + +Queue backend internals +: `fedify.queue.depth` reports what the backend tells Fedify through + `getDepth()`. The broker's own view (RabbitMQ's management metrics, + Redis keyspace stats, a cloud queue's console) is separate, often richer, + and the place to look when depth alone does not explain a stall. + +Host and platform +: Disk, network, container CPU and memory limits. These come from a host + metrics agent (`node_exporter`, the Collector's `hostmetrics` receiver, + cAdvisor) or from your platform's built-in monitoring. + +The Collector is a convenient place to gather several of these at once. 
Adding +a `hostmetrics` receiver to the pipeline above, alongside `otlp`, pulls host +signals through the same export path as Fedify's application metrics, so they +land in one backend and one dashboard. + +Get them in place before you serve real traffic. The [*Deployment* +guide](./deploy.md#observability-in-production) folds them into the same +pre-launch checklist as the federation signals on this page. diff --git a/docs/manual/opentelemetry.md b/docs/manual/opentelemetry.md index b0d7476bf..3ba5bd6c1 100644 --- a/docs/manual/opentelemetry.md +++ b/docs/manual/opentelemetry.md @@ -919,6 +919,12 @@ metric retains the matched endpoint (for example `actor`) so that fault-attribution stays per endpoint; `error` is only used when classification itself failed. +For turning these metrics into a production dashboard and alert rules, see the +[*Production monitoring* guide](./monitoring.md). It maps the metrics above to +the federation-health questions operators ask, with PromQL examples, the +OpenTelemetry-to-Prometheus naming translation, and cardinality guidance for +dashboard and alert authors. + [URI Template]: https://datatracker.ietf.org/doc/html/rfc6570