
Example Grafana dashboard and Prometheus alert rules for Fedify metrics #745

@dahlia

Summary

Add runnable monitoring examples for Fedify's OpenTelemetry metrics, including a Grafana dashboard, Prometheus alert rules, and a minimal OpenTelemetry Collector setup.

The production monitoring guide in #743 should explain how operators think about Fedify metrics. This issue should provide files that users can copy, run, and adapt. The examples should make the Milestone 6 dashboard deliverable concrete by showing panels for activity delivery success rates, peer connectivity health, queue backlog, signature verification latency, inbox processing latency, and resource utilization patterns.

A likely location is examples/monitoring/, since these files are meant to be executable examples rather than prose-only documentation.

Problem

Fedify's planned metrics work will expose the raw signals needed for production observability, but raw metric names are not enough for many users. Operators still need to know how those metrics fit together in a dashboard, which alert rules are safe starting points, and how to wire OpenTelemetry metrics into a common Prometheus/Grafana setup.

The documentation issue #743 covers the written guide, but it should not have to carry exported dashboard files and runnable configuration. Keeping the example files in a separate issue gives us a clearer deliverable and makes it easier to test that the example setup still starts.

Without these examples, each deployment has to rebuild the same baseline monitoring setup from scratch, including choices that are easy to get wrong:

  • Which labels are safe to group by without creating unbounded cardinality.
  • Which delivery failures should page someone and which should only create investigation alerts.
  • How to distinguish Fedify metrics from runtime, database, queue backend, and host metrics.
  • How to connect OpenTelemetry Collector, Prometheus, and Grafana in a minimal local stack.
  • How to visualize queue backlog and delivery failures without exposing actor IDs, inbox URLs, object IDs, or raw remote URLs as labels.
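
To make the cardinality point concrete, a safe delivery success-rate query groups only by bounded labels. The metric and label names below are placeholders standing in for whatever the metrics implementation issues finalize, not real Fedify names:

```promql
# Success rate over 5 minutes, grouped only by the bounded `status` label.
sum(rate(fedify_delivery_attempts_total{status="success"}[5m]))
  /
sum(rate(fedify_delivery_attempts_total[5m]))

# Anti-pattern: grouping by an unbounded label creates one time series per
# remote inbox and can exhaust Prometheus memory.
# sum by (inbox_url) (rate(fedify_delivery_attempts_total[5m]))
```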

Proposed solution

Add a monitoring example under examples/monitoring/.

One possible file layout:

examples/monitoring/
  README.md
  docker-compose.yml
  otel-collector.yml
  prometheus.yml
  prometheus-rules.yml
  grafana/
    dashboards/
      fedify-overview.json
    provisioning/
      dashboards.yml
      datasources.yml
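
As one sketch of how docker-compose.yml could wire these files together (image tags, ports, and mount paths here are illustrative assumptions, and prometheus.yml would need a scrape job pointing at the collector):

```yaml
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-collector.yml"]
    volumes:
      - ./otel-collector.yml:/etc/otel-collector.yml:ro
    ports:
      - "4317:4317"  # OTLP/gRPC from the Fedify app
      - "4318:4318"  # OTLP/HTTP from the Fedify app

  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus-rules.yml:/etc/prometheus/rules.yml:ro
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana-oss:latest
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
      - ./grafana/dashboards:/var/lib/grafana/dashboards:ro
    ports:
      - "3000:3000"
```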

The example should run a local observability stack that can receive or scrape metrics from a Fedify application:

docker compose -f examples/monitoring/docker-compose.yml up

The OpenTelemetry Collector config should expose a common path for local development, for example OTLP input over gRPC on localhost:4317 and over HTTP on localhost:4318, plus a Prometheus-compatible endpoint for Prometheus to scrape. The exact setup can change if the metrics implementation chooses a different recommended path, but the example should stay vendor-neutral and avoid managed-service-specific configuration.
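
A minimal otel-collector.yml along those lines might look like the following; the 8889 exporter port is an assumption and would have to match the Prometheus scrape config:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889  # Prometheus scrapes this endpoint

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
```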

The Grafana dashboard should include a baseline Fedify overview with panels such as:

  • HTTP request rate and latency by Fedify endpoint.
  • Inbox processing latency and error rate.
  • Outbound delivery attempts, success rate, retry rate, and permanent failures.
  • Queue depth by queue role and state.
  • Queue task processing duration and failure rate.
  • Signature verification duration and failure rate.
  • WebFinger and actor discovery latency and failure rate.
  • Remote document/key lookup latency and failure rate.
  • Peer connectivity health by bounded remote host attribute, where the metric design supports it.
  • Resource utilization context from runtime, process, database, or queue backend metrics, clearly separated from Fedify-owned metrics.
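
To have Grafana load such a dashboard automatically, file-based provisioning can point at the dashboards directory; the provider name and path below are illustrative, not finalized:

```yaml
# grafana/provisioning/dashboards.yml (sketch)
apiVersion: 1
providers:
  - name: fedify-dashboards
    type: file
    options:
      path: /var/lib/grafana/dashboards
```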

The Prometheus alert rules should provide starting points for common production symptoms:

  • Growing queue backlog.
  • Sustained inbox processing latency.
  • Outbound delivery failure spikes.
  • Permanent delivery failure spikes.
  • Sudden increase in signature verification failures.
  • Sustained discovery or key lookup failures.
  • Sustained 404/410 increases from remote delivery attempts, marked as investigation alerts rather than paging alerts by default.
  • Missing Fedify metrics from an expected target.
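
An example rule in prometheus-rules.yml could look like this; the metric name and threshold are placeholders to be replaced once the metrics issues settle the real names:

```yaml
groups:
  - name: fedify-examples
    rules:
      - alert: FedifyQueueBacklogGrowing
        # `fedify_queue_depth` is a placeholder, not a finalized metric name.
        expr: avg_over_time(fedify_queue_depth[10m]) > 100
        for: 15m
        labels:
          severity: investigation  # example threshold, tune per deployment
        annotations:
          summary: "Queue backlog has stayed above 100 tasks for 15 minutes"
          description: "Sustained backlog usually means tasks are enqueued faster than workers process them."
```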

The alert thresholds should be explicitly labeled as examples, not defaults suitable for every deployment.

The example README.md should explain how to run the stack, how to point a local Fedify app at it, and how the example relates to #743. It should also explain that the dashboard and rules depend on the metric names finalized under #316 and #619, plus the follow-up metrics issues #735 through #742.

Scope

  • Add runnable monitoring example files, not just documentation prose.
  • Include a Grafana dashboard JSON file.
  • Include Prometheus alert rule examples.
  • Include a minimal OpenTelemetry Collector and Prometheus setup.
  • Include Grafana provisioning files so the dashboard loads automatically in the local example stack.
  • Keep the examples vendor-neutral beyond OpenTelemetry Collector, Prometheus, and Grafana.
  • Use only stable metric names and bounded attributes finalized by the metrics implementation issues.
  • Make resource utilization panels clearly separate from Fedify-owned metrics.
  • Link from the production monitoring guide (#743) once both pages/files exist.
  • Do not invent metric names before the implementation issues settle them.
  • Do not include secrets, hosted backend tokens, or vendor-specific endpoints.

Acceptance criteria

  • examples/monitoring/ contains a runnable local monitoring stack.
  • The example includes Grafana dashboard JSON for Fedify metrics.
  • The example includes Prometheus alert rules for common Fedify production failure modes.
  • The example includes OpenTelemetry Collector and Prometheus configuration.
  • Grafana loads the Fedify dashboard automatically when the example stack starts.
  • The dashboard covers delivery success, peer/discovery health, queue backlog, signature verification, HTTP request performance, and resource utilization context.
  • Alert rules avoid unbounded labels such as actor IDs, object IDs, inbox URLs, full URLs, and raw route parameter values.
  • Alert annotations explain what the alert means and whether it is intended as paging or investigation guidance.
  • The example README.md shows how to run the stack and how to configure a Fedify app to export metrics to it.
  • The monitoring guide from #743 links to the example files.
  • The example is tested enough to catch broken Docker Compose, invalid Prometheus rules, invalid Collector config, or invalid Grafana provisioning.

Open questions

  • Should the files live under examples/monitoring/ or docs/examples/monitoring/?
  • Should the first dashboard be a single overview dashboard, or should delivery, queues, discovery, and resource utilization be separate dashboards?
  • Should the example include a minimal Fedify app that emits sample metrics, or should it assume the user already has a Fedify app running?
  • Should we include generated screenshots in the documentation, or keep the repository source limited to runnable configs?
  • Should alert rules use only PromQL, or should we also include OpenTelemetry Collector transform/aggregation examples?
  • Should the example stack be covered by mise test:examples, a separate monitoring validation task, or lightweight config validation in CI?
