OpenTelemetry metrics for ActivityPub fanout and retry behavior #742

@dahlia

Summary

Add ActivityPub-specific metrics for fanout, queued activity processing, and retry behavior. This should cover the parts of federation activity flow that are not captured by the core delivery attempt and duration metrics proposed in #619.

Current state

#619 covers the first layer of federation health metrics: outbound delivery attempts, delivery duration, permanent delivery failure, inbox processing duration, and signature verification failures.

Fedify still needs visibility into the work that happens around those delivery attempts. Fanout can expand one local activity into many remote inbox deliveries. Retries can turn one logical send into several queued attempts. Without metrics for those steps, operators can see that delivery is slow or failing, but not whether the pressure comes from fanout size, retry volume, or activity type mix.

Proposed solution

Once #619 adds metrics support, add metrics around ActivityPub fanout and retry behavior.

Proposed instruments:

  • activitypub.fanout.tasks: counter, incremented when a fanout task is processed.
  • activitypub.fanout.recipients: histogram, recording the number of target inboxes produced by a fanout task.
  • activitypub.outbox.retry: counter, incremented when Fedify schedules a retry for an outbound activity.
  • activitypub.inbox.activity: counter, incremented when an inbound activity is accepted for processing, rejected, queued, or processed.
  • activitypub.outbox.activity: counter, incremented when an outbound activity is queued, sent, retried, failed, or abandoned.
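As a rough sketch of how the fanout instruments might be recorded: the `Counter` and `Histogram` interfaces below are local stand-ins shaped like the OpenTelemetry API's `add(value, attributes)` and `record(value, attributes)` methods, and `MemorySink` and `recordFanout` are hypothetical names used only so the example runs without an SDK. Nothing here is existing Fedify code.

```typescript
type Attributes = Record<string, string | number | boolean>;

// Local stand-ins shaped like OpenTelemetry's Counter and Histogram.
interface Counter {
  add(value: number, attributes?: Attributes): void;
}

interface Histogram {
  record(value: number, attributes?: Attributes): void;
}

interface Sample {
  name: string;
  value: number;
  attributes: Attributes;
}

// Hypothetical in-memory sink; real code would obtain instruments from a Meter.
class MemorySink {
  readonly samples: Sample[] = [];

  counter(name: string): Counter {
    return {
      add: (value, attributes = {}) => {
        this.samples.push({ name, value, attributes });
      },
    };
  }

  histogram(name: string): Histogram {
    return {
      record: (value, attributes = {}) => {
        this.samples.push({ name, value, attributes });
      },
    };
  }
}

const sink = new MemorySink();
const fanoutTasks = sink.counter("activitypub.fanout.tasks");
const fanoutRecipients = sink.histogram("activitypub.fanout.recipients");

// Hypothetical hook: called once per processed fanout task.
function recordFanout(activityType: string, inboxCount: number): void {
  const attributes: Attributes = { "activitypub.activity.type": activityType };
  fanoutTasks.add(1, attributes); // one task processed
  fanoutRecipients.record(inboxCount, attributes); // how many inboxes it produced
}

recordFanout("Create", 250);
```

Recording the recipient count as a histogram rather than a counter keeps the distribution of fanout sizes visible, so a few very large fanouts can be told apart from many small ones.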

Proposed attributes:

  • activitypub.activity.type: low-cardinality ActivityStreams type name such as Create, Follow, Undo, or Announce.
  • activitypub.queue.role: inbox, outbox, or fanout.
  • activitypub.processing.result: queued, processed, sent, retried, rejected, failed, or abandoned.
  • activitypub.shared_inbox: whether a delivery target is a shared inbox, where available.
  • http.response.status_code, only for delivery failures with an HTTP response.

Avoid actor IDs, activity IDs, inbox URLs, and object IDs as metric attributes.
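One possible way to keep these attributes low-cardinality is sketched below. The allowlist contents, the `other` bucket for unrecognized types, the choice to emit short names, and the helper names `normalizeActivityType` and `outboxAttributes` are all assumptions for illustration; the short-name versus IRI question is still open. The sketch also assumes `http.response.status_code` is attached only when the result is `failed`.

```typescript
type Attributes = Record<string, string | number | boolean>;

// Assumed allowlist; the real set would be decided during implementation.
const KNOWN_TYPES = new Set([
  "Accept", "Announce", "Create", "Delete", "Follow",
  "Like", "Reject", "Undo", "Update",
]);
const AS_PREFIX = "https://www.w3.org/ns/activitystreams#";

// Map a short name or full ActivityStreams IRI to a bounded set of values.
function normalizeActivityType(type: string): string {
  const short = type.startsWith(AS_PREFIX)
    ? type.slice(AS_PREFIX.length)
    : type;
  return KNOWN_TYPES.has(short) ? short : "other";
}

type ProcessingResult =
  | "queued" | "processed" | "sent" | "retried"
  | "rejected" | "failed" | "abandoned";

interface DeliveryInfo {
  activityType: string;
  result: ProcessingResult;
  sharedInbox?: boolean; // omitted when unknown
  statusCode?: number; // omitted when there was no HTTP response
}

// Build attributes without actor IDs, activity IDs, inbox URLs, or object IDs.
function outboxAttributes(info: DeliveryInfo): Attributes {
  const attrs: Attributes = {
    "activitypub.activity.type": normalizeActivityType(info.activityType),
    "activitypub.queue.role": "outbox",
    "activitypub.processing.result": info.result,
  };
  if (info.sharedInbox !== undefined) {
    attrs["activitypub.shared_inbox"] = info.sharedInbox;
  }
  // Status code only for delivery failures that produced an HTTP response.
  if (info.result === "failed" && info.statusCode !== undefined) {
    attrs["http.response.status_code"] = info.statusCode;
  }
  return attrs;
}
```

Bucketing unknown types into a single value prevents extension or vendor-specific activity types from creating unbounded metric series.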

Scope

  • Add fanout task and recipient-count metrics to @fedify/fedify.
  • Add retry counters for Fedify-managed retries. Backends with native retrial should not be double-counted unless Fedify can observe the retry event.
  • Add inbox and outbox activity counters at the points where Fedify knows the activity type and processing result.
  • Do not duplicate the core delivery attempt, delivery duration, or permanent failure metrics from #619.
  • Update docs/manual/opentelemetry.md with the new instruments and attribute rules.
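The double-counting rule for retries could look roughly like this. `QueueLike`, its `nativeRetrial` flag, and `scheduleRetry` are hypothetical names modeled on the wording above, and the map stands in for the `activitypub.outbox.retry` counter; this is a sketch of the intent, not a description of Fedify's queue code.

```typescript
// Hypothetical queue shape: the flag signals that the backend retries by itself.
interface QueueLike {
  nativeRetrial?: boolean;
}

// Stand-in for the activitypub.outbox.retry counter, keyed by activity type.
const retryCounts = new Map<string, number>();

function countRetry(activityType: string): void {
  retryCounts.set(activityType, (retryCounts.get(activityType) ?? 0) + 1);
}

// Returns true when the retry was scheduled (and counted) by this code.
function scheduleRetry(queue: QueueLike, activityType: string): boolean {
  if (queue.nativeRetrial) {
    // The backend handles retries and they are not observable here,
    // so counting would either guess or double-count.
    return false;
  }
  countRetry(activityType);
  // ... re-enqueue the message with a backoff delay here ...
  return true;
}

scheduleRetry({}, "Create");
scheduleRetry({}, "Create");
scheduleRetry({ nativeRetrial: true }, "Follow");
```

With this shape, a backend that later exposes its own retry events could call `countRetry` from that event handler without changing the managed-retry path.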

Acceptance criteria

Open questions

  • Should inbox and outbox activity counters use separate instruments, or one instrument with activitypub.queue.role?
  • Should activity type use short names like Create, or full ActivityStreams IRIs?
  • How should native-retry backends expose retry counts, if at all?
