Skip to content

feat(metrics): add CollectorSourceDiscardedLogs alert for discarded source logs#3293

Open
vparfonov wants to merge 2 commits into
openshift:masterfrom
vparfonov:log6384
Open

feat(metrics): add CollectorSourceDiscardedLogs alert for discarded source logs#3293
vparfonov wants to merge 2 commits into
openshift:masterfrom
vparfonov:log6384

Conversation

@vparfonov
Copy link
Copy Markdown
Contributor

@vparfonov vparfonov commented May 21, 2026

Description

Add a warning alert that fires when Vector source components discard logs (e.g. lines exceeding max_line_bytes).
The alert groups by namespace, app_kubernetes_io_instance, component_id, and component_type so users can identify the affected log stream.

/cc @Clee2691
/assign @jcantrill

Links

Summary by CodeRabbit

  • New Features

    • Added a monitoring alert (severity: warning) that notifies when collector sources discard logs due to oversized lines.
  • Tests

    • Added unit and functional tests to ensure the alert exists, targets source discards, and that the collector emits the discard metric when oversized lines occur.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 21, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: bf8e5b0a-5c79-45f0-ba27-d53e9a457be3

📥 Commits

Reviewing files that changed from the base of the PR and between e8649e5 and 82bba44.

📒 Files selected for processing (2)
  • internal/metrics/alerts_test.go
  • test/functional/metrics/discarded_metrics_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • test/functional/metrics/discarded_metrics_test.go
  • internal/metrics/alerts_test.go

📝 Walkthrough

Walkthrough

Adds a PrometheusRule alert for discarded source logs, moves the discarded-events metric into the alerts allowlist, and adds unit and functional tests validating the rule and that the collector emits the metric when lines exceed max_line_bytes.

Changes

Collector Source Discarded Logs

Layer / File(s) Summary
Metric Allowlist Configuration
internal/metrics/relabel.go
vector_component_discarded_events_total is moved into the "Metrics used in alerts" section of collectorMinimalAllowlist so alert expressions may reference it.
Alert Rule Definition
config/prometheus/collector_alerts.yaml, bundle/manifests/collector_monitoring.coreos.com_v1_prometheusrule.yaml
New CollectorSourceDiscardedLogs alert fires when irate(vector_component_discarded_events_total[2m]) for component_kind="source" > 0, with for: 1m, templated annotations referencing max_line_bytes, and labels service: collector, severity: warning.
Alert Validation Tests (unit)
internal/metrics/alerts_test.go
Ginkgo tests load the PrometheusRule YAML, assert metrics referenced by alert expressions exist in collectorMinimalAllowlist.allowedMetrics, and validate CollectorSourceDiscardedLogs expression contents, grouping labels, and severity: warning.
Functional metrics test
test/functional/metrics/discarded_metrics_test.go
Functional test configures collector max_line_bytes low, writes oversized audit log lines, creates necessary RBAC, and polls collector /metrics until vector_component_discarded_events_total with component_kind="source" appears.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I nibble lines too long and watch the counter climb,
A rule now listens, measures rate, and waits a patient time,
The allowlist opens up the gate, tests search the metrics' sign,
Tiny max_line_bytes makes discarded counts align,
Hop—alerts blink bright; the rabbit leaves a rhyme.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the main change: adding a new alert for discarded source logs, which matches the changeset's core objective.
Description check ✅ Passed The description addresses the mandatory sections including a clear summary of the intent, identifies the affected components, explains the rationale (log discards when exceeding max_line_bytes), and includes required /cc and /assign directives with a relevant JIRA link.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from Clee2691 and cahartma May 21, 2026 13:05
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 21, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: vparfonov
Once this PR has been reviewed and has the lgtm label, please assign xperimental for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/metrics/alerts_test.go`:
- Around line 67-75: The test that iterates rule.Spec.Groups and checks the
"CollectorSourceDiscardedLogs" rule is missing an assertion for the grouped
label app_kubernetes_io_instance; inside the loop where r.Alert ==
"CollectorSourceDiscardedLogs" (using the expr variable from r.Expr.String()),
add an expectation like
Expect(expr).To(ContainSubstring("app_kubernetes_io_instance")) alongside the
existing namespace/component_id/component_type assertions so the grouping
contract is fully asserted.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f2254321-9b6f-4e3f-b6ae-524497669203

📥 Commits

Reviewing files that changed from the base of the PR and between c5945ee and 8a204c7.

📒 Files selected for processing (4)
  • bundle/manifests/collector_monitoring.coreos.com_v1_prometheusrule.yaml
  • config/prometheus/collector_alerts.yaml
  • internal/metrics/alerts_test.go
  • internal/metrics/relabel.go

Comment thread internal/metrics/alerts_test.go Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/metrics/alerts_test.go`:
- Around line 67-89: The tests iterate rule.Spec.Groups and check properties
when a rule with Alert == "CollectorSourceDiscardedLogs" is found but never
assert that such a rule actually existed; update both It blocks in
internal/metrics/alerts_test.go to set a found flag (e.g., foundCollector :=
false) while scanning rule.Spec.Groups/Rules for r.Alert ==
"CollectorSourceDiscardedLogs", perform the existing assertions inside that
conditional, then after the loop add Expect(foundCollector).To(BeTrue()) (or
equivalent) so the test fails when the alert is missing; reference the
rule.Spec.Groups and the alert name "CollectorSourceDiscardedLogs" when adding
the flag and final assertion.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a62fb8e0-4737-4a08-9abf-6ca6f7136916

📥 Commits

Reviewing files that changed from the base of the PR and between 8a204c7 and 0b47d55.

📒 Files selected for processing (4)
  • bundle/manifests/collector_monitoring.coreos.com_v1_prometheusrule.yaml
  • config/prometheus/collector_alerts.yaml
  • internal/metrics/alerts_test.go
  • internal/metrics/relabel.go

Comment thread internal/metrics/alerts_test.go Outdated
@jcantrill
Copy link
Copy Markdown
Contributor

/hold

@openshift-ci openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 21, 2026
@jcantrill
Copy link
Copy Markdown
Contributor

Overall, looks good to me. @vparfonov Please confirm vector generates this metric

…ource logs

Add a warning alert that fires when Vector source components discard
logs (e.g. lines exceeding max_line_bytes). The alert groups by
namespace, component_id, and component_type so users can identify
the affected log stream.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@vparfonov
Copy link
Copy Markdown
Contributor Author

Overall, looks good to me. @vparfonov Please confirm vector generates this metric

added functional test

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/metrics/alerts_test.go`:
- Around line 21-24: The test currently ignores the error returned by os.Getwd()
when setting up mdir; change the setup to check and fail on that error (e.g.,
assert/Expect that err is nil) before using mdir so setup failures are surfaced
early — specifically update the block that assigns mdir and err from os.Getwd()
and then uses path.Dir/path.Join (variables mdir and err) to call os.ReadFile,
adding an explicit Expect(err).NotTo(HaveOccurred()) or equivalent assertion
right after os.Getwd() and before modifying mdir.

In `@test/functional/metrics/discarded_metrics_test.go`:
- Around line 94-103: Change the Eventually closure to capture and assert the
RunCommand error instead of ignoring it: call metrics, err :=
framework.RunCommand(constants.CollectorName, "sh", "-c", grepDiscardCmd) inside
the func, then Assert the command succeeded with Expect(err).To(BeNil()) (or
otherwise fail the test immediately) before returning metrics; keep the
subsequent Should assertions (e.g., ContainSubstring checks) the same so each
poll both verifies successful scrape/auth and inspects /metrics output.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 28386eb4-9cb5-49d6-a046-eb31d1d63d91

📥 Commits

Reviewing files that changed from the base of the PR and between 0b47d55 and 5c4b4af.

📒 Files selected for processing (5)
  • bundle/manifests/collector_monitoring.coreos.com_v1_prometheusrule.yaml
  • config/prometheus/collector_alerts.yaml
  • internal/metrics/alerts_test.go
  • internal/metrics/relabel.go
  • test/functional/metrics/discarded_metrics_test.go

Comment thread internal/metrics/alerts_test.go Outdated
Comment thread test/functional/metrics/discarded_metrics_test.go
Comment thread internal/metrics/alerts_test.go Outdated
rule *monitoringv1.PrometheusRule
)

BeforeEach(func() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably can change this to "BeforeAll" if all the subsequent tests rely on upon the same file

Comment thread internal/metrics/alerts_test.go Outdated
Expect(err).NotTo(HaveOccurred())
})

It("should parse as a valid PrometheusRule with alert groups", func() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be moved into the Before?

Comment thread internal/metrics/alerts_test.go Outdated
Write oversized and short audit log lines in a single atomic write
so Vector processes them together in one read pass. Vector only emits
component_discarded_events_total when a successfully-read line follows
discarded ones within the same read cycle.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@vparfonov
Copy link
Copy Markdown
Contributor Author

/retest

@vparfonov
Copy link
Copy Markdown
Contributor Author

/test e2e-target

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 27, 2026

@vparfonov: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@vparfonov vparfonov requested a review from jcantrill May 27, 2026 20:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants