feat(metrics): add CollectorSourceDiscardedLogs alert for discarded source logs#3293
feat(metrics): add CollectorSourceDiscardedLogs alert for discarded source logs#3293vparfonov wants to merge 2 commits into
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughWalkthroughAdds a PrometheusRule alert for discarded source logs, moves the discarded-events metric into the alerts allowlist, and adds unit and functional tests validating the rule and that the collector emits the metric when lines exceed max_line_bytes. ChangesCollector Source Discarded Logs
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: vparfonov The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/metrics/alerts_test.go`:
- Around line 67-75: The test that iterates rule.Spec.Groups and checks the
"CollectorSourceDiscardedLogs" rule is missing an assertion for the grouped
label app_kubernetes_io_instance; inside the loop where r.Alert ==
"CollectorSourceDiscardedLogs" (using the expr variable from r.Expr.String()),
add an expectation like
Expect(expr).To(ContainSubstring("app_kubernetes_io_instance")) alongside the
existing namespace/component_id/component_type assertions so the grouping
contract is fully asserted.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: f2254321-9b6f-4e3f-b6ae-524497669203
📒 Files selected for processing (4)
bundle/manifests/collector_monitoring.coreos.com_v1_prometheusrule.yamlconfig/prometheus/collector_alerts.yamlinternal/metrics/alerts_test.gointernal/metrics/relabel.go
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/metrics/alerts_test.go`:
- Around line 67-89: The tests iterate rule.Spec.Groups and check properties
when a rule with Alert == "CollectorSourceDiscardedLogs" is found but never
assert that such a rule actually existed; update both It blocks in
internal/metrics/alerts_test.go to set a found flag (e.g., foundCollector :=
false) while scanning rule.Spec.Groups/Rules for r.Alert ==
"CollectorSourceDiscardedLogs", perform the existing assertions inside that
conditional, then after the loop add Expect(foundCollector).To(BeTrue()) (or
equivalent) so the test fails when the alert is missing; reference the
rule.Spec.Groups and the alert name "CollectorSourceDiscardedLogs" when adding
the flag and final assertion.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: a62fb8e0-4737-4a08-9abf-6ca6f7136916
📒 Files selected for processing (4)
bundle/manifests/collector_monitoring.coreos.com_v1_prometheusrule.yamlconfig/prometheus/collector_alerts.yamlinternal/metrics/alerts_test.gointernal/metrics/relabel.go
|
/hold |
|
Overall, looks good to me. @vparfonov Please confirm vector generates this metric |
…ource logs Add a warning alert that fires when Vector source components discard logs (e.g. lines exceeding max_line_bytes). The alert groups by namespace, component_id, and component_type so users can identify the affected log stream. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
added functional test |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/metrics/alerts_test.go`:
- Around line 21-24: The test currently ignores the error returned by os.Getwd()
when setting up mdir; change the setup to check and fail on that error (e.g.,
assert/Expect that err is nil) before using mdir so setup failures are surfaced
early — specifically update the block that assigns mdir and err from os.Getwd()
and then uses path.Dir/path.Join (variables mdir and err) to call os.ReadFile,
adding an explicit Expect(err).NotTo(HaveOccurred()) or equivalent assertion
right after os.Getwd() and before modifying mdir.
In `@test/functional/metrics/discarded_metrics_test.go`:
- Around line 94-103: Change the Eventually closure to capture and assert the
RunCommand error instead of ignoring it: call metrics, err :=
framework.RunCommand(constants.CollectorName, "sh", "-c", grepDiscardCmd) inside
the func, then Assert the command succeeded with Expect(err).To(BeNil()) (or
otherwise fail the test immediately) before returning metrics; keep the
subsequent Should assertions (e.g., ContainSubstring checks) the same so each
poll both verifies successful scrape/auth and inspects /metrics output.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 28386eb4-9cb5-49d6-a046-eb31d1d63d91
📒 Files selected for processing (5)
bundle/manifests/collector_monitoring.coreos.com_v1_prometheusrule.yamlconfig/prometheus/collector_alerts.yamlinternal/metrics/alerts_test.gointernal/metrics/relabel.gotest/functional/metrics/discarded_metrics_test.go
| rule *monitoringv1.PrometheusRule | ||
| ) | ||
|
|
||
| BeforeEach(func() { |
There was a problem hiding this comment.
Probably can change this to "BeforeAll" if all the subsequent tests rely on upon the same file
| Expect(err).NotTo(HaveOccurred()) | ||
| }) | ||
|
|
||
| It("should parse as a valid PrometheusRule with alert groups", func() { |
There was a problem hiding this comment.
Should this be moved into the Before?
Write oversized and short audit log lines in a single atomic write so Vector processes them together in one read pass. Vector only emits component_discarded_events_total when a successfully-read line follows discarded ones within the same read cycle. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
/retest |
|
/test e2e-target |
|
@vparfonov: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Description
Add a warning alert that fires when Vector source components discard logs (e.g. lines exceeding
max_line_bytes).The alert groups by
namespace,app_kubernetes_io_instance,component_id, andcomponent_typeso users can identify the affected log stream./cc @Clee2691
/assign @jcantrill
Links
Summary by CodeRabbit
New Features
Tests