
Flaky test: TestPrometheusMetrics_IncrementedCorrectly (pkg/util/queryeviction) — race between async eviction and evictionsTotal assertion #7604

@sandy2008


AI Tool Usage Notice
If you used an AI tool to help draft this issue,
please make sure you have reviewed and validated all content before submitting.
You are responsible for the accuracy and quality of everything in this report.
Low-quality or unreviewed AI-generated submissions may be closed without further investigation.
See our Generative AI Contribution Policy for details.

Describe the bug

TestPrometheusMetrics_IncrementedCorrectly (pkg/util/queryeviction) intermittently fails with the evictionsTotal counter reading one less than expected:

--- FAIL: TestPrometheusMetrics_IncrementedCorrectly (0.03s)
    evictor_test.go:210:
        Error: Not equal: expected: 3  actual: 2

The test starts an asynchronous evictor (startEvictor, which runs a background loop every CheckInterval). It then registers three queries one at a time, waits for each to be evicted via waitEvicted, and finally asserts the metric:

// pkg/util/queryeviction/evictor_test.go
for i := range 3 {
	_, evicted := registerTestQuery(reg, uint64(1000+i), "q", "user")
	waitEvicted(t, evicted)
}
assert.Equal(t, float64(3), promtest.ToFloat64(evictor.evictionsTotal.WithLabelValues(string(resource.CPU)))) // line 210

evictionsTotal is incremented by the background eviction goroutine. waitEvicted only waits for the eviction signal, so nothing orders the assertion's read after the metric write. If the third eviction is signalled before its counter increment has been published, the assertion observes 2. In other words, the test synchronizes on the eviction signal but not on the metric write.
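One possible production-side fix is to publish the metric before signalling the waiter. The sketch below is a stdlib-only model, not the evictor's actual code. `evictionModel`, `evict`, and `runEvictions` are hypothetical names, an `atomic.Int64` stands in for the Prometheus counter, and closing a channel stands in for whatever signal waitEvicted really observes. It assumes the real evictor currently signals before incrementing. Under the Go memory model, closing a channel happens-before a receive that returns because the channel is closed, so once `<-evicted` returns, the increment is guaranteed to be visible.

```go
package main

import (
	"sync"
	"sync/atomic"
)

// evictionModel is a hypothetical stand-in for the evictor: evictionsTotal
// plays the role of the Prometheus counter, and each query's evicted channel
// plays the role of the signal that waitEvicted blocks on.
type evictionModel struct {
	evictionsTotal atomic.Int64
}

// evict records the eviction in the counter before closing the query's
// channel. Because the close happens-before any receive that observes it,
// a goroutine returning from <-evicted is guaranteed to see the increment.
func (m *evictionModel) evict(evicted chan struct{}) {
	m.evictionsTotal.Add(1) // publish the metric first
	close(evicted)          // then signal the waiter
}

// runEvictions drives n evictions through a background goroutine, waits on
// each eviction signal in turn, and returns the counter value observed
// immediately after each wait.
func runEvictions(n int) []int64 {
	m := &evictionModel{}
	queries := make(chan chan struct{})

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for evicted := range queries {
			m.evict(evicted)
		}
	}()

	observed := make([]int64, 0, n)
	for i := 0; i < n; i++ {
		evicted := make(chan struct{})
		queries <- evicted
		<-evicted // analogous to waitEvicted
		observed = append(observed, m.evictionsTotal.Load())
	}
	close(queries)
	wg.Wait()
	return observed
}
```

With this ordering, the value read right after the i-th signal is always i+1, so an assertion like the one at line 210 cannot observe a stale count.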

(This package was introduced by #7488, "Add Query Resource Based Eviction".)

To Reproduce

Steps to reproduce the behavior:

  1. Check out Cortex (recent master)
  2. Run the test repeatedly:
    go test -count=200 -run TestPrometheusMetrics_IncrementedCorrectly ./pkg/util/queryeviction/
    

Expected behavior

The test passes deterministically; the assertion observes all 3 eviction metric increments (the test should synchronize on the metric being published, not only on the eviction signal).

Environment:

  • Infrastructure: GitHub Actions CI, ubuntu-24.04 (amd64), test job
  • Deployment tool: N/A (Go unit test)

Additional Context

Observed on CI (2026-06-07): https://github.com/cortexproject/cortex/actions/runs/27093523437 (job test (amd64)). The PR under test only modified pkg/scheduler, which is unrelated to pkg/util/queryeviction, indicating that this is a flake rather than a regression.

Filed from CI failure-log analysis with AI assistance; the run link and evictor_test.go:210 were reviewed and verified against master before submitting.
