Skip to content

Flaky test: TestQuerierWithBlocksStorageOnMissingBlocksFromStorage (integration_querier, arm64) — transient 500 on the pre-deletion query #7605

@sandy2008

Description

@sandy2008

AI Tool Usage Notice
If you used an AI tool to help draft this issue,
please make sure you have reviewed and validated all content before submitting.
You are responsible for the accuracy and quality of everything in this report.
Low-quality or unreviewed AI-generated submissions may be closed without further investigation.
See our Generative AI Contribution Policy for details.

Describe the bug

The integration_querier test TestQuerierWithBlocksStorageOnMissingBlocksFromStorage intermittently fails on arm64. The failure is on the first (happy-path) query, which runs before any block is deleted and is expected to succeed:

// integration/querier_test.go
result, err := c.Query("series_1", series1Timestamp) // line 366
require.NoError(t, err)                               // line 367  <-- fails with a 500

Observed error:

querier_test.go:367:
    Error: Received unexpected error:
           server_error: server error: 500
    Test:  TestQuerierWithBlocksStorageOnMissingBlocksFromStorage

The test waits for the store-gateway to load blocks (cortex_bucket_store_blocks_loaded == 1) and for ring tokens before querying, but the store-gateway still intermittently returns a transient 500 on this initial query. Store-gateway logs around the failure show a momentary ResourceExhausted ... exceeded series limit ... (got 2) and occasional connection refused to Consul during startup, which points to a startup / eventual-consistency race rather than a real query error. The assertion line has been stable since 2020, so this is timing, not a regression.

To Reproduce

Steps to reproduce the behavior:

  1. Start Cortex (recent master)
  2. Run the integration test on arm64 (it is flaky, so multiple runs may be needed):
    go test -tags=integration,integration_querier -count=5 -run TestQuerierWithBlocksStorageOnMissingBlocksFromStorage ./integration/...
    

Expected behavior

The initial query (before block deletion) succeeds deterministically once the store-gateway reports blocks loaded and the ring has converged; no transient 500.

Environment:

  • Infrastructure: GitHub Actions CI, ubuntu-24.04-arm (arm64), integration job, tag integration_querier
  • Deployment tool: N/A (Docker-based integration test)

Additional Context

Observed on CI (both arm64, on PRs whose diff is unrelated to the querier):

Filed from CI failure-log analysis with AI assistance; the run links and querier_test.go:367 were reviewed and verified against master before submitting.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions