fix(overview): keep current-month snapshots live and query Tenant kind correctly #5
Conversation
fix(overview): keep current-month snapshots live and query Tenant kind correctly

The current-month snapshot is now regenerated from VictoriaMetrics on every `/api/overview` request and never written to disk. Only final, end-of-month snapshots are persisted. Previously the first request of the month froze the numbers to whatever VM reported at that moment, which is why April stayed at 43 clusters / 164 nodes / 83 tenants while Grafana kept climbing.

Stale in-progress snapshots that may already be present on the snapshot PVC are ignored when loading, so the existing bad April file self-heals once the month ends (a fresh end-of-April snapshot is generated on the first request in May and written to disk).

Also fix the total-tenants query: the application kind label is `Tenant` (PascalCase), not lowercase `tenant`, so the scalar query was always returning 0. This hid behind a workaround in the website fetcher that read `apps.Tenant` instead; `total_tenants` will now be correct on its own.

Signed-off-by: tym83 <6355522@gmail.com>
…h cron (#508)

## Summary

Manually correct the April 2026 telemetry snapshot on `cozystack.io/oss-health/telemetry/` with live VictoriaMetrics values (112 clusters / 450 nodes / 444 tenants as of 2026-04-22) and pause the daily fetch cron until the upstream server fix ships.

Root cause: the telemetry-server `/api/overview` endpoint freezes the current-month snapshot at the first-of-month request. See cozystack/cozystack-telemetry-server#5 for the server fix. Until that is deployed, every `fetch-telemetry.yml` run would overwrite this correction with the stale 43 / 164 / 83.

## What

- `static/oss-health-data/telemetry.json`:
  - `month` (April 2026) → 112 / 450 / 444. Apps scaled linearly by the cluster ratio 112 / 43 ≈ 2.6047.
  - `quarter` and `year` averages recomputed: March derived from `(2 × old_avg − stale_april)`, then the new average computed with the corrected April. Apps given the same treatment.
- `.github/workflows/fetch-telemetry.yml`: `schedule:` block commented out with a TODO pointing at the upstream fix. `workflow_dispatch:` kept for manual refresh.

## Why

Without pausing the cron, the next 08:00 UTC run would revert this commit. Keeping `workflow_dispatch:` means we can trigger the fetch on demand once the server fix is live, to verify it returns live data before re-enabling the schedule.

## Test plan

- [ ] Hugo build + Netlify preview show April 112 / 450 / 444 and scaled apps on `/oss-health/telemetry/`
- [ ] GitHub Actions page shows `Fetch Telemetry Data` with only the manual trigger
- [ ] Once the server fix is deployed: trigger the workflow manually, confirm it produces a diff of 0 (or within rounding) against this committed JSON, then re-enable the daily schedule in a follow-up PR
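The average correction above is plain arithmetic: a stored two-month average that included the stale April value lets us back-derive March, then re-average with the corrected April. The sketch below uses hypothetical helper names and an illustrative stored average of 40 clusters, not values from the actual `telemetry.json`.

```go
package main

import "fmt"

// deriveMarch recovers the March value from a stored two-month average that
// included the stale April figure: oldAvg = (march + staleApril) / 2.
func deriveMarch(oldAvg, staleApril float64) float64 {
	return 2*oldAvg - staleApril
}

// correctedAvg recomputes the two-month average with the corrected April value.
func correctedAvg(march, correctedApril float64) float64 {
	return (march + correctedApril) / 2
}

func main() {
	// Illustrative numbers: a stored average of 40 clusters, the stale April
	// value 43, and the corrected April cluster count 112.
	march := deriveMarch(40, 43)
	fmt.Println(march)                    // 2*40 - 43 = 37
	fmt.Println(correctedAvg(march, 112)) // (37 + 112) / 2 = 74.5
}
```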
Code Review
This pull request introduces logic to distinguish between final and non-final snapshots, ensuring that data for the current month is always fresh and not persisted to disk. The review identifies two significant issues: the implementation bypasses the singleflight mechanism for the current month, which will cause redundant network queries, and the in-memory cache check for past months does not distinguish between final and non-final snapshots, potentially serving stale data until the service is restarted.
```go
if m.isCurrentOrFutureMonth(monthLabel) {
	// In-progress month: always regenerate. collectSnapshot will choose
	// in-memory-only storage because the resulting snapshot is non-final.
	m.collectSnapshot(monthLabel)
	return
}

if m.hasSnapshot(monthLabel) {
	return
}
```
The current implementation of `ensureSnapshot` has two significant issues:

- **Singleflight bypass:** for the current month, it calls `collectSnapshot` directly and returns (lines 110-115). This bypasses the singleflight logic, meaning that multiple concurrent requests for the current month will trigger redundant, expensive network queries to VictoriaMetrics and GitHub.
- **Stale memory cache:** for past months, the check `if m.hasSnapshot(monthLabel)` (line 117) returns true even if the in-memory snapshot is a non-final one from when the month was still in progress. This prevents the snapshot from ever being finalized (loaded from disk or regenerated) until the service is restarted.
I recommend refactoring the logic to use singleflight for all regenerations and ensuring that only final snapshots are used as a cache for past months.
```go
if !m.isCurrentOrFutureMonth(monthLabel) {
	m.mu.RLock()
	hasFinal := false
	for _, s := range m.snapshots {
		if s.Month == monthLabel && isFinalSnapshot(s) {
			hasFinal = true
			break
		}
	}
	m.mu.RUnlock()
	if hasFinal {
		return
	}
	// Try loading from disk cache before querying VM.
	if m.loadSnapshotFromDisk(monthLabel) {
		return
	}
}
```
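For readers unfamiliar with the singleflight pattern the reviewer refers to, here is a minimal stdlib-only sketch of per-key call deduplication in the spirit of `golang.org/x/sync/singleflight`: while one goroutine runs the expensive function for a key, concurrent callers for the same key wait and reuse its result. The `group`/`call` types and the `int` result are simplifications, not the server's real code.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// group deduplicates concurrent calls that share a key.
type group struct {
	mu    sync.Mutex
	calls map[string]*call
}

type call struct {
	done chan struct{} // closed when the result is ready
	val  int
}

// do runs fn for key, coalescing concurrent callers onto a single execution.
func (g *group) do(key string, fn func() int) int {
	g.mu.Lock()
	if g.calls == nil {
		g.calls = make(map[string]*call)
	}
	if c, ok := g.calls[key]; ok {
		g.mu.Unlock()
		<-c.done // another goroutine is already querying; wait and reuse
		return c.val
	}
	c := &call{done: make(chan struct{})}
	g.calls[key] = c
	g.mu.Unlock()

	c.val = fn()
	close(c.done)

	g.mu.Lock()
	delete(g.calls, key) // forget the call so later requests regenerate
	g.mu.Unlock()
	return c.val
}

func main() {
	var g group
	var executions int32 // counts how often the expensive fn actually ran
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			v := g.do("2026-04", func() int {
				atomic.AddInt32(&executions, 1)
				time.Sleep(50 * time.Millisecond) // simulate a VM query
				return 112
			})
			if v != 112 {
				panic("unexpected value")
			}
		}()
	}
	wg.Wait()
	fmt.Println("executions:", atomic.LoadInt32(&executions)) // usually 1, not 8
}
```

Because the key is deleted after completion, a later request for the same in-progress month still triggers a fresh regeneration, which is exactly the behaviour this PR wants for the current month.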
## Summary

The `/api/overview` endpoint currently freezes the current-month snapshot at the first request of the month. For April 2026 this has been returning 43 clusters / 164 nodes / 83 tenants since April 1, while live VictoriaMetrics has climbed to 112 / 450 / 444 as of 2026-04-22.

Two bugs fixed:

- Stale in-progress snapshot files (`2026-04.json` from April 1) are ignored on load, so the existing file self-heals at month turnover.
- `total_tenants` was always 0 because the scalar query used `kind="tenant"` while the emitted metric label is `kind="Tenant"` (see client code in `cozystack/cozystack/internal/telemetry` and docs `v1/operations/configuration/telemetry.md`). Fixed to match the real label.

## What

- `overview.go`:
  - `ensureSnapshot`: for the current/future calendar month, skip memory+disk cache and always call `collectSnapshot`.
  - `collectSnapshot`: persist only when `isFinalSnapshot(s)`; otherwise store in memory via the new `storeSnapshotInMemory` helper.
  - `loadSnapshots` + `loadSnapshotFromDisk`: skip non-final files so the legacy stale April file does not repopulate the cache.
  - `saveSnapshot` delegates the in-memory update to `storeSnapshotInMemory` (no behaviour change).
  - New helpers: `isCurrentOrFutureMonth`, `isFinalSnapshot`.
- Tenant query: `kind="tenant"` → `kind="Tenant"`.

## Why

Without this fix every current-month snapshot reflects whatever VM reported at the first request that month, frozen until the month ends. The website fetches this endpoint daily and exposes the stale number on `cozystack.io/oss-health/telemetry/`. The Tenant-label bug hid behind a workaround in the website fetcher that reads `apps.Tenant` instead of `total_tenants`; that workaround still works after this fix and can be dropped in a follow-up.

## Operational note

After deployment the stale `2026-04.json` on the PVC stays, but is no longer read (`isFinalSnapshot` = false). On May 1 a correct end-of-April snapshot will be generated and will overwrite it. No manual cleanup needed.

## Test plan

- [ ] `go build ./...` and `go vet ./...` pass (done locally)
- [ ] Hit `/api/overview?year=2026&month=04` twice and confirm the response tracks live VM, not a frozen value
- [ ] Request a past month (`?year=2026&month=03`) and confirm it stays cached
- [ ] Confirm `total_tenants` is non-zero and matches `sum(cozy_application_count{kind="Tenant"})`
- [ ] After month turnover, confirm the persisted April snapshot has `CollectedAt` past end of April
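For the tenant-count check in the test plan, the corrected scalar query can be kept as a single constant and compared against what the server sends. This is only a sketch: the metric name and the PascalCase `kind` label come from this PR, but how the real server assembles its PromQL is an assumption here.

```go
package main

import "fmt"

// tenantQuery returns the scalar PromQL used for total_tenants. Note the
// PascalCase "Tenant": the lowercase variant matches no series and made
// the query silently return 0.
func tenantQuery() string {
	return `sum(cozy_application_count{kind="Tenant"})`
}

func main() {
	fmt.Println(tenantQuery())
}
```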