fix(overview): keep current-month snapshots live and query Tenant kind correctly #5
Conversation
fix(overview): keep current-month snapshots live and query Tenant kind correctly

The current-month snapshot is now regenerated from VictoriaMetrics on every `/api/overview` request and never written to disk. Only final, end-of-month snapshots are persisted. Previously the first request of the month froze the numbers to whatever VM reported at that moment, which is why April stayed at 43 clusters / 164 nodes / 83 tenants while Grafana kept climbing.

Stale in-progress snapshots that may already be present on the snapshot PVC are ignored when loading, so the existing bad April file self-heals once the month ends (a fresh end-of-April snapshot is generated on the first request in May and written to disk).

Also fix the total-tenants query: the application kind label is `Tenant` (PascalCase), not lowercase `tenant`, so the scalar query was always returning 0. This hid behind a workaround in the website fetcher that read `apps.Tenant` instead; `total_tenants` will now be correct on its own.

Signed-off-by: tym83 <6355522@gmail.com>
…h cron (#508)

## Summary

Manually correct the April 2026 telemetry snapshot on `cozystack.io/oss-health/telemetry/` with live VictoriaMetrics values (112 clusters / 450 nodes / 444 tenants as of 2026-04-22) and pause the daily fetch cron until the upstream server fix ships.

Root cause: the telemetry-server `/api/overview` endpoint freezes the current-month snapshot at the first-of-month request. See cozystack/cozystack-telemetry-server#5 for the server fix. Until that is deployed, every `fetch-telemetry.yml` run would overwrite this correction with the stale 43 / 164 / 83.

## What

- `static/oss-health-data/telemetry.json`:
  - `month` (April 2026) → 112 / 450 / 444. Apps scaled linearly by the cluster ratio 112 / 43 ≈ 2.6047.
  - `quarter` and `year` averages recomputed: March derived from `(2 × old_avg − stale_april)`, then the new average computed with the corrected April. Apps given the same treatment.
- `.github/workflows/fetch-telemetry.yml`: `schedule:` block commented out with a TODO pointing at the upstream fix. `workflow_dispatch:` kept for manual refresh.

## Why

Without pausing the cron, the next 08:00 UTC run would revert this commit. Keeping `workflow_dispatch:` means we can trigger the fetch on demand once the server fix is live, to verify it returns live data before re-enabling the schedule.

## Test plan

- [ ] Hugo build + Netlify preview show April 112 / 450 / 444 and scaled apps on `/oss-health/telemetry/`
- [ ] GitHub Actions page shows `Fetch Telemetry Data` with only the manual trigger
- [ ] Once the server fix is deployed: trigger the workflow manually, confirm it produces a diff of 0 (or within rounding) against this committed JSON, then re-enable the daily schedule in a follow-up PR
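The average correction above is plain arithmetic: a stored two-month average that included the stale April value lets us back-derive March, then re-average with the corrected April. The sketch below uses hypothetical helper names and an illustrative stored average of 40 clusters, not values from the actual `telemetry.json`.

```go
package main

import "fmt"

// deriveMarch recovers the March value from a stored two-month average that
// included the stale April figure: oldAvg = (march + staleApril) / 2.
func deriveMarch(oldAvg, staleApril float64) float64 {
	return 2*oldAvg - staleApril
}

// correctedAvg recomputes the two-month average with the corrected April value.
func correctedAvg(march, correctedApril float64) float64 {
	return (march + correctedApril) / 2
}

func main() {
	// Illustrative numbers: a stored average of 40 clusters, the stale April
	// value 43, and the corrected April cluster count 112.
	march := deriveMarch(40, 43)
	fmt.Println(march)                    // 2*40 - 43 = 37
	fmt.Println(correctedAvg(march, 112)) // (37 + 112) / 2 = 74.5
}
```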
Code Review
This pull request introduces logic to distinguish between final and non-final snapshots, ensuring that data for the current month is always fresh and not persisted to disk. The review identifies two significant issues: the implementation bypasses the singleflight mechanism for the current month, which will cause redundant network queries, and the in-memory cache check for past months does not distinguish between final and non-final snapshots, potentially serving stale data until the service is restarted.
```go
if m.isCurrentOrFutureMonth(monthLabel) {
	// In-progress month: always regenerate. collectSnapshot will choose
	// in-memory-only storage because the resulting snapshot is non-final.
	m.collectSnapshot(monthLabel)
	return
}

if m.hasSnapshot(monthLabel) {
	return
}
```
The current implementation of `ensureSnapshot` has two significant issues:

- **Singleflight bypass:** for the current month, it calls `collectSnapshot` directly and returns (lines 110-115). This bypasses the singleflight logic, meaning that multiple concurrent requests for the current month will trigger redundant, expensive network queries to VictoriaMetrics and GitHub.
- **Stale memory cache:** for past months, the check `if m.hasSnapshot(monthLabel)` (line 117) returns true even if the in-memory snapshot is a non-final one from when the month was still in progress. This prevents the snapshot from ever being finalized (loaded from disk or regenerated) until the service is restarted.
I recommend refactoring the logic to use singleflight for all regenerations and ensuring that only final snapshots are used as a cache for past months.
```go
if !m.isCurrentOrFutureMonth(monthLabel) {
	m.mu.RLock()
	hasFinal := false
	for _, s := range m.snapshots {
		if s.Month == monthLabel && isFinalSnapshot(s) {
			hasFinal = true
			break
		}
	}
	m.mu.RUnlock()
	if hasFinal {
		return
	}
	// Try loading from disk cache before querying VM.
	if m.loadSnapshotFromDisk(monthLabel) {
		return
	}
}
```
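For readers unfamiliar with the singleflight pattern the reviewer refers to, here is a minimal stdlib-only sketch of per-key call deduplication in the spirit of `golang.org/x/sync/singleflight`: while one goroutine runs the expensive function for a key, concurrent callers for the same key wait and reuse its result. The `group`/`call` types and the `int` result are simplifications, not the server's real code.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// group deduplicates concurrent calls that share a key.
type group struct {
	mu    sync.Mutex
	calls map[string]*call
}

type call struct {
	done chan struct{} // closed when the result is ready
	val  int
}

// do runs fn for key, coalescing concurrent callers onto a single execution.
func (g *group) do(key string, fn func() int) int {
	g.mu.Lock()
	if g.calls == nil {
		g.calls = make(map[string]*call)
	}
	if c, ok := g.calls[key]; ok {
		g.mu.Unlock()
		<-c.done // another goroutine is already querying; wait and reuse
		return c.val
	}
	c := &call{done: make(chan struct{})}
	g.calls[key] = c
	g.mu.Unlock()

	c.val = fn()
	close(c.done)

	g.mu.Lock()
	delete(g.calls, key) // forget the call so later requests regenerate
	g.mu.Unlock()
	return c.val
}

func main() {
	var g group
	var executions int32 // counts how often the expensive fn actually ran
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			v := g.do("2026-04", func() int {
				atomic.AddInt32(&executions, 1)
				time.Sleep(50 * time.Millisecond) // simulate a VM query
				return 112
			})
			if v != 112 {
				panic("unexpected value")
			}
		}()
	}
	wg.Wait()
	fmt.Println("executions:", atomic.LoadInt32(&executions)) // usually 1, not 8
}
```

Because the key is deleted after completion, a later request for the same in-progress month still triggers a fresh regeneration, which is exactly the behaviour this PR wants for the current month.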
## Summary

The `/api/overview` endpoint currently freezes the current-month snapshot at the first request of the month. For April 2026 this has been returning 43 clusters / 164 nodes / 83 tenants since April 1, while live VictoriaMetrics has climbed to 112 / 450 / 444 as of 2026-04-22.

Two bugs fixed:

- Stale in-progress snapshot files (`2026-04.json` from April 1) are ignored on load, so the existing file self-heals at month turnover.
- `total_tenants` was always 0 because the scalar query used `kind="tenant"` while the emitted metric label is `kind="Tenant"` (see client code in `cozystack/cozystack/internal/telemetry` and docs `v1/operations/configuration/telemetry.md`). Fixed to match the real label.

## What

- `overview.go`:
  - `ensureSnapshot`: for the current/future calendar month, skip memory+disk cache and always call `collectSnapshot`.
  - `collectSnapshot`: persist only when `isFinalSnapshot(s)`; otherwise store in memory via the new `storeSnapshotInMemory` helper.
  - `loadSnapshots` + `loadSnapshotFromDisk`: skip non-final files so the legacy stale April file does not repopulate the cache.
  - `saveSnapshot` delegates the in-memory update to `storeSnapshotInMemory` (no behaviour change).
  - New helpers: `isCurrentOrFutureMonth`, `isFinalSnapshot`.
- Tenant query: `kind="tenant"` → `kind="Tenant"`.

## Why

Without this fix every current-month snapshot reflects whatever VM reported at the first request that month, frozen until the month ends. The website fetches this endpoint daily and exposes the stale number on `cozystack.io/oss-health/telemetry/`. The Tenant-label bug hid behind a workaround in the website fetcher that reads `apps.Tenant` instead of `total_tenants`; that workaround still works after this fix and can be dropped in a follow-up.

## Operational note

After deployment the stale `2026-04.json` on the PVC stays, but is no longer read (`isFinalSnapshot` = false). On May 1 a correct end-of-April snapshot will be generated and will overwrite it. No manual cleanup needed.

## Test plan

- [ ] `go build ./...` and `go vet ./...` pass (done locally)
- [ ] Hit `/api/overview?year=2026&month=04` twice and confirm the response tracks live VM, not a frozen value
- [ ] Request a past month (`?year=2026&month=03`) and confirm it stays cached
- [ ] Confirm `total_tenants` is non-zero and matches `sum(cozy_application_count{kind="Tenant"})`
- [ ] After month turnover, confirm the persisted April snapshot has `CollectedAt` past end of April
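For the tenant-count check in the test plan, the corrected scalar query can be kept as a single constant and compared against what the server sends. This is only a sketch: the metric name and the PascalCase `kind` label come from this PR, but how the real server assembles its PromQL is an assumption here.

```go
package main

import "fmt"

// tenantQuery returns the scalar PromQL used for total_tenants. Note the
// PascalCase "Tenant": the lowercase variant matches no series and made
// the query silently return 0.
func tenantQuery() string {
	return `sum(cozy_application_count{kind="Tenant"})`
}

func main() {
	fmt.Println(tenantQuery())
}
```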