Port Lite Stage 1+2+3 analysis detectors and scheduled notifications to Dashboard by erikdarlingdata · Pull Request #997 · erikdarlingdata/PerformanceMonitor

erikdarlingdata · 2026-05-26T21:55:48Z

Brings the Dashboard analysis engine to parity with Lite across all three triage workstreams that have shipped on the Lite side:

Stage 1 — PARAMETER_SENSITIVITY and PLAN_REGRESSION detectors (PR Add plan-regression and parameter-sensitivity triage detectors (Lite) #975 merged Lite-side).
Stage 2 — scheduled analysis + finding notifications routed through the existing email/webhook channels (Lite-side committed locally, not yet on dev).
Stage 3 — BLOCKING_CHAIN reconstruction and RESOURCE_SEMAPHORE_QUERY_COMPILE wait scoring (Lite-side committed locally, not yet on dev).

Two commits — split logically the same way the design plan recommended:

c0cb2d5 Port Lite Stage 1+3 analysis detectors to Dashboard (engine + collectors + reconstructor + drill-downs + ServerIdHelper + 6-site MCP swap to deterministic FNV-1a hash + sync-checker agent)
1f08b5b Wire scheduled analysis + finding notifications (Dashboard Stage 2 — AnalysisNotificationService + FindingMessageFormatter + AnalysisScheduler + UserPreferences extensions + SettingsWindow group + RecordAlert-when-channels-absent fallback + EmailTemplateBuilder header rename)

Design history

Designed in nifty-honking-coral.md and revised across three adversarial review rounds. Round 1 caught the FormatBaselineContext / RootFactMetadata gaps; round 2 moved the .NET 10 GetHashCode randomization in-scope and restructured FindingMessageFormatter for Dashboard's AlertContext shape; round 3 caught the webhook-URL-absent silent-drop and the plain-text Recent Events header (both fixes shipped here).

After implementation, a code-reviewer pass surfaced a ServerId format mismatch (commit-history alert log used the FNV-1a int while existing alert sites used ServerConnection.Id) — fixed by resolving via IServerManager lookup, with FNV-1a fallback.

Cross-cutting Dashboard fixes that go with the port

Dashboard/Services/ServerIdHelper.cs — FNV-1a stable hash ported from Lite (RemoteCollectorService.cs:930-941). All six existing ServerName.GetHashCode() call sites in Dashboard/Mcp/McpAnalysisTools.cs swapped to it. string.GetHashCode() is randomized per process on .NET 10, so persisted config.analysis_findings.server_id and config.analysis_muted rows would not match the next launch's value. One-time reset; new writes stable across restart and consistent between MCP and scheduled paths.
Dashboard/Analysis/AnalysisModels.cs — RootFactMetadata added to both AnalysisStory and AnalysisFinding (ephemeral, not persisted). Populated in InferenceEngine.BuildStory, carried into SaveFindingsAsync — used by the finding message formatter.
Dashboard/Mcp/McpAnalysisTools.cs — FormatBaselineContext and DayNames ported from Lite onto ToolRecommendations. Fully-qualified System.Collections.Generic.Dictionary<…> to match the file's existing style (no using added).
Dashboard/Services/EmailTemplateBuilder.cs — hard-coded "RECENT EVENTS" HTML span and "--- Recent Events ---" plain-text header renamed to "DETAILS" / "--- Details ---". Generic header fits both threshold alerts and analysis findings; existing alert sites all populate AlertContext.Details with non-event content already.

Drift mitigation

Dashboard/Analysis/BlockingChainReconstructor.cs is a verbatim copy of Lite/Analysis/BlockingChainReconstructor.cs (29 diff lines, all in the header). A new user-level blocking-reconstructor-sync-checker agent at ~/.claude/agents/blocking-reconstructor-sync-checker.md flags any future drift between the two copies, mirroring the existing planalyzer-sync-checker.

Verification

dotnet build Dashboard/Dashboard.csproj -c Debug clean (0 warnings, 0 errors).
dotnet build Lite/PerformanceMonitorLite.csproj -c Debug clean.
dotnet test Lite.Tests — 260/260 pass (Lite untouched).
docs-first-verifier confirmed T-SQL constructs (DECOMPRESS + LEFT/CAST, ROW_NUMBER CTE in place of QUALIFY, OFFSET/FETCH in place of LIMIT, MAX(VALUES) in place of GREATEST, MAX(CAST(bit AS tinyint)) in place of bool_or, descendant XQuery axis, varchar literal for activity = 'blocked').
code-reviewer pass on the full diff: clean, no correctness issues.
avalonia-gotcha-reviewer: pure WPF, no Avalonia traps.

Out of scope

ANOMALY_MEMORY_PRESSURE arm that Lite has and Dashboard's ScoreAnomalyFact omits — pre-existing parity gap, separate planned workstream (dashboard-analysis-parity-sweep.md).
Dashboard.Tests source — separate planned workstream (dashboard-tests-bootstrap.md).
Persistent finding-notification cooldowns across restarts — separate planned workstream (persistent-notification-cooldowns.md).
SQL-side blocked-process-pair XML re-parse elimination — the schema half landed in Add typed blocker-side columns to blocking_BlockedProcessReport #996; the C# read-path simplification is gated by this PR landing first and is the next planned PR.

🤖 Generated with Claude Code

PR (a) of the Dashboard port — engine seams + the two new fact paths that do not depend on the scheduler/notification layer. Stage 2 (scheduled analysis + finding notifications) drops on top of this as PR (b). Lite-side workstreams ported here: Stage 1 — Parameter sensitivity and plan regression: - PARAMETER_SENSITIVITY: detects a single cached plan whose per-execution worker time varies wildly (classic parameter sniffing). Sourced from collect.query_stats; magnitude-driven scoring so a lone catastrophic offender still scores high. Drill-down surfaces the top five plans. - PLAN_REGRESSION: detects a query whose currently-active plan has per-execution cost >= 2x the best plan that query previously used. Sourced from collect.query_store_data. Uses a 14-day comparison window (independent of the standard analysis window) so the days-old "best plan" baseline is present. Stage 3 — Blocking chains and compile-wait scoring: - BlockingChainReconstructor (new, pure static) walks blocked-process pairs into chains using composite (spid, last_tran_started) session identity so a reused SPID with a different transaction start cannot fabricate a chain. The 1900-01-01 "no transaction" sentinel is normalized to NULL. - CollectBlockingChainFactsAsync re-parses blocked_process_report_xml at analysis time (no SQL-side typed columns yet) and emits one aggregate BLOCKING_CHAIN fact carrying the worst chain's apex, depth, and victim count — structure BLOCKING_EVENTS (a rate) is blind to. - Scoring is magnitude-driven: Max(depth, victims) so one severe dimension scores high without being diluted. Amplifiers cover a sleeping apex, deadlocks present, and THREADPOOL. Forward and reverse edges connect BLOCKING_CHAIN to LCK / THREADPOOL / DEADLOCKS / BLOCKING_EVENTS. - RESOURCE_SEMAPHORE_QUERY_COMPILE wait fact: already collected but unscored. Added a ramped (0.01, 0.10) threshold, compile-specific amplifiers (CPU and scheduler signals, not the runtime-grant amplifiers that would mislead), and edges to SOS_SCHEDULER_YIELD and CPU_SQL_PERCENT. Cross-cutting Dashboard fixes: - Ported Lite's GetDeterministicHashCode (FNV-1a) into a new ServerIdHelper. Swapped all six existing ServerName.GetHashCode() call sites in McpAnalysisTools to use it — string.GetHashCode() is randomized per process on .NET 10, so persisted config.analysis_findings.server_id and config.analysis_muted rows did not match the next launch's value for the same server name. New writes are stable; existing rows are a one-time reset (they were already broken across restart with the random hash). - Added RootFactMetadata to both AnalysisStory and AnalysisFinding (ephemeral, not persisted). Populated in InferenceEngine.BuildStory and carried into SaveFindingsAsync — used by Stage 2's finding message formatter (PR (b)). - Ported FormatBaselineContext and DayNames onto ToolRecommendations. Currently no caller; staged for the notification path in PR (b). T-SQL adaptations from Lite's DuckDB queries verified against MS Learn: - DECOMPRESS + LEFT(CAST(... AS NVARCHAR(MAX)), N) for query_text / query_sql_text (page-compressed varbinary(max)). - WITH ... ROW_NUMBER() OVER ... WHERE rn = 1 in place of QUALIFY. - OFFSET 0 ROWS FETCH NEXT N ROWS ONLY in place of LIMIT. - (SELECT MAX(v) FROM (VALUES (a), (b)) AS x(v)) in place of GREATEST (SQL Server has no GREATEST before 2022). - MAX(CAST(is_forced_plan AS tinyint)) in place of bool_or (bit type is invalid for MAX in T-SQL). - query_plan_hash moved into GROUP BY rather than aggregated (MS Learn's MAX docs do not list binary types; using a documented construct). - query_sql_text fetched via OUTER APPLY rather than carried through aggregations (MAX is invalid on varbinary(max)). - TOP (5000) ORDER BY collection_time DESC over collect.blocking_BlockedProcessReport — backward CIX scan, sort-free (clustered index leads on collection_time). - activity = 'blocked' (varchar literal, no N prefix) preserves the sargable predicate (an N-prefixed literal would force an nvarchar promotion of the column and break any future index seek). Drift mitigation: BlockingChainReconstructor.cs is a verbatim copy of the Lite file (29 diff lines, all in the header). A user-level blocking-reconstructor-sync-checker agent at C:\Users\edarl\.claude\agents\ flags any drift between the two copies, mirroring the existing planalyzer-sync-checker. Verification: - dotnet build Dashboard/Dashboard.csproj -c Debug: clean (0 warnings, 0 errors). - dotnet test Lite.Tests/Lite.Tests.csproj: 260/260 pass (Lite untouched). - Reconstructor body parity: in sync with the Lite worktree. - T-SQL verification via docs-first-verifier against MS Learn. - C# correctness pass via code-reviewer: no issues. Designed in a plan file and revised across three adversarial review rounds. Round 1 caught the FormatBaselineContext/RootFactMetadata gaps; round 2 moved the .NET 10 GetHashCode randomization in-scope and restructured FindingMessageFormatter for Stage 2's AlertContext shape; round 3 caught the webhook-URL-absent silent-drop and the plain-text "Recent Events" header (both fixes land in PR (b)). Stage 2 — scheduled analysis + AnalysisNotificationService + Settings UI + RecordAlert-when-channels-absent + EmailTemplateBuilder header rename — is the committed follow-on (PR (b)). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

PR (b) of the Dashboard port — drops on top of c0cb2d5. Closes the "act" stage: high-severity findings produced by the engine in PR (a) now ride the same email/Slack/Teams channels (and Alerts history tab) as the existing threshold alerts, on a configurable schedule. Scheduler (Dashboard/Services/AnalysisScheduler.cs, new): - Owns its own DispatcherTimer — separate cadence and gating from _alertCheckTimer. Gated by AnalysisNotificationsEnabled; zero cost when off. - Per-server fresh AnalysisService (IsAnalyzing is a single-instance flag, so a shared instance whose task is abandoned on timeout would block every other server). - ConcurrentDictionary<int,byte> in-flight tracking, cleared only on real task completion via ContinueWith — a permanently-hung server is parked, not relaunched. Accepted limitation, matches Lite Stage 2. - Task.WhenAny per-server timeout with a linked CTS so the orphaned Task.Delay timer is cancelled the moment analyze wins the race. - _cycleRunning re-entrancy guard against the timer firing again while a slow cycle is still in progress. - Stop() cancels the CTS so shutdown drops out cleanly instead of waiting out the full per-server timeout window. Notifier + formatter (Dashboard/Services/AnalysisNotificationService.cs, new — sealed, with FindingMessageFormatter as internal static in the same file): - {serverId}:{StoryPathHash} cooldown so a recurring finding does not re-notify every analysis cycle. In-memory only; lost on app restart (accepted limitation, matches Lite). - Diagnosis routed into structured AlertContext.Details rows (Story, Severity, Notify threshold, Confidence, Facts, Database, Window). Dashboard's TrySendAlertEmailAsync has no detailText parameter, so the structured route replaces Lite's detailText payload — renders inside the same 600px email template the threshold alerts use. - BuildContext walks finding.DrillDown values as JsonElement via a System.Text.Json round-trip — robust to any drill-down shape. - ServerId lookup: resolve ServerConnection.Id by ServerName for the alert log so keys line up with the existing threshold-alert engine (code-reviewer caught this). Falls back to the stable int id if the server was removed between cycle start and notify. - RecordAlert fallback when no channel is configured: round-3 review caught a webhook-flag-on-with-URL-absent silent-drop. The fallback predicate now requires both the prefs flag AND the URL via WebhookAlertService.GetTeamsWebhookUrl / GetSlackWebhookUrl so the Alerts history tab still logs findings when no channel can send. Documents the asymmetric-cooldown limitation (SMTP cooldown can suppress a RecordAlert that the analysis cooldown would have allowed) as accepted. - FindingMessageFormatter stays static — notifyThreshold is threaded through as a method parameter rather than read inline (Dashboard has no App.* statics like Lite). UI + settings (Dashboard/Models/UserPreferences.cs + Dashboard/SettingsWindow.xaml + .xaml.cs): - Five new properties: AnalysisNotificationsEnabled (false), AnalysisIntervalMinutes (30), AnalysisNotifySeverity (1.5), AnalysisNotifyCooldownMinutes (360), AnalysisTimeoutSeconds (120). - Bounds enforced at consumption (AnalysisScheduler.Configure clamps interval 5-360 and timeout 30-600; NotifyAsync clamps severity 0-2 and cooldown 30-10080). UI also validates the three exposed values with the same range constants. Cooldown and timeout are intentionally preferences.json-only. - "Automated Analysis Notifications" group added after the MCP section on the General tab — enable checkbox + interval (minutes) + severity threshold (0.0-2.0). - Ships disabled. MainWindow wiring (Dashboard/MainWindow.xaml.cs): - Two new fields: _analysisNotificationService and _analysisScheduler. Names chosen to avoid colliding with _notificationService (tray service constructed in MainWindow_Loaded). - Constructed in the ctor right after _alertCheckTimer — all deps (_serverManager, _credentialService, _preferencesService, _emailAlertService) exist by that point. - _analysisScheduler.Configure() called in MainWindow_Loaded after ConfigureAlertCheckTimer, and again on settings save — paralleling the existing _alertCheckTimer pattern so a toggle in Settings takes immediate effect without restart. - _analysisScheduler.Stop() in MainWindow_Closing. EmailTemplateBuilder.cs header rename: - Hard-coded "RECENT EVENTS" was misleading for both threshold alerts (the section already carried non-event content like sessions and TempDB metrics) and findings. Renamed to "DETAILS" / "--- Details ---" in both the HTML and plain-text bodies (round-3 review caught the plain-text occurrence the round-2 patch missed). Webhook rendering is unaffected — it already flattened AlertContext.Details without using this header. Verification: - dotnet build Dashboard/Dashboard.csproj -c Debug: clean (0 warnings, 0 errors). - code-reviewer pass: surfaced ServerId-format-mismatch and a few minor items; the real bug and two minor cleanups (linked CTS, Stop() on close, asymmetric-cooldown comment) applied. - avalonia-gotcha-reviewer: confirmed pure WPF, no Avalonia traps. Three Lite stages now have full Dashboard equivalents through both PR (a) and PR (b). The original four problem scenarios — long blocking chains, parameter sensitivity / plan regressions, resource contention including the compile-wait fact — are detected by both apps and routed into the notification surface on both apps. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…rd alert-toast parity

erikdarlingdata and others added 4 commits May 26, 2026 11:55

Merge origin/dev — pull in #993 Lite SnoozeBalloon fix + #994 Dashboa…

a5ddffc

…rd alert-toast parity

Merge origin/dev — pull in #996 typed-column schema for blocking reports

125d48b

erikdarlingdata merged commit 45fb5d8 into dev May 26, 2026
2 checks passed

erikdarlingdata deleted the feature/dashboard-port-detectors branch May 26, 2026 22:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Port Lite Stage 1+2+3 analysis detectors and scheduled notifications to Dashboard#997

Port Lite Stage 1+2+3 analysis detectors and scheduled notifications to Dashboard#997
erikdarlingdata merged 4 commits into
devfrom
feature/dashboard-port-detectors

erikdarlingdata commented May 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

erikdarlingdata commented May 26, 2026

Design history

Cross-cutting Dashboard fixes that go with the port

Drift mitigation

Verification

Out of scope

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant