Port Lite Stage 1+2+3 analysis detectors and scheduled notifications to Dashboard#997
Merged
Merged
Conversation
PR (a) of the Dashboard port — engine seams + the two new fact paths that do not depend on the scheduler/notification layer. Stage 2 (scheduled analysis + finding notifications) drops on top of this as PR (b). Lite-side workstreams ported here: Stage 1 — Parameter sensitivity and plan regression: - PARAMETER_SENSITIVITY: detects a single cached plan whose per-execution worker time varies wildly (classic parameter sniffing). Sourced from collect.query_stats; magnitude-driven scoring so a lone catastrophic offender still scores high. Drill-down surfaces the top five plans. - PLAN_REGRESSION: detects a query whose currently-active plan has per-execution cost >= 2x the best plan that query previously used. Sourced from collect.query_store_data. Uses a 14-day comparison window (independent of the standard analysis window) so the days-old "best plan" baseline is present. Stage 3 — Blocking chains and compile-wait scoring: - BlockingChainReconstructor (new, pure static) walks blocked-process pairs into chains using composite (spid, last_tran_started) session identity so a reused SPID with a different transaction start cannot fabricate a chain. The 1900-01-01 "no transaction" sentinel is normalized to NULL. - CollectBlockingChainFactsAsync re-parses blocked_process_report_xml at analysis time (no SQL-side typed columns yet) and emits one aggregate BLOCKING_CHAIN fact carrying the worst chain's apex, depth, and victim count — structure BLOCKING_EVENTS (a rate) is blind to. - Scoring is magnitude-driven: Max(depth, victims) so one severe dimension scores high without being diluted. Amplifiers cover a sleeping apex, deadlocks present, and THREADPOOL. Forward and reverse edges connect BLOCKING_CHAIN to LCK / THREADPOOL / DEADLOCKS / BLOCKING_EVENTS. - RESOURCE_SEMAPHORE_QUERY_COMPILE wait fact: already collected but unscored. Added a ramped (0.01, 0.10) threshold, compile-specific amplifiers (CPU and scheduler signals, not the runtime-grant amplifiers that would mislead), and edges to SOS_SCHEDULER_YIELD and CPU_SQL_PERCENT. Cross-cutting Dashboard fixes: - Ported Lite's GetDeterministicHashCode (FNV-1a) into a new ServerIdHelper. Swapped all six existing ServerName.GetHashCode() call sites in McpAnalysisTools to use it — string.GetHashCode() is randomized per process on .NET 10, so persisted config.analysis_findings.server_id and config.analysis_muted rows did not match the next launch's value for the same server name. New writes are stable; existing rows are a one-time reset (they were already broken across restart with the random hash). - Added RootFactMetadata to both AnalysisStory and AnalysisFinding (ephemeral, not persisted). Populated in InferenceEngine.BuildStory and carried into SaveFindingsAsync — used by Stage 2's finding message formatter (PR (b)). - Ported FormatBaselineContext and DayNames onto ToolRecommendations. Currently no caller; staged for the notification path in PR (b). T-SQL adaptations from Lite's DuckDB queries verified against MS Learn: - DECOMPRESS + LEFT(CAST(... AS NVARCHAR(MAX)), N) for query_text / query_sql_text (page-compressed varbinary(max)). - WITH ... ROW_NUMBER() OVER ... WHERE rn = 1 in place of QUALIFY. - OFFSET 0 ROWS FETCH NEXT N ROWS ONLY in place of LIMIT. - (SELECT MAX(v) FROM (VALUES (a), (b)) AS x(v)) in place of GREATEST (SQL Server has no GREATEST before 2022). - MAX(CAST(is_forced_plan AS tinyint)) in place of bool_or (bit type is invalid for MAX in T-SQL). - query_plan_hash moved into GROUP BY rather than aggregated (MS Learn's MAX docs do not list binary types; using a documented construct). - query_sql_text fetched via OUTER APPLY rather than carried through aggregations (MAX is invalid on varbinary(max)). - TOP (5000) ORDER BY collection_time DESC over collect.blocking_BlockedProcessReport — backward CIX scan, sort-free (clustered index leads on collection_time). - activity = 'blocked' (varchar literal, no N prefix) preserves the sargable predicate (an N-prefixed literal would force an nvarchar promotion of the column and break any future index seek). Drift mitigation: BlockingChainReconstructor.cs is a verbatim copy of the Lite file (29 diff lines, all in the header). A user-level blocking-reconstructor-sync-checker agent at C:\Users\edarl\.claude\agents\ flags any drift between the two copies, mirroring the existing planalyzer-sync-checker. Verification: - dotnet build Dashboard/Dashboard.csproj -c Debug: clean (0 warnings, 0 errors). - dotnet test Lite.Tests/Lite.Tests.csproj: 260/260 pass (Lite untouched). - Reconstructor body parity: in sync with the Lite worktree. - T-SQL verification via docs-first-verifier against MS Learn. - C# correctness pass via code-reviewer: no issues. Designed in a plan file and revised across three adversarial review rounds. Round 1 caught the FormatBaselineContext/RootFactMetadata gaps; round 2 moved the .NET 10 GetHashCode randomization in-scope and restructured FindingMessageFormatter for Stage 2's AlertContext shape; round 3 caught the webhook-URL-absent silent-drop and the plain-text "Recent Events" header (both fixes land in PR (b)). Stage 2 — scheduled analysis + AnalysisNotificationService + Settings UI + RecordAlert-when-channels-absent + EmailTemplateBuilder header rename — is the committed follow-on (PR (b)). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR (b) of the Dashboard port — drops on top of c0cb2d5. Closes the "act" stage: high-severity findings produced by the engine in PR (a) now ride the same email/Slack/Teams channels (and Alerts history tab) as the existing threshold alerts, on a configurable schedule. Scheduler (Dashboard/Services/AnalysisScheduler.cs, new): - Owns its own DispatcherTimer — separate cadence and gating from _alertCheckTimer. Gated by AnalysisNotificationsEnabled; zero cost when off. - Per-server fresh AnalysisService (IsAnalyzing is a single-instance flag, so a shared instance whose task is abandoned on timeout would block every other server). - ConcurrentDictionary<int,byte> in-flight tracking, cleared only on real task completion via ContinueWith — a permanently-hung server is parked, not relaunched. Accepted limitation, matches Lite Stage 2. - Task.WhenAny per-server timeout with a linked CTS so the orphaned Task.Delay timer is cancelled the moment analyze wins the race. - _cycleRunning re-entrancy guard against the timer firing again while a slow cycle is still in progress. - Stop() cancels the CTS so shutdown drops out cleanly instead of waiting out the full per-server timeout window. Notifier + formatter (Dashboard/Services/AnalysisNotificationService.cs, new — sealed, with FindingMessageFormatter as internal static in the same file): - {serverId}:{StoryPathHash} cooldown so a recurring finding does not re-notify every analysis cycle. In-memory only; lost on app restart (accepted limitation, matches Lite). - Diagnosis routed into structured AlertContext.Details rows (Story, Severity, Notify threshold, Confidence, Facts, Database, Window). Dashboard's TrySendAlertEmailAsync has no detailText parameter, so the structured route replaces Lite's detailText payload — renders inside the same 600px email template the threshold alerts use. - BuildContext walks finding.DrillDown values as JsonElement via a System.Text.Json round-trip — robust to any drill-down shape. - ServerId lookup: resolve ServerConnection.Id by ServerName for the alert log so keys line up with the existing threshold-alert engine (code-reviewer caught this). Falls back to the stable int id if the server was removed between cycle start and notify. - RecordAlert fallback when no channel is configured: round-3 review caught a webhook-flag-on-with-URL-absent silent-drop. The fallback predicate now requires both the prefs flag AND the URL via WebhookAlertService.GetTeamsWebhookUrl / GetSlackWebhookUrl so the Alerts history tab still logs findings when no channel can send. Documents the asymmetric-cooldown limitation (SMTP cooldown can suppress a RecordAlert that the analysis cooldown would have allowed) as accepted. - FindingMessageFormatter stays static — notifyThreshold is threaded through as a method parameter rather than read inline (Dashboard has no App.* statics like Lite). UI + settings (Dashboard/Models/UserPreferences.cs + Dashboard/SettingsWindow.xaml + .xaml.cs): - Five new properties: AnalysisNotificationsEnabled (false), AnalysisIntervalMinutes (30), AnalysisNotifySeverity (1.5), AnalysisNotifyCooldownMinutes (360), AnalysisTimeoutSeconds (120). - Bounds enforced at consumption (AnalysisScheduler.Configure clamps interval 5-360 and timeout 30-600; NotifyAsync clamps severity 0-2 and cooldown 30-10080). UI also validates the three exposed values with the same range constants. Cooldown and timeout are intentionally preferences.json-only. - "Automated Analysis Notifications" group added after the MCP section on the General tab — enable checkbox + interval (minutes) + severity threshold (0.0-2.0). - Ships disabled. MainWindow wiring (Dashboard/MainWindow.xaml.cs): - Two new fields: _analysisNotificationService and _analysisScheduler. Names chosen to avoid colliding with _notificationService (tray service constructed in MainWindow_Loaded). - Constructed in the ctor right after _alertCheckTimer — all deps (_serverManager, _credentialService, _preferencesService, _emailAlertService) exist by that point. - _analysisScheduler.Configure() called in MainWindow_Loaded after ConfigureAlertCheckTimer, and again on settings save — paralleling the existing _alertCheckTimer pattern so a toggle in Settings takes immediate effect without restart. - _analysisScheduler.Stop() in MainWindow_Closing. EmailTemplateBuilder.cs header rename: - Hard-coded "RECENT EVENTS" was misleading for both threshold alerts (the section already carried non-event content like sessions and TempDB metrics) and findings. Renamed to "DETAILS" / "--- Details ---" in both the HTML and plain-text bodies (round-3 review caught the plain-text occurrence the round-2 patch missed). Webhook rendering is unaffected — it already flattened AlertContext.Details without using this header. Verification: - dotnet build Dashboard/Dashboard.csproj -c Debug: clean (0 warnings, 0 errors). - code-reviewer pass: surfaced ServerId-format-mismatch and a few minor items; the real bug and two minor cleanups (linked CTS, Stop() on close, asymmetric-cooldown comment) applied. - avalonia-gotcha-reviewer: confirmed pure WPF, no Avalonia traps. Three Lite stages now have full Dashboard equivalents through both PR (a) and PR (b). The original four problem scenarios — long blocking chains, parameter sensitivity / plan regressions, resource contention including the compile-wait fact — are detected by both apps and routed into the notification surface on both apps. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rd alert-toast parity
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Brings the Dashboard analysis engine to parity with Lite across all three triage workstreams that have shipped on the Lite side:
Two commits — split logically the same way the design plan recommended:
c0cb2d5Port Lite Stage 1+3 analysis detectors to Dashboard (engine + collectors + reconstructor + drill-downs + ServerIdHelper + 6-site MCP swap to deterministic FNV-1a hash + sync-checker agent)1f08b5bWire scheduled analysis + finding notifications (Dashboard Stage 2 — AnalysisNotificationService + FindingMessageFormatter + AnalysisScheduler + UserPreferences extensions + SettingsWindow group + RecordAlert-when-channels-absent fallback + EmailTemplateBuilder header rename)Design history
Designed in
nifty-honking-coral.mdand revised across three adversarial review rounds. Round 1 caught the FormatBaselineContext / RootFactMetadata gaps; round 2 moved the.NET 10GetHashCoderandomization in-scope and restructuredFindingMessageFormatterfor Dashboard'sAlertContextshape; round 3 caught the webhook-URL-absent silent-drop and the plain-textRecent Eventsheader (both fixes shipped here).After implementation, a
code-reviewerpass surfaced aServerIdformat mismatch (commit-history alert log used the FNV-1a int while existing alert sites usedServerConnection.Id) — fixed by resolving viaIServerManagerlookup, with FNV-1a fallback.Cross-cutting Dashboard fixes that go with the port
Dashboard/Services/ServerIdHelper.cs— FNV-1a stable hash ported from Lite (RemoteCollectorService.cs:930-941). All six existingServerName.GetHashCode()call sites inDashboard/Mcp/McpAnalysisTools.csswapped to it.string.GetHashCode()is randomized per process on .NET 10, so persistedconfig.analysis_findings.server_idandconfig.analysis_mutedrows would not match the next launch's value. One-time reset; new writes stable across restart and consistent between MCP and scheduled paths.Dashboard/Analysis/AnalysisModels.cs—RootFactMetadataadded to bothAnalysisStoryandAnalysisFinding(ephemeral, not persisted). Populated inInferenceEngine.BuildStory, carried intoSaveFindingsAsync— used by the finding message formatter.Dashboard/Mcp/McpAnalysisTools.cs—FormatBaselineContextandDayNamesported from Lite ontoToolRecommendations. Fully-qualifiedSystem.Collections.Generic.Dictionary<…>to match the file's existing style (nousingadded).Dashboard/Services/EmailTemplateBuilder.cs— hard-coded"RECENT EVENTS"HTML span and"--- Recent Events ---"plain-text header renamed to"DETAILS"/"--- Details ---". Generic header fits both threshold alerts and analysis findings; existing alert sites all populateAlertContext.Detailswith non-event content already.Drift mitigation
Dashboard/Analysis/BlockingChainReconstructor.csis a verbatim copy ofLite/Analysis/BlockingChainReconstructor.cs(29 diff lines, all in the header). A new user-levelblocking-reconstructor-sync-checkeragent at~/.claude/agents/blocking-reconstructor-sync-checker.mdflags any future drift between the two copies, mirroring the existingplanalyzer-sync-checker.Verification
dotnet build Dashboard/Dashboard.csproj -c Debugclean (0 warnings, 0 errors).dotnet build Lite/PerformanceMonitorLite.csproj -c Debugclean.dotnet test Lite.Tests— 260/260 pass (Lite untouched).docs-first-verifierconfirmed T-SQL constructs (DECOMPRESS + LEFT/CAST, ROW_NUMBER CTE in place of QUALIFY, OFFSET/FETCH in place of LIMIT,MAX(VALUES)in place ofGREATEST,MAX(CAST(bit AS tinyint))in place ofbool_or, descendant XQuery axis,varcharliteral foractivity = 'blocked').code-reviewerpass on the full diff: clean, no correctness issues.avalonia-gotcha-reviewer: pure WPF, no Avalonia traps.Out of scope
ANOMALY_MEMORY_PRESSUREarm that Lite has and Dashboard'sScoreAnomalyFactomits — pre-existing parity gap, separate planned workstream (dashboard-analysis-parity-sweep.md).Dashboard.Testssource — separate planned workstream (dashboard-tests-bootstrap.md).persistent-notification-cooldowns.md).🤖 Generated with Claude Code