Skip to content

Port Lite Stage 1+2+3 analysis detectors and scheduled notifications to Dashboard#997

Merged
erikdarlingdata merged 4 commits into
devfrom
feature/dashboard-port-detectors
May 26, 2026
Merged

Port Lite Stage 1+2+3 analysis detectors and scheduled notifications to Dashboard#997
erikdarlingdata merged 4 commits into
devfrom
feature/dashboard-port-detectors

Conversation

@erikdarlingdata
Copy link
Copy Markdown
Owner

Brings the Dashboard analysis engine to parity with Lite across all three triage workstreams that have shipped on the Lite side:

  • Stage 1 — PARAMETER_SENSITIVITY and PLAN_REGRESSION detectors (PR Add plan-regression and parameter-sensitivity triage detectors (Lite) #975 merged Lite-side).
  • Stage 2 — scheduled analysis + finding notifications routed through the existing email/webhook channels (Lite-side committed locally, not yet on dev).
  • Stage 3 — BLOCKING_CHAIN reconstruction and RESOURCE_SEMAPHORE_QUERY_COMPILE wait scoring (Lite-side committed locally, not yet on dev).

Two commits — split logically the same way the design plan recommended:

  • c0cb2d5 Port Lite Stage 1+3 analysis detectors to Dashboard (engine + collectors + reconstructor + drill-downs + ServerIdHelper + 6-site MCP swap to deterministic FNV-1a hash + sync-checker agent)
  • 1f08b5b Wire scheduled analysis + finding notifications (Dashboard Stage 2 — AnalysisNotificationService + FindingMessageFormatter + AnalysisScheduler + UserPreferences extensions + SettingsWindow group + RecordAlert-when-channels-absent fallback + EmailTemplateBuilder header rename)

Design history

Designed in nifty-honking-coral.md and revised across three adversarial review rounds. Round 1 caught the FormatBaselineContext / RootFactMetadata gaps; round 2 moved the .NET 10 GetHashCode randomization in-scope and restructured FindingMessageFormatter for Dashboard's AlertContext shape; round 3 caught the webhook-URL-absent silent-drop and the plain-text Recent Events header (both fixes shipped here).

After implementation, a code-reviewer pass surfaced a ServerId format mismatch (commit-history alert log used the FNV-1a int while existing alert sites used ServerConnection.Id) — fixed by resolving via IServerManager lookup, with FNV-1a fallback.

Cross-cutting Dashboard fixes that go with the port

  • Dashboard/Services/ServerIdHelper.cs — FNV-1a stable hash ported from Lite (RemoteCollectorService.cs:930-941). All six existing ServerName.GetHashCode() call sites in Dashboard/Mcp/McpAnalysisTools.cs swapped to it. string.GetHashCode() is randomized per process on .NET 10, so persisted config.analysis_findings.server_id and config.analysis_muted rows would not match the next launch's value. One-time reset; new writes stable across restart and consistent between MCP and scheduled paths.
  • Dashboard/Analysis/AnalysisModels.csRootFactMetadata added to both AnalysisStory and AnalysisFinding (ephemeral, not persisted). Populated in InferenceEngine.BuildStory, carried into SaveFindingsAsync — used by the finding message formatter.
  • Dashboard/Mcp/McpAnalysisTools.csFormatBaselineContext and DayNames ported from Lite onto ToolRecommendations. Fully-qualified System.Collections.Generic.Dictionary<…> to match the file's existing style (no using added).
  • Dashboard/Services/EmailTemplateBuilder.cs — hard-coded "RECENT EVENTS" HTML span and "--- Recent Events ---" plain-text header renamed to "DETAILS" / "--- Details ---". Generic header fits both threshold alerts and analysis findings; existing alert sites all populate AlertContext.Details with non-event content already.

Drift mitigation

Dashboard/Analysis/BlockingChainReconstructor.cs is a verbatim copy of Lite/Analysis/BlockingChainReconstructor.cs (29 diff lines, all in the header). A new user-level blocking-reconstructor-sync-checker agent at ~/.claude/agents/blocking-reconstructor-sync-checker.md flags any future drift between the two copies, mirroring the existing planalyzer-sync-checker.

Verification

  • dotnet build Dashboard/Dashboard.csproj -c Debug clean (0 warnings, 0 errors).
  • dotnet build Lite/PerformanceMonitorLite.csproj -c Debug clean.
  • dotnet test Lite.Tests — 260/260 pass (Lite untouched).
  • docs-first-verifier confirmed T-SQL constructs (DECOMPRESS + LEFT/CAST, ROW_NUMBER CTE in place of QUALIFY, OFFSET/FETCH in place of LIMIT, MAX(VALUES) in place of GREATEST, MAX(CAST(bit AS tinyint)) in place of bool_or, descendant XQuery axis, varchar literal for activity = 'blocked').
  • code-reviewer pass on the full diff: clean, no correctness issues.
  • avalonia-gotcha-reviewer: pure WPF, no Avalonia traps.

Out of scope

  • ANOMALY_MEMORY_PRESSURE arm that Lite has and Dashboard's ScoreAnomalyFact omits — pre-existing parity gap, separate planned workstream (dashboard-analysis-parity-sweep.md).
  • Dashboard.Tests source — separate planned workstream (dashboard-tests-bootstrap.md).
  • Persistent finding-notification cooldowns across restarts — separate planned workstream (persistent-notification-cooldowns.md).
  • SQL-side blocked-process-pair XML re-parse elimination — the schema half landed in Add typed blocker-side columns to blocking_BlockedProcessReport #996; the C# read-path simplification is gated by this PR landing first and is the next planned PR.

🤖 Generated with Claude Code

erikdarlingdata and others added 4 commits May 26, 2026 11:55
PR (a) of the Dashboard port — engine seams + the two new fact paths that
do not depend on the scheduler/notification layer. Stage 2 (scheduled
analysis + finding notifications) drops on top of this as PR (b).

Lite-side workstreams ported here:

Stage 1 — Parameter sensitivity and plan regression:
- PARAMETER_SENSITIVITY: detects a single cached plan whose per-execution
  worker time varies wildly (classic parameter sniffing). Sourced from
  collect.query_stats; magnitude-driven scoring so a lone catastrophic
  offender still scores high. Drill-down surfaces the top five plans.
- PLAN_REGRESSION: detects a query whose currently-active plan has
  per-execution cost >= 2x the best plan that query previously used.
  Sourced from collect.query_store_data. Uses a 14-day comparison window
  (independent of the standard analysis window) so the days-old "best
  plan" baseline is present.

Stage 3 — Blocking chains and compile-wait scoring:
- BlockingChainReconstructor (new, pure static) walks blocked-process
  pairs into chains using composite (spid, last_tran_started) session
  identity so a reused SPID with a different transaction start cannot
  fabricate a chain. The 1900-01-01 "no transaction" sentinel is
  normalized to NULL.
- CollectBlockingChainFactsAsync re-parses blocked_process_report_xml
  at analysis time (no SQL-side typed columns yet) and emits one
  aggregate BLOCKING_CHAIN fact carrying the worst chain's apex, depth,
  and victim count — structure BLOCKING_EVENTS (a rate) is blind to.
- Scoring is magnitude-driven: Max(depth, victims) so one severe
  dimension scores high without being diluted. Amplifiers cover a
  sleeping apex, deadlocks present, and THREADPOOL. Forward and reverse
  edges connect BLOCKING_CHAIN to LCK / THREADPOOL / DEADLOCKS /
  BLOCKING_EVENTS.
- RESOURCE_SEMAPHORE_QUERY_COMPILE wait fact: already collected but
  unscored. Added a ramped (0.01, 0.10) threshold, compile-specific
  amplifiers (CPU and scheduler signals, not the runtime-grant
  amplifiers that would mislead), and edges to SOS_SCHEDULER_YIELD
  and CPU_SQL_PERCENT.

Cross-cutting Dashboard fixes:
- Ported Lite's GetDeterministicHashCode (FNV-1a) into a new
  ServerIdHelper. Swapped all six existing ServerName.GetHashCode()
  call sites in McpAnalysisTools to use it — string.GetHashCode() is
  randomized per process on .NET 10, so persisted
  config.analysis_findings.server_id and config.analysis_muted rows
  did not match the next launch's value for the same server name.
  New writes are stable; existing rows are a one-time reset
  (they were already broken across restart with the random hash).
- Added RootFactMetadata to both AnalysisStory and AnalysisFinding
  (ephemeral, not persisted). Populated in InferenceEngine.BuildStory
  and carried into SaveFindingsAsync — used by Stage 2's finding
  message formatter (PR (b)).
- Ported FormatBaselineContext and DayNames onto ToolRecommendations.
  Currently no caller; staged for the notification path in PR (b).

T-SQL adaptations from Lite's DuckDB queries verified against MS Learn:
- DECOMPRESS + LEFT(CAST(... AS NVARCHAR(MAX)), N) for query_text /
  query_sql_text (page-compressed varbinary(max)).
- WITH ... ROW_NUMBER() OVER ... WHERE rn = 1 in place of QUALIFY.
- OFFSET 0 ROWS FETCH NEXT N ROWS ONLY in place of LIMIT.
- (SELECT MAX(v) FROM (VALUES (a), (b)) AS x(v)) in place of GREATEST
  (SQL Server has no GREATEST before 2022).
- MAX(CAST(is_forced_plan AS tinyint)) in place of bool_or (bit type
  is invalid for MAX in T-SQL).
- query_plan_hash moved into GROUP BY rather than aggregated (MS Learn's
  MAX docs do not list binary types; using a documented construct).
- query_sql_text fetched via OUTER APPLY rather than carried through
  aggregations (MAX is invalid on varbinary(max)).
- TOP (5000) ORDER BY collection_time DESC over
  collect.blocking_BlockedProcessReport — backward CIX scan, sort-free
  (clustered index leads on collection_time).
- activity = 'blocked' (varchar literal, no N prefix) preserves the
  sargable predicate (an N-prefixed literal would force an nvarchar
  promotion of the column and break any future index seek).

Drift mitigation: BlockingChainReconstructor.cs is a verbatim copy of
the Lite file (29 diff lines, all in the header). A user-level
blocking-reconstructor-sync-checker agent at
C:\Users\edarl\.claude\agents\ flags any drift between the two
copies, mirroring the existing planalyzer-sync-checker.

Verification:
- dotnet build Dashboard/Dashboard.csproj -c Debug: clean (0 warnings,
  0 errors).
- dotnet test Lite.Tests/Lite.Tests.csproj: 260/260 pass (Lite untouched).
- Reconstructor body parity: in sync with the Lite worktree.
- T-SQL verification via docs-first-verifier against MS Learn.
- C# correctness pass via code-reviewer: no issues.

Designed in a plan file and revised across three adversarial review
rounds. Round 1 caught the FormatBaselineContext/RootFactMetadata gaps;
round 2 moved the .NET 10 GetHashCode randomization in-scope and
restructured FindingMessageFormatter for Stage 2's AlertContext shape;
round 3 caught the webhook-URL-absent silent-drop and the plain-text
"Recent Events" header (both fixes land in PR (b)).

Stage 2 — scheduled analysis + AnalysisNotificationService + Settings UI
+ RecordAlert-when-channels-absent + EmailTemplateBuilder header rename
— is the committed follow-on (PR (b)).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR (b) of the Dashboard port — drops on top of c0cb2d5. Closes the "act"
stage: high-severity findings produced by the engine in PR (a) now ride
the same email/Slack/Teams channels (and Alerts history tab) as the
existing threshold alerts, on a configurable schedule.

Scheduler (Dashboard/Services/AnalysisScheduler.cs, new):
- Owns its own DispatcherTimer — separate cadence and gating from
  _alertCheckTimer. Gated by AnalysisNotificationsEnabled; zero cost
  when off.
- Per-server fresh AnalysisService (IsAnalyzing is a single-instance
  flag, so a shared instance whose task is abandoned on timeout would
  block every other server).
- ConcurrentDictionary<int,byte> in-flight tracking, cleared only on
  real task completion via ContinueWith — a permanently-hung server is
  parked, not relaunched. Accepted limitation, matches Lite Stage 2.
- Task.WhenAny per-server timeout with a linked CTS so the orphaned
  Task.Delay timer is cancelled the moment analyze wins the race.
- _cycleRunning re-entrancy guard against the timer firing again
  while a slow cycle is still in progress.
- Stop() cancels the CTS so shutdown drops out cleanly instead of
  waiting out the full per-server timeout window.

Notifier + formatter (Dashboard/Services/AnalysisNotificationService.cs,
new — sealed, with FindingMessageFormatter as internal static in the
same file):
- {serverId}:{StoryPathHash} cooldown so a recurring finding does not
  re-notify every analysis cycle. In-memory only; lost on app restart
  (accepted limitation, matches Lite).
- Diagnosis routed into structured AlertContext.Details rows (Story,
  Severity, Notify threshold, Confidence, Facts, Database, Window).
  Dashboard's TrySendAlertEmailAsync has no detailText parameter, so
  the structured route replaces Lite's detailText payload — renders
  inside the same 600px email template the threshold alerts use.
- BuildContext walks finding.DrillDown values as JsonElement via a
  System.Text.Json round-trip — robust to any drill-down shape.
- ServerId lookup: resolve ServerConnection.Id by ServerName for the
  alert log so keys line up with the existing threshold-alert engine
  (code-reviewer caught this). Falls back to the stable int id if the
  server was removed between cycle start and notify.
- RecordAlert fallback when no channel is configured: round-3 review
  caught a webhook-flag-on-with-URL-absent silent-drop. The fallback
  predicate now requires both the prefs flag AND the URL via
  WebhookAlertService.GetTeamsWebhookUrl / GetSlackWebhookUrl so the
  Alerts history tab still logs findings when no channel can send.
  Documents the asymmetric-cooldown limitation (SMTP cooldown can
  suppress a RecordAlert that the analysis cooldown would have
  allowed) as accepted.
- FindingMessageFormatter stays static — notifyThreshold is threaded
  through as a method parameter rather than read inline (Dashboard
  has no App.* statics like Lite).

UI + settings (Dashboard/Models/UserPreferences.cs +
Dashboard/SettingsWindow.xaml + .xaml.cs):
- Five new properties: AnalysisNotificationsEnabled (false),
  AnalysisIntervalMinutes (30), AnalysisNotifySeverity (1.5),
  AnalysisNotifyCooldownMinutes (360), AnalysisTimeoutSeconds (120).
- Bounds enforced at consumption (AnalysisScheduler.Configure clamps
  interval 5-360 and timeout 30-600; NotifyAsync clamps severity 0-2
  and cooldown 30-10080). UI also validates the three exposed values
  with the same range constants. Cooldown and timeout are intentionally
  preferences.json-only.
- "Automated Analysis Notifications" group added after the MCP section
  on the General tab — enable checkbox + interval (minutes) + severity
  threshold (0.0-2.0).
- Ships disabled.

MainWindow wiring (Dashboard/MainWindow.xaml.cs):
- Two new fields: _analysisNotificationService and _analysisScheduler.
  Names chosen to avoid colliding with _notificationService (tray
  service constructed in MainWindow_Loaded).
- Constructed in the ctor right after _alertCheckTimer — all deps
  (_serverManager, _credentialService, _preferencesService,
  _emailAlertService) exist by that point.
- _analysisScheduler.Configure() called in MainWindow_Loaded after
  ConfigureAlertCheckTimer, and again on settings save — paralleling
  the existing _alertCheckTimer pattern so a toggle in Settings takes
  immediate effect without restart.
- _analysisScheduler.Stop() in MainWindow_Closing.

EmailTemplateBuilder.cs header rename:
- Hard-coded "RECENT EVENTS" was misleading for both threshold alerts
  (the section already carried non-event content like sessions and
  TempDB metrics) and findings. Renamed to "DETAILS" / "--- Details ---"
  in both the HTML and plain-text bodies (round-3 review caught the
  plain-text occurrence the round-2 patch missed). Webhook rendering
  is unaffected — it already flattened AlertContext.Details without
  using this header.

Verification:
- dotnet build Dashboard/Dashboard.csproj -c Debug: clean (0 warnings,
  0 errors).
- code-reviewer pass: surfaced ServerId-format-mismatch and a few
  minor items; the real bug and two minor cleanups (linked CTS,
  Stop() on close, asymmetric-cooldown comment) applied.
- avalonia-gotcha-reviewer: confirmed pure WPF, no Avalonia traps.

Three Lite stages now have full Dashboard equivalents through both
PR (a) and PR (b). The original four problem scenarios — long blocking
chains, parameter sensitivity / plan regressions, resource contention
including the compile-wait fact — are detected by both apps and routed
into the notification surface on both apps.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@erikdarlingdata erikdarlingdata merged commit 45fb5d8 into dev May 26, 2026
2 checks passed
@erikdarlingdata erikdarlingdata deleted the feature/dashboard-port-detectors branch May 26, 2026 22:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant