diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md index a7da9faf..0fbd81c0 100644 --- a/.github/copilot-instructions.md +++ b/.github/copilot-instructions.md @@ -125,6 +125,7 @@ For complex investigation tasks, use these skills (read the skill file for detai |-------|----------|----------| | **codebase-researcher** | `.github/skills/codebase-researcher/SKILL.md` | "where is X implemented", "how does Y work", "trace the flow of", data flow investigation | | **incident-investigator** | `.github/skills/incident-investigator/SKILL.md` | IcM incidents, customer-reported issues, authentication failures | +| **aria-alert-investigator** | `.github/skills/aria-alert-investigator/SKILL.md` | "Aria detected an incident", "investigate Aria alert", "health metric incident", telemetry threshold breach | | **kusto-analyst** | `.github/skills/kusto-analyst/SKILL.md` | "query Kusto", "analyze telemetry", "check android_spans", eSTS correlation, latency investigation | | **feature-planner** | `.github/skills/feature-planner/SKILL.md` | "plan this feature", "break this down into PBIs", "decompose this into tasks", feature decomposition | | **pbi-creator** | `.github/skills/pbi-creator/SKILL.md` | "create the PBIs", "create work items", "push PBIs to ADO", approved plan → ADO work items | diff --git a/.github/skills/aria-alert-investigator/SKILL.md b/.github/skills/aria-alert-investigator/SKILL.md new file mode 100644 index 00000000..f045e827 --- /dev/null +++ b/.github/skills/aria-alert-investigator/SKILL.md @@ -0,0 +1,203 @@ +--- +name: aria-alert-investigator +description: Investigate Aria health-metric alerts (threshold-based IcMs of the form "Aria detected an incident in <project> for <metric>"). Use this skill when an IcM was triggered by Aria's threshold monitor on android_spans / android_metrics — NOT for customer-reported authentication failures.
Triggers include "investigate Aria alert", "Aria detected an incident", "health metric incident", "telemetry threshold breach", or any IcM where the title starts with "Aria detected". +--- + +# Aria Alert Investigator + +Investigate Aria health-metric alerts evidence-first, without anchoring on guesses. + +## When to use this skill vs. the others + +| Incident shape | Use | |----------------|-----| | Customer/partner reports auth failure, missing PRT, sign-out loop, etc. | `incident-investigator` | | You need to write Kusto queries for any reason | `kusto-analyst` (this skill calls into it) | | **IcM title is "Aria detected an incident in `<project>` for `<metric>`"** | **This skill** | | User says "investigate this Aria alert" / "health metric IcM fired" | **This skill** | + +Aria alerts are **threshold-based monitors on telemetry** — they fire when a metric value crosses a pre-set baseline, not when a user reports a problem. The investigation pattern is fundamentally different — there is often no customer, no error chain, and no log file. The signal is "the curve moved" and the job is to determine whether the underlying data actually moved, and if so, why. + +--- + +## Core principles + +### Principle 1 — Never guess the metric definition silently + +Aria health-metric names are marketing-style labels (`SDM - timed_out_execution`, `failed_get_current_account_operation_count`). They do **not** uniquely identify a Kusto slice. + +**Rule:** Before running any investigation query, attempt to decode the metric name into a `(table, span_name, error_code, dimension filters, …)` tuple. If multiple candidates plausibly match, **ask the user to confirm the exact slice** before proceeding. Only fall back to a best-evidenced guess if the user does not know — and if you do fall back, **flag the assumption explicitly in every subsequent finding** ("assuming `span_name == X`, …"). + +### Principle 2 — Treat the alert value as opaque + +Aria reports a number (e.g., `4.48 in the Red band`).
This may be a count, rate, percentile, z-score, or a synthesized statistic. Without the cube definition, **do not reverse-engineer it**. Do not say "this means 4 events." Use it only as a directional signal that "Aria thinks something deviated." + +### Principle 3 — Look at the trend from multiple angles before concluding anything + +A single view will fool you. The same daily count can mean very different things depending on traffic, device base, and seasonality. Before deciding whether a real spike exists, check the trend from several independent angles. + +Good angles to combine (pick the ones relevant to the metric): +- **Volume signals**: raw error count, total request volume on the same operation +- **Normalized signals**: error rate (errors ÷ attempts), affected device rate (error devices ÷ active devices) — these strip out the effect of traffic changes +- **Time signals**: hourly view around the alert (catches single-device retry storms), same day-of-week comparison (catches weekly seasonality), longer history to define the historical band + +**Why normalized signals matter:** error count rises can come from real regressions OR from traffic growth. Error rate isolates the former. Affected device rate further separates "one device retrying a lot" from "many devices each failing once" — these mean very different things. + +If none of the angles you checked shows a deviation outside the historical band, say exactly that and stop: + +> "I don't see a deviation in `<metric>` over the last N days. The alert grain on `<slice>` sits within the historical `<band>` band." + +You **may** offer an opinion after stating the data finding — but only when clearly separated and flagged as opinion, not as a data-driven conclusion. Example: + +> "**Data finding:** [the statement above] +> +> **My read (low confidence — based on absence of evidence rather than positive evidence):** This looks consistent with detector noise on a low-volume metric.
The team has resolved several past IcMs in this metric family the same way (threshold widened, no code change — see Step 2 results). However, I cannot rule out a real issue you have context for; please confirm before closing." + +Rules for opinions: +- State the data finding **first and separately**. +- Explicitly flag confidence (`low / medium / high`) and **why** — if it's based on absence of evidence rather than positive signal, say so. +- Do not assert "this is noise" or "this is a false positive" as fact. Frame as "this looks like" / "consistent with" / "my read is". +- Always invite the user to override with context the data does not show. + +### Principle 4 — Past IcMs are pattern signal, not verdict + +Pull past Aria IcMs in the same metric family. If they show a recurring resolution ("acknowledged as duplicate, threshold widened, no code change"), call that out as **prior pattern** — not as the answer for this one. Each alert is its own investigation. + +### Principle 5 — Compare same-day-of-week for seasonal metrics + +SDM, broker, and most auth metrics have strong weekly seasonality (workplace flows). Day-over-day comparisons against a weekend baseline produce false rises on Mondays/Tuesdays. Always include a "same day-of-week, last N weeks" view for any metric that follows business-hour patterns. + +### Principle 6 — Code context is the baseline, not an extra step + +Span names, error codes, operations, and dimensions are opaque strings without the code that emits them. Use the [`codebase-researcher`](../codebase-researcher/SKILL.md) skill **continuously throughout the workflow** to understand what each signal actually means — what an `error_code` is emitted for, what a `span_name` represents, what a `broker_operation` does, what changed between two `broker_version`s. + +Do this whenever an unfamiliar attribute, error code, or operation appears in the data.
The data only tells you *that* something moved; the code tells you *what it would mean* if it did. + +--- + +## Workflow + +Execute steps in order. Do not skip steps. + +### Step 1 — Decode the metric → KQL slice (BLOCKING) + +Read the IcM ticket and capture: **project / cube name**, **metric name**, **alert value** (keep opaque), **band**, **timestamp**, and any **past IcMs already linked**. + +Then map the metric name to a Kusto filter using these heuristics: + +| Metric name fragment | Likely filter | |----------------------|---------------| | `SDM`, `shared_device`, `SharedDevice` | `is_shared_device == true` | | `failed_<op>_count`, `failed_<op>_operation_count` | `error_code != ""` AND span/operation matching `<op>` | | `timed_out_execution` | `error_code == "timed_out_execution"` | | `<operation>_count` | Some span_name or broker_operation matching the operation | | `<span>_failures` | `span_name == <span>` AND `error_code != ""` | + +**List every candidate slice** you generated and **ask the user to confirm**. Use `vscode_askQuestions` with one option per candidate. Example: + +``` +askQuestion({ + question: "Multiple Kusto slices could back the metric 'SDM - timed_out_execution'. Which one is it?", + options: [ + { label: "is_shared_device==true AND span_name=='AcquireTokenSilent' AND error_code=='timed_out_execution'" }, + { label: "is_shared_device==true AND span_name=='ATISilently' AND error_code=='timed_out_execution'" }, + { label: "is_shared_device==true AND error_code=='timed_out_execution' (any span)" }, + { label: "I don't know — make your best guess" } + ] +}) +``` + +**Do not run any data queries until this step completes.** A wrong slice will silently invalidate every step below. + +If the user picks "I don't know," pick the candidate with the strongest evidence and **state the assumption explicitly** at the top of every subsequent finding.
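+ +Before moving on, it is worth sanity-checking that the confirmed slice returns any rows at all; an empty result means the decode was wrong, not that the metric is healthy. A minimal sketch, assuming the `android_spans` schema and the first candidate slice from the example above (substitute whatever the user confirmed): + +```kql +// Existence check for the confirmed slice. Filter values below are assumptions +// taken from the askQuestion example, not the canonical metric definition. +android_spans +| where env_time >= ago(7d) +| where is_shared_device == true +| where span_name == "AcquireTokenSilent" +| where error_code == "timed_out_execution" +| summarize events = count() by bin(env_time, 1d) +| order by env_time asc +``` + +If this returns zero rows over a full week, go back and re-decode the metric before drawing any trend conclusions.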
+ +### Step 2 — Search for past IcMs in the metric family (parallel with Step 3) + +The IcM's "duplicate of …" links from Step 1 capture what was already noted. This step goes further: actively search the DRI Copilot index for past Aria alerts on the **same or sibling** metrics, even if they aren't linked from the current ticket. + +Skip this step only if Step 1 already surfaced 3+ past IcMs in the same metric family with consistent resolutions. + +``` +mcp_mydricopilot_Android_DRI_Copilot_Project_Explorer( + message="Find past Aria alert IcMs related to <metric> in <project>. For each, return root cause, mitigation, and whether it required code change." +) +``` + +Report what the family pattern looks like (do not draw conclusions about the current IcM from it). + +### Step 3 — Investigate the trend (Principle 3) + +Delegate all Kusto query construction and execution to the [`kusto-analyst`](../kusto-analyst/SKILL.md) skill. This skill's job is to decide **what** to investigate; `kusto-analyst` decides **how** to query for it. + +Investigate freely. Combine whatever angles best characterize this metric — at minimum understand absolute volume, normalized rates (so traffic shifts can't fool you), time/seasonality, and single-device dominance. Use the angle table in Principle 3 as a starting menu, not a checklist. + +If one of the angles shows a clear deviation, ask `kusto-analyst` to drill into the dimension(s) most likely to explain it. Useful slicing dimensions: + +| Dimension | What it isolates | |-----------|------------------| | `broker_version` | Version-specific regressions | | `calling_package_name` | App-specific regressions (Teams, Outlook, etc.) |
| `DeviceInfo_Make` / `DeviceInfo_Model` | OEM-specific issues | | `DeviceInfo_OsVersion` | OS-version-specific issues | | `tenant_id` | Single-tenant issues | | `error_message` / `error_location` | Sub-categorization within the same `error_code` | + +When drilling, ask for **rates within each dimension value** (events ÷ total attempts in that slice), not raw counts — otherwise traffic-share shifts will mislead you. + +### Step 4 — Report what you found (Principle 3) + +Use this template. State data findings first, then optionally offer an opinion in the separate "My read" section. + +```markdown +## Investigation: IcM <number> — Aria alert on <metric> + +### Slice used +`<KQL filter from Step 1>` + + +### Past IcMs in this metric family + + +### Trend analysis + + +### What I see in the data + +- "I don't see a deviation in `<metric>` over the last N days. The alert grain on `<slice>` sits within the historical band." +- "Error rate moved from <old> to <new> on <date>. Affected device rate also moved (<old> → <new>). [continue for each angle that moved]" + +### My read (optional, only if you have one) + +- **Confidence**: low / medium / high +- **Based on**: positive evidence / absence of evidence / pattern match to past IcMs / etc. +- **My read**: +- **What would change my mind**: + +### Suggested next steps for the user + +- Confirm the slice if it was assumed +- Slice the data by `<dimension>` if you suspect a specific population +- Compare against telemetry on a sibling metric `<metric>` +- Check the cube definition in the Aria portal +``` + +--- + +## Tool reference + +### DRI Copilot +- `mcp_mydricopilot_Android_DRI_Copilot_Project_Explorer` — Get IcM context, find similar past IcMs, search TSGs + +### Kusto +- Delegate to the [`kusto-analyst`](../kusto-analyst/SKILL.md) skill for all KQL work — cluster/database/table identifiers, field schema, and query construction live there.
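+ +For orientation, the rate-within-dimension drill from Step 3 usually resolves, via `kusto-analyst`, to a query shaped like the sketch below. Table, column, and filter values here are assumptions; the real identifiers live in `kusto-analyst`: + +```kql +// Failure rate inside each broker_version: failures divided by attempts per version, +// so a version that merely gains traffic share is not mistaken for a regression. +android_spans +| where env_time >= ago(14d) +| where span_name == "AcquireTokenSilent" // assumed slice; use the Step 1 result +| summarize attempts = count(), failures = countif(error_code != "") by broker_version, day = bin(env_time, 1d) +| extend failure_rate = todouble(failures) / attempts +| order by day asc, failure_rate desc +```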
+ +--- + diff --git a/.github/skills/kusto-analyst/SKILL.md b/.github/skills/kusto-analyst/SKILL.md index a0083f75..b40f3685 100644 --- a/.github/skills/kusto-analyst/SKILL.md +++ b/.github/skills/kusto-analyst/SKILL.md @@ -7,6 +7,10 @@ description: Analyze Android authentication telemetry using Azure Data Explorer Analyze Android authentication telemetry using Azure Data Explorer (Kusto) for error analysis, latency investigation, and cross-cluster correlation. +## Working an Aria health-metric alert? + +If the user is investigating an IcM titled "Aria detected an incident in `<project>` for `<metric>`", use the [`aria-alert-investigator`](../aria-alert-investigator/SKILL.md) skill. It defines the trend angles to combine (raw counts, normalized error and device rates, same-day-of-week seasonality) and the rules for confirming the metric slice before running any queries. This skill provides the underlying Kusto reference but not the Aria-specific workflow. + ## Available MCP Tools **Always use these tools to execute Kusto queries:** @@ -252,7 +256,7 @@ android_spans ### Android-Specific Filtering -**⚠️ ALWAYS filter by Android platform:** +**⚠️ ALWAYS filter by Android platform unless an explicit platform is specified:** ```kql AllPerRequestTable | where env_time >= ago(7d)