diff --git a/docs/README.skills.md b/docs/README.skills.md index b66b3bd76..5b8983bf7 100644 --- a/docs/README.skills.md +++ b/docs/README.skills.md @@ -45,15 +45,15 @@ See [CONTRIBUTING.md](../CONTRIBUTING.md#adding-skills) for guidelines on how to | [arch-linux-triage](../skills/arch-linux-triage/SKILL.md)
`gh skills install github/awesome-copilot arch-linux-triage` | Triage and resolve Arch Linux issues with pacman, systemd, and rolling-release best practices. | None | | [architecture-blueprint-generator](../skills/architecture-blueprint-generator/SKILL.md)
`gh skills install github/awesome-copilot architecture-blueprint-generator` | Comprehensive project architecture blueprint generator that analyzes codebases to create detailed architectural documentation. Automatically detects technology stacks and architectural patterns, generates visual diagrams, documents implementation patterns, and provides extensible blueprints for maintaining architectural consistency and guiding new development. | None | | [arduino-azure-iot-edge-integration](../skills/arduino-azure-iot-edge-integration/SKILL.md)
`gh skills install github/awesome-copilot arduino-azure-iot-edge-integration` | Design and implement Arduino integration with Azure IoT Hub and IoT Edge, including secure provisioning, resilient telemetry, command handling, and production guardrails. | `references/arduino-iot-checklist.md`
`references/arduino-official-best-practices.md` | -| [arize-ai-provider-integration](../skills/arize-ai-provider-integration/SKILL.md)
`gh skills install github/awesome-copilot arize-ai-provider-integration` | INVOKE THIS SKILL when creating, reading, updating, or deleting Arize AI integrations. Covers listing integrations, creating integrations for any supported LLM provider (OpenAI, Anthropic, Azure OpenAI, AWS Bedrock, Vertex AI, Gemini, NVIDIA NIM, custom), updating credentials or metadata, and deleting integrations using the ax CLI. | `references/ax-profiles.md`
`references/ax-setup.md` | -| [arize-annotation](../skills/arize-annotation/SKILL.md)
`gh skills install github/awesome-copilot arize-annotation` | INVOKE THIS SKILL when creating, managing, or using annotation configs or annotation queues on Arize (categorical, continuous, freeform), or applying human annotations to project spans via the Python SDK. Configs are the label schema for human feedback; queues are review workflows that route records to annotators. Triggers: annotation config, annotation queue, label schema, human feedback schema, bulk annotate spans, update_annotations, labeling queue, annotate record. | `references/ax-profiles.md`
`references/ax-setup.md` | -| [arize-dataset](../skills/arize-dataset/SKILL.md)
`gh skills install github/awesome-copilot arize-dataset` | INVOKE THIS SKILL when creating, managing, or querying Arize datasets and examples. Also use when the user needs test data or evaluation examples for their model. Covers dataset CRUD, appending examples, exporting data, and file-based dataset creation using the ax CLI. | `references/ax-profiles.md`
`references/ax-setup.md` | -| [arize-evaluator](../skills/arize-evaluator/SKILL.md)
`gh skills install github/awesome-copilot arize-evaluator` | INVOKE THIS SKILL for LLM-as-judge evaluation workflows on Arize: creating/updating evaluators, running evaluations on spans or experiments, tasks, trigger-run, column mapping, and continuous monitoring. Use when the user says: create an evaluator, LLM judge, hallucination/faithfulness/correctness/relevance, run eval, score my spans or experiment, ax tasks, trigger-run, trigger eval, column mapping, continuous monitoring, query filter for evals, evaluator version, or improve an evaluator prompt. | `references/ax-profiles.md`
`references/ax-setup.md` | -| [arize-experiment](../skills/arize-experiment/SKILL.md)
`gh skills install github/awesome-copilot arize-experiment` | INVOKE THIS SKILL when creating, running, or analyzing Arize experiments. Also use when the user wants to evaluate or measure model performance, compare models (including GPT-4, Claude, or others), or assess how well their AI is doing. Covers experiment CRUD, exporting runs, comparing results, and evaluation workflows using the ax CLI. | `references/ax-profiles.md`
`references/ax-setup.md` | -| [arize-instrumentation](../skills/arize-instrumentation/SKILL.md)
`gh skills install github/awesome-copilot arize-instrumentation` | INVOKE THIS SKILL when adding Arize AX tracing or observability to an app for the first time, or when the user wants to instrument their LLM app or get started with LLM observability. Follow the Agent-Assisted Tracing two-phase flow: analyze the codebase (read-only), then implement after user confirmation. When the app uses LLM tool/function calling, add manual CHAIN + TOOL spans. Leverages https://arize.com/docs/ax/alyx/tracing-assistant and https://arize.com/docs/PROMPT.md. | `references/ax-profiles.md` | -| [arize-link](../skills/arize-link/SKILL.md)
`gh skills install github/awesome-copilot arize-link` | Generate deep links to the Arize UI. Use when the user wants a clickable URL to open or share a specific trace, span, session, dataset, labeling queue, evaluator, or annotation config, or when sharing Arize resources with team members. | `references/EXAMPLES.md` | -| [arize-prompt-optimization](../skills/arize-prompt-optimization/SKILL.md)
`gh skills install github/awesome-copilot arize-prompt-optimization` | INVOKE THIS SKILL when optimizing, improving, or debugging LLM prompts using production trace data, evaluations, and annotations. Also use when the user wants to make their AI respond better or improve AI output quality. Covers extracting prompts from spans, gathering performance signal, and running a data-driven optimization loop using the ax CLI. | `references/ax-profiles.md`
`references/ax-setup.md` | -| [arize-trace](../skills/arize-trace/SKILL.md)
`gh skills install github/awesome-copilot arize-trace` | INVOKE THIS SKILL when downloading, exporting, or inspecting Arize traces and spans, or when a user wants to look at what their LLM app is doing using existing trace data, or when an already-instrumented app has a bug or error to investigate. Use for debugging unknown runtime issues, failures, and behavior regressions. Covers exporting traces by ID, spans by ID, sessions by ID, and root-cause investigation with the ax CLI. | `references/ax-profiles.md`
`references/ax-setup.md` | +| [arize-ai-provider-integration](../skills/arize-ai-provider-integration/SKILL.md)
`gh skills install github/awesome-copilot arize-ai-provider-integration` | Creates, reads, updates, and deletes Arize AI integrations that store LLM provider credentials used by evaluators and other Arize features. Supports any LLM provider (e.g. OpenAI, Anthropic, Azure OpenAI, AWS Bedrock, Vertex AI, Gemini, NVIDIA NIM). Use when the user mentions AI integration, LLM provider credentials, create integration, list integrations, update credentials, delete integration, or connecting an LLM provider to Arize. | `references/ax-profiles.md`
`references/ax-setup.md` | +| [arize-annotation](../skills/arize-annotation/SKILL.md)
`gh skills install github/awesome-copilot arize-annotation` | Creates and manages annotation configs (categorical, continuous, freeform label schemas) and annotation queues (human review workflows) on Arize. Applies human annotations to project spans via the Python SDK. Use when the user mentions annotation config, annotation queue, label schema, human feedback, bulk annotate spans, update_annotations, labeling queue, annotate record, or human review. | `references/ax-profiles.md`
`references/ax-setup.md` | +| [arize-dataset](../skills/arize-dataset/SKILL.md)
`gh skills install github/awesome-copilot arize-dataset` | Creates, manages, and queries Arize datasets and examples. Covers dataset CRUD, appending examples, exporting data, and file-based dataset creation using the ax CLI. Use when the user needs test data, evaluation examples, or mentions create dataset, list datasets, export dataset, append examples, dataset version, golden dataset, or test set. | `references/ax-profiles.md`
`references/ax-setup.md` | +| [arize-evaluator](../skills/arize-evaluator/SKILL.md)
`gh skills install github/awesome-copilot arize-evaluator` | Handles LLM-as-judge evaluation workflows on Arize including creating/updating evaluators, running evaluations on spans or experiments, managing tasks, trigger-run operations, column mapping, and continuous monitoring. Use when the user mentions create evaluator, LLM judge, hallucination, faithfulness, correctness, relevance, run eval, score spans, score experiment, trigger-run, column mapping, continuous monitoring, or improve evaluator prompt. | `references/ax-profiles.md`
`references/ax-setup.md` | +| [arize-experiment](../skills/arize-experiment/SKILL.md)
`gh skills install github/awesome-copilot arize-experiment` | Creates, runs, and analyzes Arize experiments for evaluating and comparing model performance. Covers experiment CRUD, exporting runs, comparing results, and evaluation workflows using the ax CLI. Use when the user mentions create experiment, run experiment, compare models, model performance, evaluate AI, experiment results, benchmark, A/B test models, or measure accuracy. | `references/ax-profiles.md`
`references/ax-setup.md` | +| [arize-instrumentation](../skills/arize-instrumentation/SKILL.md)
`gh skills install github/awesome-copilot arize-instrumentation` | Adds Arize AX tracing to an LLM application for the first time. Follows a two-phase agent-assisted flow to analyze the codebase then implement instrumentation after user confirmation. Use when the user wants to instrument their app, add tracing from scratch, set up LLM observability, integrate OpenTelemetry or openinference, or get started with Arize tracing. | `references/ax-profiles.md` | +| [arize-link](../skills/arize-link/SKILL.md)
`gh skills install github/awesome-copilot arize-link` | Generates deep links to the Arize UI for traces, spans, sessions, datasets, labeling queues, evaluators, and annotation configs. Produces clickable URLs for sharing Arize resources with team members. Use when the user wants to link to or open a trace, span, session, dataset, evaluator, or annotation config in the Arize UI. | `references/EXAMPLES.md` | +| [arize-prompt-optimization](../skills/arize-prompt-optimization/SKILL.md)
`gh skills install github/awesome-copilot arize-prompt-optimization` | Optimizes, improves, and debugs LLM prompts using production trace data, evaluations, and annotations. Extracts prompts from spans, gathers performance signal, and runs a data-driven optimization loop using the ax CLI. Use when the user mentions optimize prompt, improve prompt, make AI respond better, improve output quality, prompt engineering, prompt tuning, or system prompt improvement. | `references/ax-profiles.md`
`references/ax-setup.md` | +| [arize-trace](../skills/arize-trace/SKILL.md)
`gh skills install github/awesome-copilot arize-trace` | Downloads, exports, and inspects existing Arize traces and spans to understand what an LLM app is doing or debug runtime issues. Covers exporting traces by ID, spans by ID, sessions by ID, and root-cause investigation using the ax CLI. Use when the user wants to look at existing trace data, see what their LLM app is doing, export traces, download spans, investigate errors, or analyze behavior regressions. | `references/ax-profiles.md`
`references/ax-setup.md` | | [aspire](../skills/aspire/SKILL.md)
`gh skills install github/awesome-copilot aspire` | Aspire skill covering the Aspire CLI, AppHost orchestration, service discovery, integrations, MCP server, VS Code extension, Dev Containers, GitHub Codespaces, templates, dashboard, and deployment. Use when the user asks to create, run, debug, configure, deploy, or troubleshoot an Aspire distributed application. | `references/architecture.md`
`references/cli-reference.md`
`references/dashboard.md`
`references/deployment.md`
`references/integrations-catalog.md`
`references/mcp-server.md`
`references/polyglot-apis.md`
`references/testing.md`
`references/troubleshooting.md` | | [aspnet-minimal-api-openapi](../skills/aspnet-minimal-api-openapi/SKILL.md)
`gh skills install github/awesome-copilot aspnet-minimal-api-openapi` | Create ASP.NET Minimal API endpoints with proper OpenAPI documentation | None | | [audit-integrity](../skills/audit-integrity/SKILL.md)
`gh skills install github/awesome-copilot audit-integrity` | Shared audit integrity framework for all AppSec agents — enforces output quality, intellectual honesty, and continuous improvement through anti-rationalization guards, self-critique loops, retry protocols, non-negotiable behaviors, self-reflection quality gates (1-10 scoring, ≥8 threshold), and a self-learning system with lesson/memory governance for security analysis agents. | `references/anti-rationalization-guard.md`
`references/clarification-protocol.md`
`references/non-negotiable-behaviors.md`
`references/retry-protocol.md`
`references/self-critique-loop.md`
`references/self-learning-system.md`
`references/self-reflection-quality-gate.md` | diff --git a/skills/arize-ai-provider-integration/SKILL.md b/skills/arize-ai-provider-integration/SKILL.md index dbf2fc169..806be8e59 100644 --- a/skills/arize-ai-provider-integration/SKILL.md +++ b/skills/arize-ai-provider-integration/SKILL.md @@ -1,6 +1,10 @@ --- name: arize-ai-provider-integration -description: "INVOKE THIS SKILL when creating, reading, updating, or deleting Arize AI integrations. Covers listing integrations, creating integrations for any supported LLM provider (OpenAI, Anthropic, Azure OpenAI, AWS Bedrock, Vertex AI, Gemini, NVIDIA NIM, custom), updating credentials or metadata, and deleting integrations using the ax CLI." +description: Creates, reads, updates, and deletes Arize AI integrations that store LLM provider credentials used by evaluators and other Arize features. Supports any LLM provider (e.g. OpenAI, Anthropic, Azure OpenAI, AWS Bedrock, Vertex AI, Gemini, NVIDIA NIM). Use when the user mentions AI integration, LLM provider credentials, create integration, list integrations, update credentials, delete integration, or connecting an LLM provider to Arize. +metadata: + author: arize + version: "1.0" +compatibility: Requires the ax CLI and a configured Arize profile. --- # Arize AI Integration Skill diff --git a/skills/arize-annotation/SKILL.md b/skills/arize-annotation/SKILL.md index 0e66ee462..3f69f32bc 100644 --- a/skills/arize-annotation/SKILL.md +++ b/skills/arize-annotation/SKILL.md @@ -1,6 +1,10 @@ --- name: arize-annotation -description: "INVOKE THIS SKILL when creating, managing, or using annotation configs or annotation queues on Arize (categorical, continuous, freeform), or applying human annotations to project spans via the Python SDK. Configs are the label schema for human feedback; queues are review workflows that route records to annotators. Triggers: annotation config, annotation queue, label schema, human feedback schema, bulk annotate spans, update_annotations, labeling queue, annotate record." +description: Creates and manages annotation configs (categorical, continuous, freeform label schemas) and annotation queues (human review workflows) on Arize. Applies human annotations to project spans via the Python SDK. Use when the user mentions annotation config, annotation queue, label schema, human feedback, bulk annotate spans, update_annotations, labeling queue, annotate record, or human review. +metadata: + author: arize + version: "1.0" +compatibility: Requires the ax CLI and a configured Arize profile. --- # Arize Annotation Skill diff --git a/skills/arize-dataset/SKILL.md b/skills/arize-dataset/SKILL.md index 76258eecd..4046570a8 100644 --- a/skills/arize-dataset/SKILL.md +++ b/skills/arize-dataset/SKILL.md @@ -1,6 +1,10 @@ --- name: arize-dataset -description: "INVOKE THIS SKILL when creating, managing, or querying Arize datasets and examples. Also use when the user needs test data or evaluation examples for their model. Covers dataset CRUD, appending examples, exporting data, and file-based dataset creation using the ax CLI." +description: Creates, manages, and queries Arize datasets and examples. Covers dataset CRUD, appending examples, exporting data, and file-based dataset creation using the ax CLI. Use when the user needs test data, evaluation examples, or mentions create dataset, list datasets, export dataset, append examples, dataset version, golden dataset, or test set. +metadata: + author: arize + version: "1.0" +compatibility: Requires the ax CLI and a configured Arize profile. 
--- # Arize Dataset Skill diff --git a/skills/arize-evaluator/SKILL.md b/skills/arize-evaluator/SKILL.md index 660e9bd62..1336030d8 100644 --- a/skills/arize-evaluator/SKILL.md +++ b/skills/arize-evaluator/SKILL.md @@ -1,6 +1,10 @@ --- name: arize-evaluator -description: "INVOKE THIS SKILL for LLM-as-judge evaluation workflows on Arize: creating/updating evaluators, running evaluations on spans or experiments, tasks, trigger-run, column mapping, and continuous monitoring. Use when the user says: create an evaluator, LLM judge, hallucination/faithfulness/correctness/relevance, run eval, score my spans or experiment, ax tasks, trigger-run, trigger eval, column mapping, continuous monitoring, query filter for evals, evaluator version, or improve an evaluator prompt." +description: Handles LLM-as-judge evaluation workflows on Arize including creating/updating evaluators, running evaluations on spans or experiments, managing tasks, trigger-run operations, column mapping, and continuous monitoring. Use when the user mentions create evaluator, LLM judge, hallucination, faithfulness, correctness, relevance, run eval, score spans, score experiment, trigger-run, column mapping, continuous monitoring, or improve evaluator prompt. +metadata: + author: arize + version: "1.0" +compatibility: Requires the ax CLI and a configured Arize profile with an AI integration. --- # Arize Evaluator Skill diff --git a/skills/arize-experiment/SKILL.md b/skills/arize-experiment/SKILL.md index 0d9c3320a..45759467a 100644 --- a/skills/arize-experiment/SKILL.md +++ b/skills/arize-experiment/SKILL.md @@ -1,6 +1,10 @@ --- name: arize-experiment -description: "INVOKE THIS SKILL when creating, running, or analyzing Arize experiments. Also use when the user wants to evaluate or measure model performance, compare models (including GPT-4, Claude, or others), or assess how well their AI is doing. Covers experiment CRUD, exporting runs, comparing results, and evaluation workflows using the ax CLI." +description: Creates, runs, and analyzes Arize experiments for evaluating and comparing model performance. Covers experiment CRUD, exporting runs, comparing results, and evaluation workflows using the ax CLI. Use when the user mentions create experiment, run experiment, compare models, model performance, evaluate AI, experiment results, benchmark, A/B test models, or measure accuracy. +metadata: + author: arize + version: "1.0" +compatibility: Requires the ax CLI and a configured Arize profile. --- # Arize Experiment Skill diff --git a/skills/arize-instrumentation/SKILL.md b/skills/arize-instrumentation/SKILL.md index 6da715d99..f1a16a54b 100644 --- a/skills/arize-instrumentation/SKILL.md +++ b/skills/arize-instrumentation/SKILL.md @@ -1,6 +1,10 @@ --- name: arize-instrumentation -description: "INVOKE THIS SKILL when adding Arize AX tracing or observability to an app for the first time, or when the user wants to instrument their LLM app or get started with LLM observability. Follow the Agent-Assisted Tracing two-phase flow: analyze the codebase (read-only), then implement after user confirmation. When the app uses LLM tool/function calling, add manual CHAIN + TOOL spans. Leverages https://arize.com/docs/ax/alyx/tracing-assistant and https://arize.com/docs/PROMPT.md." +description: Adds Arize AX tracing to an LLM application for the first time. Follows a two-phase agent-assisted flow to analyze the codebase then implement instrumentation after user confirmation. 
Use when the user wants to instrument their app, add tracing from scratch, set up LLM observability, integrate OpenTelemetry or openinference, or get started with Arize tracing. +metadata: + author: arize + version: "1.0" +compatibility: Python and TypeScript/JavaScript apps use openinference-instrumentation packages for auto-instrumentation. Java and Go apps use the OpenTelemetry SDK with manual OpenInference spans. See https://arize.com/docs/PROMPT.md for setup details. --- # Arize Instrumentation Skill @@ -46,6 +50,7 @@ Before changing code: - Python: `pyproject.toml`, `requirements.txt`, `setup.py`, `Pipfile` - TypeScript/JavaScript: `package.json` - Java: `pom.xml`, `build.gradle`, `build.gradle.kts` + - Go: `go.mod` 2. **Scan import statements** in source files to confirm what is actually used. @@ -57,8 +62,8 @@ Before changing code: | Item | Examples | |------|----------| -| Language | Python, TypeScript/JavaScript, Java | -| Package manager | pip/poetry/uv, npm/pnpm/yarn, maven/gradle | +| Language | Python, TypeScript/JavaScript, Java, Go | +| Package manager | pip/poetry/uv, npm/pnpm/yarn, maven/gradle, go modules | | LLM providers | OpenAI, Anthropic, LiteLLM, Bedrock, etc. | | Frameworks | LangChain, LangGraph, LlamaIndex, Vercel AI SDK, Mastra, etc. | | Existing tracing | Any OTel or vendor setup | @@ -86,8 +91,9 @@ The **canonical list** of supported integrations and doc URLs is in the [Agent S - **Python frameworks:** [LangChain](https://arize.com/docs/ax/integrations/python-agent-frameworks/langchain), [LangGraph](https://arize.com/docs/ax/integrations/python-agent-frameworks/langgraph), [LlamaIndex](https://arize.com/docs/ax/integrations/python-agent-frameworks/llamaindex), [CrewAI](https://arize.com/docs/ax/integrations/python-agent-frameworks/crewai), [DSPy](https://arize.com/docs/ax/integrations/python-agent-frameworks/dspy), [AutoGen](https://arize.com/docs/ax/integrations/python-agent-frameworks/autogen), [Semantic Kernel](https://arize.com/docs/ax/integrations/python-agent-frameworks/semantic-kernel), [Pydantic AI](https://arize.com/docs/ax/integrations/python-agent-frameworks/pydantic), [Haystack](https://arize.com/docs/ax/integrations/python-agent-frameworks/haystack), [Guardrails AI](https://arize.com/docs/ax/integrations/python-agent-frameworks/guardrails-ai), [Hugging Face Smolagents](https://arize.com/docs/ax/integrations/python-agent-frameworks/hugging-face-smolagents), [Instructor](https://arize.com/docs/ax/integrations/python-agent-frameworks/instructor), [Agno](https://arize.com/docs/ax/integrations/python-agent-frameworks/agno), [Google ADK](https://arize.com/docs/ax/integrations/python-agent-frameworks/google-adk), [MCP](https://arize.com/docs/ax/integrations/python-agent-frameworks/model-context-protocol), [Portkey](https://arize.com/docs/ax/integrations/python-agent-frameworks/portkey), [Together AI](https://arize.com/docs/ax/integrations/python-agent-frameworks/together-ai), [BeeAI](https://arize.com/docs/ax/integrations/python-agent-frameworks/beeai), [AWS Bedrock Agents](https://arize.com/docs/ax/integrations/python-agent-frameworks/aws). - **TypeScript/JavaScript:** [LangChain JS](https://arize.com/docs/ax/integrations/ts-js-agent-frameworks/langchain), [Mastra](https://arize.com/docs/ax/integrations/ts-js-agent-frameworks/mastra), [Vercel AI SDK](https://arize.com/docs/ax/integrations/ts-js-agent-frameworks/vercel), [BeeAI JS](https://arize.com/docs/ax/integrations/ts-js-agent-frameworks/beeai). 
- **Java:** [LangChain4j](https://arize.com/docs/ax/integrations/java/langchain4j), [Spring AI](https://arize.com/docs/ax/integrations/java/spring-ai), [Arconia](https://arize.com/docs/ax/integrations/java/arconia). +- **Go:** No first-party auto-instrumentation packages today — use the OpenTelemetry Go SDK with manual [OpenInference](https://github.com/Arize-ai/openinference) attributes per [Manual instrumentation](https://arize.com/docs/ax/instrument/manual-instrumentation). - **Platforms (UI-based):** [LangFlow](https://arize.com/docs/ax/integrations/platforms/langflow), [Flowise](https://arize.com/docs/ax/integrations/platforms/flowise), [Dify](https://arize.com/docs/ax/integrations/platforms/dify), [Prompt flow](https://arize.com/docs/ax/integrations/platforms/prompt-flow). -- **Fallback:** [Manual instrumentation](https://arize.com/docs/ax/observe/tracing/setup/manual-instrumentation), [All integrations](https://arize.com/docs/ax/integrations). +- **Fallback:** [Manual instrumentation](https://arize.com/docs/ax/instrument/manual-instrumentation), [All integrations](https://arize.com/docs/ax/integrations). **Fetch the matched doc pages** from the [full routing table in PROMPT.md](https://arize.com/docs/PROMPT.md) for exact installation and code snippets. Use [llms.txt](https://arize.com/docs/llms.txt) as a fallback for doc discovery if needed. @@ -104,13 +110,14 @@ Proceed **only after the user confirms** the Phase 1 analysis. - Python: `pip install arize-otel` plus `openinference-instrumentation-{name}` (hyphens in package name; underscores in import, e.g. `openinference.instrumentation.llama_index`). - TypeScript/JavaScript: `@opentelemetry/sdk-trace-node` plus the relevant `@arizeai/openinference-*` package. - Java: OpenTelemetry SDK plus `openinference-instrumentation-*` in pom.xml or build.gradle. + - Go: `go get go.opentelemetry.io/otel go.opentelemetry.io/otel/sdk go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp` — no auto-instrumentors yet, so the agent sets OpenInference attributes manually on spans. **Wire the exporter** with `otlptracehttp.WithEndpoint("otlp.arize.com")` (US) or `otlptracehttp.WithEndpoint("otlp.eu-west-1a.arize.com")` (EU) — pass the bare hostname, no `https://` scheme — and `otlptracehttp.WithHeaders(map[string]string{"space_id": ..., "api_key": ...})`. Recent OTel Go modules require Go ≥ 1.23 — `go mod tidy` may bump the toolchain. 3. **Credentials** — User needs an **Arize API Key** and **Space ID**. Check existing `ax` profiles for `ARIZE_API_KEY` and `ARIZE_SPACE` — never read `.env` files: - Run `ax profiles show` to check for an existing profile. - If no profile exists, guide the user to run `ax profiles create` which provides an **interactive wizard** that walks through API key and space setup. See [CLI profiles docs](https://arize.com/docs/api-clients/cli/profiles) for details. - If the user needs to find their API key manually, direct them to **https://app.arize.com** and to navigate to the settings page (do not use organization-specific URLs with placeholder IDs — they won't resolve for new users). - - If credentials are not set, instruct the user to set them as environment variables — never embed raw values in generated code. All generated instrumentation code must reference `os.environ["ARIZE_API_KEY"]` (Python) or `process.env.ARIZE_API_KEY` (TypeScript/JavaScript). + - If credentials are not set, instruct the user to set them as environment variables — never embed raw values in generated code. 
All generated instrumentation code must reference `os.environ["ARIZE_API_KEY"]` (Python), `process.env.ARIZE_API_KEY` (TypeScript/JavaScript), or `os.Getenv("ARIZE_API_KEY")` (Go). - See references/ax-profiles.md for full profile setup and troubleshooting. -4. **Centralized instrumentation** — Create a single module (e.g. `instrumentation.py`, `instrumentation.ts`) and initialize tracing **before** any LLM client is created. +4. **Centralized instrumentation** — Create a single module (e.g. `instrumentation.py`, `instrumentation.ts`, `instrumentation.go`) and initialize tracing **before** any LLM client is created. 5. **Existing OTel** — If there is already a TracerProvider, add Arize as an **additional** exporter (e.g. BatchSpanProcessor with Arize OTLP). Do not replace existing setup unless the user asks. ### Implementation rules @@ -119,8 +126,11 @@ Proceed **only after the user confirms** the Phase 1 analysis. - Prefer the repo's native integration surface before adding generic OpenTelemetry plumbing. If the framework ships an exporter or observability package, use that first unless there is a documented gap. - **Fail gracefully** if env vars are missing (warn, do not crash). - **Import order:** register tracer → attach instrumentors → then create LLM clients. -- **Project name attribute (required):** Arize rejects spans with HTTP 500 if the project name is missing — `service.name` alone is not accepted. Set it as a **resource attribute** on the TracerProvider (recommended — one place, applies to all spans): Python: `register(project_name="my-app")` handles it automatically (sets `"openinference.project.name"` on the resource); TypeScript: Arize accepts both `"model_id"` (shown in the official TS quickstart) and `"openinference.project.name"` via `SEMRESATTRS_PROJECT_NAME` from `@arizeai/openinference-semantic-conventions` (shown in the manual instrumentation docs) — both work. For routing spans to different projects in Python, use `set_routing_context(space_id=..., project_name=...)` from `arize.otel`. -- **CLI/script apps — flush before exit:** `provider.shutdown()` (TS) / `provider.force_flush()` then `provider.shutdown()` (Python) must be called before the process exits, otherwise async OTLP exports are dropped and no traces appear. +- **Project name attribute (required):** Arize rejects spans with HTTP 500 if the project name is missing — `service.name` alone is not accepted. Set it as a **resource attribute** on the TracerProvider (recommended — one place, applies to all spans): + - **Python:** `register(project_name="my-app")` handles it automatically (sets `"openinference.project.name"` on the resource). For routing spans to different projects, use `set_routing_context(space_id=..., project_name=...)` from `arize.otel`. + - **TypeScript:** Arize accepts both `"model_id"` (shown in the official TS quickstart) and `"openinference.project.name"` via `SEMRESATTRS_PROJECT_NAME` from `@arizeai/openinference-semantic-conventions` (shown in the manual instrumentation docs) — both work. + - **Go:** Pass `attribute.String("openinference.project.name", "my-app")` to `resource.New(...)` and apply via `sdktrace.WithResource(res)`. The Go SDK has no helper for this, so it must be set manually on every TracerProvider. +- **CLI/script apps — flush before exit:** `provider.shutdown()` (TS) / `provider.force_flush()` then `provider.shutdown()` (Python) / `tp.Shutdown(ctx)` (Go) must be called before the process exits, otherwise async OTLP exports are dropped and no traces appear. 
 - **When the app has tool/function execution:** add manual CHAIN + TOOL spans (see **Enriching traces** below) so the trace tree shows each tool call and its result — otherwise traces will look sparse (only LLM API spans, no tool input/output).
 
 ## Enriching traces: manual spans for tool use and agent loops
 
@@ -151,10 +161,26 @@ To avoid sparse traces where tool inputs/outputs are missing:
 
 | Attribute | Use |
 |-----------|-----|
-| `openinference.span.kind` | `"CHAIN"` or `"TOOL"` |
+| `openinference.span.kind` | Pick the right value: `"LLM"` for raw provider API calls (OpenAI, Anthropic, etc.); `"CHAIN"` for orchestration / agent-loop boundaries; `"TOOL"` for tool/function execution; `"RETRIEVER"` for vector-store / search lookups; `"EMBEDDING"` for embedding API calls; `"AGENT"` for an autonomous sub-agent run nested inside a larger chain; `"RERANKER"` for rerank API calls; `"GUARDRAIL"` for guardrail/policy checks; `"EVALUATOR"` for online eval calls. |
 | `input.value` | string (e.g. user message or JSON of tool args) |
 | `output.value` | string (e.g. final reply or JSON of tool result) |
+
+**LLM-span attributes (set these in addition to the three above when the span is an actual LLM call):**
+
+| Attribute | Use |
+|-----------|-----|
+| `llm.model_name` | model identifier (e.g. `"gpt-4o-mini"`) |
+| `llm.provider` / `llm.system` | provider name (e.g. `"openai"`, `"anthropic"`) |
+| `llm.input_messages.{i}.message.role` | `"system"` / `"user"` / `"assistant"` / `"tool"` for the i-th input message |
+| `llm.input_messages.{i}.message.content` | text content of the i-th input message |
+| `llm.output_messages.{i}.message.role` | role of the i-th output message |
+| `llm.output_messages.{i}.message.content` | text content of the i-th output message |
+| `llm.token_count.prompt` | int — prompt/input tokens |
+| `llm.token_count.completion` | int — completion/output tokens |
+| `llm.token_count.total` | int — total tokens |
+
+In Python and TypeScript these names are exposed via `openinference-semantic-conventions` packages; in Go they must be hand-typed as the strings above.
+
 **Python pattern:** Get the global tracer (same provider as Arize), then use context managers so tool spans are children of the CHAIN span and appear in the same trace as the LLM spans:
 
 ```python
@@ -177,7 +203,51 @@ with tracer.start_as_current_span("run_agent") as chain_span:
     chain_span.set_attribute("output.value", final_reply)
 ```
 
-See [Manual instrumentation](https://arize.com/docs/ax/observe/tracing/setup/manual-instrumentation) for more span kinds and attributes.
+**Go pattern:** Get a tracer from the global TracerProvider (registered via `otel.SetTracerProvider`), then nest spans with `tracer.Start` so tool spans become children of the CHAIN span.
+
+> **Critical for short-lived processes:** never call `log.Fatalf` / `os.Exit` after a span has started — they skip the deferred `tp.Shutdown(ctx)` and the in-flight CHAIN/LLM spans never flush. Use `log.Printf` + `return` from `main` instead, and keep `tp.Shutdown(ctx)` deferred at the top of `main`.
+
+```go
+import (
+    "context"
+    "encoding/json"
+
+    "go.opentelemetry.io/otel"
+    "go.opentelemetry.io/otel/attribute"
+)
+
+var tracer = otel.Tracer("my-app")
+
+func runAgent(ctx context.Context, userMessage string) string {
+    ctx, chainSpan := tracer.Start(ctx, "run_agent")
+    defer chainSpan.End()
+    chainSpan.SetAttributes(
+        attribute.String("openinference.span.kind", "CHAIN"),
+        attribute.String("input.value", userMessage),
+    )
+
+    // ... LLM call ...
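+    // Hedged sketch of the llm.* attributes from the table above on a child
+    // LLM span; the span name, model, role, and token count are illustrative
+    // placeholders, not values any SDK fills in for you.
+    _, llmSpan := tracer.Start(ctx, "chat_completion")
+    llmSpan.SetAttributes(
+        attribute.String("openinference.span.kind", "LLM"),
+        attribute.String("llm.model_name", "gpt-4o-mini"),
+        attribute.String("llm.provider", "openai"),
+        attribute.String("llm.input_messages.0.message.role", "user"),
+        attribute.String("llm.input_messages.0.message.content", userMessage),
+        attribute.Int("llm.token_count.total", 0), // replace 0 with the usage the provider reports
+    )
+    llmSpan.End()
+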
+    for _, toolUse := range toolUses {
+        _, toolSpan := tracer.Start(ctx, toolUse.Name) // child of the CHAIN span via ctx
+        argsJSON, err := json.Marshal(toolUse.Input)
+        if err != nil {
+            toolSpan.RecordError(err)
+        }
+        toolSpan.SetAttributes(
+            attribute.String("openinference.span.kind", "TOOL"),
+            attribute.String("input.value", string(argsJSON)),
+        )
+        result := runTool(toolUse.Name, toolUse.Input)
+        toolSpan.SetAttributes(attribute.String("output.value", result))
+        toolSpan.End()
+        // ... append tool result to messages, call LLM again ...
+    }
+
+    chainSpan.SetAttributes(attribute.String("output.value", finalReply))
+    return finalReply
+}
+```
+
+See [Manual instrumentation](https://arize.com/docs/ax/instrument/manual-instrumentation) for more span kinds and attributes.
 
 ## Verification
 
@@ -192,7 +262,7 @@ After implementation:
 
 1. Run the application and trigger at least one LLM call.
 2. **Use the `arize-trace` skill** to confirm traces arrived. If empty, retry shortly. Verify spans have expected `openinference.span.kind`, `input.value`/`output.value`, and parent-child relationships.
-3. If no traces: verify `ARIZE_SPACE` and `ARIZE_API_KEY`, ensure tracer is initialized before instrumentors and clients, check connectivity to `otlp.arize.com:443`, and inspect app/runtime exporter logs so you can tell whether spans are being emitted locally but rejected remotely. For debug set `GRPC_VERBOSITY=debug` or pass `log_to_console=True` to `register()`. Common gotchas: (a) missing project name resource attribute causes HTTP 500 rejections — `service.name` alone is not enough; Python: pass `project_name` to `register()`; TypeScript: set `"model_id"` or `SEMRESATTRS_PROJECT_NAME` on the resource; (b) CLI/script processes exit before OTLP exports flush — call `provider.force_flush()` then `provider.shutdown()` before exit; (c) CLI-visible spaces/projects can disagree with a collector-targeted space ID — report the mismatch instead of silently rewriting credentials.
+3. If no traces: verify `ARIZE_SPACE` and `ARIZE_API_KEY`, ensure tracer is initialized before instrumentors and clients, check connectivity to `otlp.arize.com:443`, and inspect app/runtime exporter logs so you can tell whether spans are being emitted locally but rejected remotely. For debugging, set `GRPC_VERBOSITY=debug` or pass `log_to_console=True` to `register()`. Common gotchas: (a) missing project name resource attribute causes HTTP 500 rejections — `service.name` alone is not enough; Python: pass `project_name` to `register()`; TypeScript: set `"model_id"` or `SEMRESATTRS_PROJECT_NAME` on the resource; Go: add `attribute.String("openinference.project.name", "my-app")` to `resource.New(...)`; (b) CLI/script processes exit before OTLP exports flush — call `provider.force_flush()` then `provider.shutdown()` (Python/TS) or `tp.Shutdown(ctx)` (Go) before exit; (c) CLI-visible spaces/projects can disagree with a collector-targeted space ID — report the mismatch instead of silently rewriting credentials.
 4. If the app uses tools: confirm CHAIN and TOOL spans appear with `input.value` / `output.value` so tool calls and results are visible.
 
 When verification is blocked by CLI or account issues, end with a concrete status:

diff --git a/skills/arize-link/SKILL.md b/skills/arize-link/SKILL.md
index fbb3a339d..44d9f470a 100644
--- a/skills/arize-link/SKILL.md
+++ b/skills/arize-link/SKILL.md
@@ -1,6 +1,9 @@
 ---
 name: arize-link
-description: Generate deep links to the Arize UI. Use when the user wants a clickable URL to open or share a specific trace, span, session, dataset, labeling queue, evaluator, or annotation config, or when sharing Arize resources with team members.
+description: Generates deep links to the Arize UI for traces, spans, sessions, datasets, labeling queues, evaluators, and annotation configs. Produces clickable URLs for sharing Arize resources with team members. Use when the user wants to link to or open a trace, span, session, dataset, evaluator, or annotation config in the Arize UI.
+metadata:
+  author: arize
+  version: "1.0"
 ---
 
 # Arize Link

diff --git a/skills/arize-prompt-optimization/SKILL.md b/skills/arize-prompt-optimization/SKILL.md
index 968255da1..12b381467 100644
--- a/skills/arize-prompt-optimization/SKILL.md
+++ b/skills/arize-prompt-optimization/SKILL.md
@@ -1,6 +1,10 @@
 ---
 name: arize-prompt-optimization
-description: "INVOKE THIS SKILL when optimizing, improving, or debugging LLM prompts using production trace data, evaluations, and annotations. Also use when the user wants to make their AI respond better or improve AI output quality. Covers extracting prompts from spans, gathering performance signal, and running a data-driven optimization loop using the ax CLI."
+description: Optimizes, improves, and debugs LLM prompts using production trace data, evaluations, and annotations. Extracts prompts from spans, gathers performance signal, and runs a data-driven optimization loop using the ax CLI. Use when the user mentions optimize prompt, improve prompt, make AI respond better, improve output quality, prompt engineering, prompt tuning, or system prompt improvement.
+metadata:
+  author: arize
+  version: "1.0"
+compatibility: Requires the ax CLI and a configured Arize profile.
 ---
 
 # Arize Prompt Optimization Skill

diff --git a/skills/arize-trace/SKILL.md b/skills/arize-trace/SKILL.md
index 28420f2f3..5b76c5821 100644
--- a/skills/arize-trace/SKILL.md
+++ b/skills/arize-trace/SKILL.md
@@ -1,6 +1,10 @@
 ---
 name: arize-trace
-description: "INVOKE THIS SKILL when downloading, exporting, or inspecting Arize traces and spans, or when a user wants to look at what their LLM app is doing using existing trace data, or when an already-instrumented app has a bug or error to investigate. Use for debugging unknown runtime issues, failures, and behavior regressions. Covers exporting traces by ID, spans by ID, sessions by ID, and root-cause investigation with the ax CLI."
+description: Downloads, exports, and inspects existing Arize traces and spans to understand what an LLM app is doing or debug runtime issues. Covers exporting traces by ID, spans by ID, sessions by ID, and root-cause investigation using the ax CLI. Use when the user wants to look at existing trace data, see what their LLM app is doing, export traces, download spans, investigate errors, or analyze behavior regressions.
+metadata:
+  author: arize
+  version: "1.0"
+compatibility: Requires the ax CLI and a configured Arize profile.
 ---
 
 # Arize Trace Skill

diff --git a/skills/phoenix-cli/SKILL.md b/skills/phoenix-cli/SKILL.md
index 3134e9175..bb7d71d9e 100644
--- a/skills/phoenix-cli/SKILL.md
+++ b/skills/phoenix-cli/SKILL.md
@@ -24,18 +24,28 @@
 px trace list
 px trace get
 px trace annotate
 px trace add-note
+px trace-annotations delete
 px span list
 px span annotate
 px span add-note
+px span-annotations delete
 px session list
 px session get
 px session annotate
 px session add-note
+px session-annotations delete
 px dataset list
 px dataset get
 px project list
+px project get
 px annotation-config list
 px auth status
+px profile list
+px profile show [name]
+px profile create
+px profile use
+px profile edit
+px profile delete
 ```
 
 ## Setup
 
@@ -52,9 +62,13 @@ Always use `--format raw --no-progress` when piping to `jq`.
 
 | Task | Files |
 | ---- | ----- |
-| Look at sampled traces and write specific notes about what went wrong (no taxonomy yet) | [references/open-coding](references/open-coding.md) |
+| Look at sampled traces, spans, or sessions and write specific notes about what went wrong (no taxonomy yet) | [references/open-coding](references/open-coding.md) |
 | Group those notes into a structured failure taxonomy and quantify what matters | [references/axial-coding](references/axial-coding.md) |
 
+Both stages tag every artifact with one shared **coding annotation identifier** (descriptive shape, e.g. `coding-run:chatbot-context-loss-2026-05-06`) so the run is queryable, reversible, and viewable as a unit. Pass `--identifier <value>` explicitly on every `px` call — shell inheritance is unreliable across agent harnesses. Open coding writes notes via `px ... add-note` and records a small local JSONL sidecar at `.px/coding/<slug>.jsonl`; axial coding reads that sidecar as the deterministic handoff and records labels in `.px/coding/<slug>-axial.jsonl`. Pick the identifier once per run (see [references/open-coding.md](references/open-coding.md#coding-annotation-identifier-pick-this-first)), then share the Phoenix UI link from the wrap-up section. Revert is opt-in and runs three identifier-bound DELETEs only after explicit user confirmation.
+
+> **Workflow term vs. server annotation name.** The skill prose calls this value the **coding annotation identifier** (shell-variable hint: `CODING_ANNOTATION_IDENTIFIER`). The server-side annotation NAME used for the UI filter is unchanged — `coding_session_id` — for data compatibility with rows already written by previous runs. Don't try to rename the server-side annotation; treat the asymmetry as load-bearing.
+
 ## Workflows
 
 **"What do I do after instrumenting?" / "Where do I focus?" / "What's going wrong?"**
 
@@ -64,7 +78,7 @@
 
 | Prefix | Description |
 | ------ | ----------- |
-| `references/open-coding` | Free-form notes against sampled traces — reach for it whenever the user wants to make sense of traces but has no failure categories yet |
+| `references/open-coding` | Free-form notes against sampled traces, spans, or sessions — reach for it whenever the user wants to make sense of LLM traffic but has no failure categories yet. Includes a unit-of-analysis diagnostic so the workflow runs at the level the failure modes actually live at (trace for stateless single-shot calls, session for multi-turn agents, span for mechanical/in-isolation failures). |
 | `references/axial-coding` | Inductive grouping of notes into a MECE taxonomy with counts — reach for it whenever the user has observations and needs categories or eval targets |
 
 ## Auth
 
@@ -72,15 +86,46 @@
 ```bash
 px auth status                               # check connection and authentication
 px auth status --endpoint http://other:6006  # check a specific endpoint
+px auth status --profile staging             # check a named profile's connection
 ```
+
+## Profiles
+
+Named profiles let you switch between multiple Phoenix instances (local, staging, cloud) without juggling environment variables. Profiles are stored in `~/.px/settings.json` (or `$XDG_CONFIG_HOME/px/settings.json`).
+
+Configuration priority (highest to lowest): CLI flags > env vars > active profile > built-in defaults.
+
+```bash
+px profile list               # list all profiles (shows active profile)
+px profile show               # show the active profile's settings
+px profile show staging      # show a named profile's settings
+px profile create prod --endpoint https://app.phoenix.arize.com --api-key <api-key> --activate
+px profile create local --endpoint http://localhost:6006 --project my-app
+px profile use prod           # switch the active profile
+px profile edit prod          # open profile JSON in $EDITOR (validates on save)
+px profile delete prod --yes  # delete a profile (--yes skips confirmation)
+```
+
+Use `--profile <name>` on any command to target a specific profile without changing the active one:
+
+```bash
+px trace list --profile staging --limit 10 --format raw --no-progress | jq .
+px auth status --profile prod
+```
+
+`px profile create` options: `--endpoint <url>`, `--project <name>`, `--api-key <key>`, `--header <name:value>` (repeatable), `--activate`.
+
 ## Projects
 
 ```bash
 px project list                                              # list all projects (table view)
 px project list --format raw --no-progress | jq '.[].name'  # project names as JSON
+px project get my-project --format raw --no-progress                 # single record by exact name
+px project get my-project --format raw --no-progress | jq -r '.id'  # extract project id
 ```
 
+`project get` exits with `ExitCode.FAILURE` (1) on a name miss and writes a `StructuredError` `{error, code: "FAILURE", hint}` to stderr in `--format json|raw`.
+
 ## Traces
 
 ```bash
@@ -94,9 +139,14 @@ px trace get <trace-id> --format raw | jq '.spans[] | select(.status_code != "OK
 px trace get <trace-id> --include-notes --format raw | jq '.notes'
 px trace annotate <trace-id> --name reviewer --label pass
 px trace annotate <trace-id> --name reviewer --score 0.9 --format raw --no-progress
+px trace annotate <trace-id> --name reviewer --label pass --identifier "<coding-annotation-identifier>"  # tag with a coding annotation identifier
 px trace add-note <trace-id> --text "needs follow-up"
+px trace add-note <trace-id> --text "needs follow-up" --identifier "<coding-annotation-identifier>"      # tag + upsert on identifier
+px trace-annotations delete --identifier "<coding-annotation-identifier>" --all -y                       # nuke every annotation tied to this coding annotation identifier
 ```
 
+`px <entity>-annotations delete` requires `--all` or both `--start-time` and `--end-time` and emits `{deleted: true, target, filter}` on success.
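+
+A hedged sketch of the time-bounded form (the flags come from the sentence above; the ISO 8601 timestamp format is an assumption, so confirm with `px trace-annotations delete --help`):
+
+```bash
+# Scoped revert: delete only this identifier's annotations from one review window.
+px trace-annotations delete --identifier "<coding-annotation-identifier>" \
+  --start-time "2026-05-06T00:00:00Z" --end-time "2026-05-07T00:00:00Z" -y
+```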
+
 ### Trace JSON shape
 
 ```
@@ -147,7 +197,10 @@ px span list output.json --limit 100   # save to JSON file
 px span list --format raw --no-progress | jq '.[] | select(.status_code == "ERROR")'
 px span annotate <span-id> --name reviewer --label pass
 px span annotate <span-id> --name checker --score 1 --annotator-kind CODE
+px span annotate <span-id> --name reviewer --label pass --identifier "<coding-annotation-identifier>"  # tag with a coding annotation identifier
 px span add-note <span-id> --text "verified by agent"
+px span add-note <span-id> --text "verified by agent" --identifier "<coding-annotation-identifier>"    # tag + upsert on identifier
+px span-annotations delete --identifier "<coding-annotation-identifier>" --all -y                      # nuke every annotation tied to this coding annotation identifier
 ```
 
 ### Span JSON shape
 
@@ -183,7 +236,10 @@ px session get <session-id> --include-annotations --format raw | jq '.session.an
 px session get <session-id> --include-notes --format raw | jq '.session.notes'
 px session annotate <session-id> --name reviewer --label pass
 px session annotate <session-id> --name reviewer --score 0.9 --format raw --no-progress
+px session annotate <session-id> --name reviewer --label pass --identifier "<coding-annotation-identifier>"  # tag with a coding annotation identifier
 px session add-note <session-id> --text "verified by agent"
+px session add-note <session-id> --text "verified by agent" --identifier "<coding-annotation-identifier>"    # tag + upsert on identifier
+px session-annotations delete --identifier "<coding-annotation-identifier>" --all -y                         # nuke every annotation tied to this coding annotation identifier
 ```
 
 ### Session JSON shape
 
 ```
 SessionData
   id, session_id, project_id
   start_time, end_time
+  token_count_prompt, token_count_completion, token_count_total — cumulative across all LLM spans in the session (int, default 0)
   annotations[] (with --include-annotations, excludes note)
     name, result { score, label, explanation }
   notes[] (with --include-notes)

diff --git a/skills/phoenix-cli/references/axial-coding.md b/skills/phoenix-cli/references/axial-coding.md
index b0b961964..b37804877 100644
--- a/skills/phoenix-cli/references/axial-coding.md
+++ b/skills/phoenix-cli/references/axial-coding.md
@@ -4,17 +4,40 @@ Group open-ended observations into structured failure taxonomies. Axial coding t
 
 **Reach for this whenever** the user has observations and needs structure — e.g., "what categories of failures do we have", "what should I build evals for", "how do I prioritize fixes", "group these notes", "MECE breakdown", or any framing that asks for categories or counts grounded in real traces rather than invented top-down.
 
+## Coding annotation identifier (reuse the open-coding value)
+
+Reuse the **coding annotation identifier** chosen in open coding — every `annotate` call below passes `--identifier "$CODING_ANNOTATION_IDENTIFIER"` explicitly. In a fresh shell or fresh agent invocation, set `CODING_ANNOTATION_IDENTIFIER` to the same value (recoverable from the wrap-up UI URL or by listing `.px/coding/*.jsonl`); don't mint a new id. See [open-coding.md#coding-annotation-identifier-pick-this-first](open-coding.md#coding-annotation-identifier-pick-this-first) for the rationale and the sanitization rule.
+
+> **Workflow term vs. server annotation name.** The skill calls this value the **coding annotation identifier**; the server annotation NAME used for the UI filter stays `coding_session_id` for data compatibility. Don't try to rename the server-side key.
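+
+A hedged sketch of how that asymmetry looks on the wire; whether the identifier also travels in `--label` is an assumption, so mirror whatever open coding's wrap-up step actually wrote:
+
+```bash
+# NAME stays the legacy "coding_session_id"; the VALUE is this run's identifier.
+px trace annotate <trace-id> --name coding_session_id \
+  --label "coding-run:chatbot-context-loss-2026-05-06" \
+  --identifier "coding-run:chatbot-context-loss-2026-05-06"
+```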
+
+```bash
+CODING_ANNOTATION_IDENTIFIER="coding-run:chatbot-context-loss-2026-05-06"
+SLUG=$(echo -n "$CODING_ANNOTATION_IDENTIFIER" | sed 's/[^a-zA-Z0-9_-]/-/g')
+NOTES_SIDECAR=".px/coding/${SLUG}.jsonl"
+AXIAL_SIDECAR=".px/coding/${SLUG}-axial.jsonl"
+```
+
 ## Choosing the unit
 
-Open-coding notes are usually **trace-level** (see [open-coding.md#choosing-the-unit](open-coding.md#choosing-the-unit)) — examples below lead with `px trace` and fall back to `px span` for span-level notes. **An axial label can live at a different level than the note that informed it** — that's a feature: a trace-level note "answered shipping when asked returns" can produce a span-level annotation on the retrieval span once a pattern reveals retrieval as the consistent culprit. Re-attribution at axial coding time is what axial coding *is*. Session-level rollups go through REST `/v1/projects/{id}/session_annotations` (no CLI write path).
+Open coding's diagnostic in [open-coding.md#choosing-the-unit-of-analysis](open-coding.md#choosing-the-unit-of-analysis) commits to a unit (trace, span, or session). Axial coding inherits that unit by default — if open coding ran at the session level, axial labels will too; same for trace and span.
+
+**An axial label can live at a different level than the note that informed it** — that's a feature, and it works in every direction:
+
+- *Trace → span*: a trace-level note "answered shipping when asked about returns" can produce a span-level annotation on the retrieval span once a pattern reveals retrieval as the consistent culprit.
+- *Trace → session*: a batch of trace-level notes describing single-turn confusion can produce a session-level annotation once you see the pattern is "the agent doesn't track the user's stated context across turns."
+- *Session → trace*: a session-level note about cross-turn drift may, on closer reading, attribute to one specific turn where the agent dropped the thread; a trace-level annotation can name that turn.
+
+Whichever level you write the axial label on, write the matching `coding_session_id` UI-filter annotation on the same entity (see [UI-filter annotation](#ui-filter-annotation) below) so the UI link picks it up.
 
 ## Process
 
-1. **Gather** — Collect open-coding notes from the entities you reviewed (trace-level by default)
-2. **Pattern** — Group notes with common themes
-3. **Name** — Create actionable category names
-4. **Attribute** — Decide what level each category lives at; an axial label can move from the note's level to the component the pattern implicates
-5. **Quantify** — Count failures per category
+1. **Set the coding annotation identifier** — set `CODING_ANNOTATION_IDENTIFIER` to the value used in open coding and re-derive `SLUG`, `NOTES_SIDECAR`, `AXIAL_SIDECAR` (see [Coding annotation identifier](#coding-annotation-identifier-reuse-the-open-coding-value))
+2. **Gather** — read open-coding notes from `$NOTES_SIDECAR` (at the unit committed in open coding); no server round-trip
+3. **Pattern** — group notes with common themes
+4. **Name** — create actionable category names
+5. **Attribute** — decide what level each category lives at; an axial label can move up (trace → session) or down (trace → span) from the source note's level to the level the pattern actually implicates
+6. **Record** — `px {trace,span,session} annotate ... --name axial_coding_category --label <category> --identifier "$CODING_ANNOTATION_IDENTIFIER"`, add/update one JSONL sidecar row for the label, then write the matching `coding_session_id` UI-filter annotation
+7. **Quantify** — count failures per category from `$AXIAL_SIDECAR`
 
 ## Example Taxonomy
 
@@ -39,97 +62,110 @@ failure_taxonomy:
 
 ## Reading
 
-### 1. Gather — extract open-coding notes
+### 1. Gather — read this run's open-coding notes from the sidecar
 
-Open-coding notes are stored as annotations with `name="note"` and are only returned when `--include-notes` is passed. Use `--include-annotations` instead and you will get structured annotations but **not** notes — the server excludes notes from the annotations array.
+Open-coding wrote one JSONL line per note to `$NOTES_SIDECAR` (`.px/coding/${SLUG}.jsonl`). Read it directly — no server round-trip is needed. Each line has `entity_kind`, `entity_id`, `note`, `identifier`, and `ts`. If the same `(entity_kind, entity_id)` appears more than once, use the newest `ts` as the current note.
 
-```bash
-# Trace-level notes (default for open coding)
-px trace list --include-notes --format raw --no-progress | jq '
-  [ .[] | select((.notes // []) | length > 0) ]
-  | map({ trace_id: .traceId, notes: [ .notes[].result.explanation ] })
-'
+**Missing-file behavior.** An absent `$NOTES_SIDECAR` means open coding hasn't run for this coding annotation identifier in this CWD — stop and run open coding first, do not silently treat it as zero notes.
 
-# Span-level notes (when open coding dropped to span for mechanical failures)
-px span list --include-notes --format raw --no-progress | jq '
-  [ .[] | select((.notes // []) | length > 0) ]
-  | map({ span_id: .context.span_id, notes: [ .notes[].result.explanation ] })
-'
-```
+**Malformed lines.** Each line is independently parseable JSON. If `jq` reports a parse error, fix or drop that line manually; do not edit other lines.
+
+**Notes outside this run.** The sidecar only carries notes this CWD wrote. To pull notes another reviewer or earlier run wrote, fetch them via `px {trace,span,session} list --include-notes` (embeds notes into row output) — the workflow's sidecar is intentionally per-CWD-per-coding-identifier.
 
 ### 2. Group — synthesize categories
 
 Review the note text collected above. Manually identify recurring themes and draft candidate category names. Aim for MECE coverage: each note should fit exactly one category.
 
-### 3. Record — write axial-coding annotations
+### 3. Record — write axial-coding labels
 
-Write one annotation per entity using `px trace annotate` or `px span annotate`. The level can differ from where the source note lives — see the **Recording** section below.
+Write one annotation per entity using `px {trace,span,session} annotate`, passing `--identifier "$CODING_ANNOTATION_IDENTIFIER"` explicitly on every call, and record one JSONL row in `$AXIAL_SIDECAR` so [Quantify](#4-quantify--count-per-category-from-the-axial-sidecar) below can count without a server round-trip. The level can differ from where the source note lives — see [Recording](#recording) below.
 
-### 4. Quantify — count per category
+### 4. Quantify — count per category from the axial sidecar
 
-After recording, use `--include-annotations` to count how many entities carry each label. Examples below show span-level counts; for trace-level annotations, swap `px span list` for `px trace list` (the `.annotations[]` shape is the same).
+Counts come from `$AXIAL_SIDECAR` (populated by [Record](#3-record--write-axial-coding-labels)). No server query, no project-wide history mixed in — the sidecar holds exactly the labels this run wrote. Count the current rows by `axial_label`; if an entity appears more than once, use the newest `ts`.
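+
+A hedged sketch of that counting step (`jq -s` slurps the JSONL into an array; field names follow the sidecar line shape documented under Recording):
+
+```bash
+jq -s '
+  group_by(.entity_kind + ":" + .entity_id)    # one bucket per annotated entity
+  | map(max_by(.ts))                           # keep only the newest row per entity
+  | group_by(.axial_label)
+  | map({label: .[0].axial_label, count: length})
+  | sort_by(-.count)
+' "$AXIAL_SIDECAR"
+```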
-```bash
-px span list --include-annotations --format raw --no-progress | jq '
-  [ .[] | .annotations[]? | select(.name == "failure_category" and .result.label != null) ]
-  | group_by(.result.label)
-  | map({ label: .[0].result.label, count: length })
-  | sort_by(-.count)
-'
-```
+Same missing-file and malformed-line rules as `$NOTES_SIDECAR`: a missing axial sidecar means no labels have been written yet (run [Record](#3-record--write-axial-coding-labels)); malformed lines are line-local — fix or drop, don't edit neighbors.
 
-Filter to a specific annotation name to check coverage:
-
-```bash
-px span list --include-annotations --format raw --no-progress | jq '
-  [ .[] | select((.annotations // []) | any(.name == "failure_category")) ]
-  | length
-'
-```
 
 ## Recording
 
+Use the matching annotate command for the level the **label** belongs at — which may differ from where the source note lives (see [Choosing the unit](#choosing-the-unit)). Every call carries `--identifier "$CODING_ANNOTATION_IDENTIFIER"` and `--format raw --no-progress`, and is paired with a JSONL row in `$AXIAL_SIDECAR`.
+
+**Axial sidecar JSONL line shape (one per `annotate`):**
+
+```json
+{"entity_kind":"trace","entity_id":"<id>","annotation_name":"axial_coding_category","axial_label":"