feat: emit agent spans in RealtimeSession for SDK tracing#3573
feat: emit agent spans in RealtimeSession for SDK tracing#3573jordanchendev wants to merge 8 commits into
Conversation
RealtimeSession.__aenter__ now creates and starts an agent_span for the starting agent, and finishes it when the session closes. On handoff, the outgoing agent's span is finished and a new span is started for the incoming agent, matching the tracing behaviour of the regular Runner. The span records the agent name, tool names, and handoff target names. Passing tracing_disabled=True in run_config suppresses all spans. Fixes openai#1845
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a2d639b6d2
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| self._current_agent_span = self._make_agent_span(self._current_agent) | ||
| self._current_agent_span.start(mark_as_current=True) |
There was a problem hiding this comment.
Avoid resetting handoff span tokens across tasks
When a handoff function call is handled with the default async_tool_calls=True, this code runs inside the background task created by _enqueue_tool_call_task, so start(mark_as_current=True) stores a contextvars token for that task. Later close()/__aexit__ calls _cleanup() in the caller's context and finish(reset_current=True) on this same span, which raises ValueError: Token ... was created in a different Context and skips the rest of cleanup after any handoff.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in commit ed34935. _cleanup now uses reset_current=False, and the new agent span started during a handoff uses mark_as_current=False, avoiding cross-context token errors when async_tool_calls=True.
| """Create a new agent span for the given agent, respecting tracing_disabled.""" | ||
| disabled: bool = bool(self._run_config.get("tracing_disabled", False)) | ||
| handoff_names = [h.agent_name if isinstance(h, Handoff) else h.name for h in agent.handoffs] | ||
| tool_names = [t.name for t in agent.tools if isinstance(t, FunctionTool)] |
There was a problem hiding this comment.
Derive span tool names from resolved tools
For agents that use is_enabled on function tools or tools supplied through MCP, this reads only agent.tools instead of the resolved list from agent.get_all_tools(...) that is actually sent to the model. In those supported configurations, disabled function tools can be reported in the span and MCP/enabled dynamic tools can be omitted, making the new tracing metadata inaccurate.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in commit ed34935. _make_agent_span no longer reads agent.tools; instead, both aenter and the handoff handler call agent.get_all_tools() after obtaining model settings and write the resolved names to span_data.tools via get_tool_trace_name_for_tool(), matching the pattern used by Runner.
Two issues raised in automated review of the initial commit: P1 (crash): When async_tool_calls=True (the default), handoffs execute inside asyncio.create_task(). start(mark_as_current=True) stores a contextvars token in the background task's context. _cleanup() runs in the main task and cannot reset that token; finish(reset_current=True) raises ValueError: Token was created in a different Context. Fix: _cleanup uses reset_current=False (session is ending; no need to restore the previous current span). The new agent span started during a handoff uses mark_as_current=False for the same reason. P2 (inaccurate metadata): _make_agent_span read agent.tools (static list) instead of the resolved list from get_all_tools(), so disabled function tools appeared in the span and MCP tools were omitted. Fix: tool names are no longer set in _make_agent_span; both __aenter__ and the handoff handler call agent.get_all_tools() after obtaining model settings and write the resolved names to span_data.tools via get_tool_trace_name_for_tool(), matching the pattern used by Runner.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ed34935eea
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if self._current_agent_span is not None: | ||
| self._current_agent_span.finish(reset_current=False) |
There was a problem hiding this comment.
Reset the initial agent span on cleanup
Fresh evidence is that the initial session span is still started with mark_as_current=True, while cleanup now always calls finish(reset_current=False). After any successfully entered realtime session closes without a handoff, the ended agent span remains in the current context; later spans in the same task and trace, or in a subsequent trace started in that task, will be parented under this stale span (even across trace IDs) instead of the active trace/root. The cross-task handoff case needs care, but the initial span token created by __aenter__ should still be reset when cleanup runs in that same context.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in commit ceb50e0. aenter now calls span.start(mark_as_current=False) then Scope.set_current_span() explicitly, storing the context-var token in self._initial_span_token. _cleanup (always in the same task as aenter via aexit) calls Scope.reset_current_span() on that token after finishing the current span, so the stale span is no longer current in the context regardless of whether a handoff occurred.
| # Start an agent span for the initial agent. | ||
| self._current_agent_span = self._make_agent_span(self._current_agent) | ||
| self._current_agent_span.start(mark_as_current=True) |
There was a problem hiding this comment.
Close the span if session entry fails
When any later step in __aenter__ fails after this span is started, such as resolving instructions/tools or model.connect(), Python does not call __aexit__ for the context manager. In that failure path the new span is left unfinished and current in the task context, so subsequent tracing can inherit a dangling parent and the failed attempt is never exported as an ended span. Wrap the fallible setup after span creation in a cleanup path or defer starting the span until entry can complete.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in commit ceb50e0. The fallible setup in aenter (get_updated_model_settings, get_all_tools, model.connect) is now wrapped in try/except BaseException. If any step raises, the handler calls span.finish(reset_current=False) and Scope.reset_current_span() before re-raising, so the span is always finished and the context-var is always reset even when aexit is never called.
Two issues fixed in RealtimeSession tracing: P1 – After closing a session without a handoff, the agent span started in __aenter__ with mark_as_current=True left its context-var token unreset because _cleanup always used reset_current=False. Subsequent spans in the same task were incorrectly parented under the stale (finished) span. Fix: manage the context-var token explicitly via Scope.set_current_span() in __aenter__ and store it in self._initial_span_token. _cleanup now calls Scope.reset_current_span() on that token, which is always safe because _cleanup runs via __aexit__ in the same task as __aenter__. The current span (which may be a handoff span started with mark_as_current=False) is still finished with reset_current=False to avoid cross-context token errors. P2 – If __aenter__ raised an exception after span start (e.g. model.connect() failed), __aexit__ was not called, leaving the span unfinished and current in the context. Fix: wrap the fallible setup section in a try/except that finishes the span and resets the token before re-raising. Fixes openai#1845
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ceb50e07e9
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| # already called finish() on this span from a background task. | ||
| self._current_agent_span = self._make_agent_span(self._current_agent) | ||
| self._current_agent_span.start(mark_as_current=False) | ||
| self._initial_span_token = Scope.set_current_span(self._current_agent_span) |
There was a problem hiding this comment.
Avoid installing no-op spans as current
When the session is entered without an already-active trace (or with tracing_disabled=True), _make_agent_span() returns a no-op span, but this still stores it in the current span context for the whole session. If user code later starts a real trace() inside the session and creates any spans, the tracing provider treats the current no-op span as a no-op parent and drops those spans until _cleanup() resets the token. Only set the current span when a real agent span was created, or otherwise avoid poisoning the ambient tracing context.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in commit b61f80f. aenter now checks isinstance(span, NoOpSpan) before calling Scope.set_current_span(). When no active trace exists (or tracing_disabled=True), _make_agent_span() returns a NoOpSpan and we skip installing it as current, so the ambient context is unchanged for the duration of the session.
| self._current_agent_span = self._make_agent_span(self._current_agent) | ||
| self._current_agent_span.start(mark_as_current=False) |
There was a problem hiding this comment.
Parent handoff spans to the active trace, not the old agent
After a handoff, the outgoing agent span is finished with reset_current=False, so the task's current span is still the outgoing agent when this new span is created. Because _make_agent_span() relies on the ambient current span, every incoming agent span is recorded as a child of the ended previous agent rather than as the next agent span under the trace/task like the regular runner does after handoffs, which makes multi-agent realtime traces misleading.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in commit b61f80f. Before calling _make_agent_span() for the incoming agent, the handoff handler now temporarily sets the current span to None via Scope.set_current_span(None) and immediately restores it. This ensures provider.create_span() sees no current span and assigns parent_id=None, making the new agent span a sibling under the trace root rather than a child of the finished outgoing span.
| asynchronously via get_all_tools() to include MCP tools and respect is_enabled. | ||
| """ | ||
| disabled: bool = bool(self._run_config.get("tracing_disabled", False)) | ||
| handoff_names = [h.agent_name if isinstance(h, Handoff) else h.name for h in agent.handoffs] |
There was a problem hiding this comment.
Filter handoffs before recording span metadata
This records handoff names directly from agent.handoffs, but realtime only exposes handoffs after _get_handoffs() applies each handoff's is_enabled predicate. For agents with disabled or context-gated handoffs, the trace will report handoff tools that were never sent to the model, so the new span metadata is inaccurate in exactly the configurations where handoff availability is dynamic.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in commit b61f80f. _make_agent_span() no longer reads agent.handoffs. Both aenter and the handoff handler now call _get_handoffs(agent, context_wrapper) (which applies each Handoff.is_enabled predicate via asyncio.gather) and write the filtered result to span_data.handoffs, matching the pattern Runner uses for get_handoffs().
… filter handoffs Three issues fixed per tracing conventions (local_dev/notes/tracing-conventions.md): Convention 3 — NoOpSpan poisoning: __aenter__ now skips Scope.set_current_span() when _make_agent_span() returns a NoOpSpan (no active trace, or tracing_disabled). Installing a NoOpSpan as current would cause provider.create_span() to return NoOpSpan for every subsequent child span in the same context (_is_noop_span check in provider.py). Convention 5 — Span parent fixed at create_span() call time: Before creating the incoming agent span in the handoff handler (background task), the current span context is temporarily cleared via Scope.set_current_span(None) and immediately restored. Without this, the outgoing agent span (still "current" in the background task's copied context because finish(reset_current=False) does not reset the context var) would be used as parent_id for the new span, producing incorrect parent-child nesting instead of sibling spans under the trace root. Convention 7 — is_enabled filtering for handoffs: _make_agent_span() no longer reads agent.handoffs directly. Both __aenter__ and the handoff handler now call _get_handoffs() (which applies each Handoff.is_enabled predicate) and write the filtered result to span_data.handoffs, matching the pattern used by Runner for handoff metadata. Tests added for all three behaviors: - test_no_active_trace_does_not_poison_span_context - test_disabled_handoff_excluded_from_span_metadata - test_handoff_span_is_sibling_not_child_of_initial_span Fixes openai#1845
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b61f80f557
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if self._initial_span_token is not None: | ||
| Scope.reset_current_span(self._initial_span_token) |
There was a problem hiding this comment.
Guard span-token reset to the entry task
Fresh evidence is that _cleanup() is also reachable from __aiter__ when _stored_exception is set, and close() is public, so cleanup is not guaranteed to run in the same asyncio task that created _initial_span_token in __aenter__. In those cross-task paths, Scope.reset_current_span() raises ValueError: Token ... was created in a different Context, which stops cleanup before cancelling tasks, removing the listener, and closing the model.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 0470b2d.
_cleanup() is reachable from __aexit__ (guaranteed main task), close() (any task), and __aiter__ (any task). The fix decomposes reset responsibility by task:
__aexit__now resets_initial_span_tokendirectly before callingclose()— this path is always in the same task as__aenter__, so no cross-task issue._cleanupwraps the token reset intry/except ValueErroras a safety net for calls from other tasks. A comment explains that__aexit__will handle the actual reset from the correct task.
Regression test test_cleanup_from_different_task_does_not_raise was RED before this fix (ValueError propagated and left session open) and is GREEN now.
| resolved_tools = await self._current_agent.get_all_tools(self._context_wrapper) | ||
| self._current_agent_span.span_data.tools = [ | ||
| n for t in resolved_tools if (n := get_tool_trace_name_for_tool(t)) is not None | ||
| ] or None | ||
| enabled_handoffs = await self._get_handoffs( |
There was a problem hiding this comment.
Reuse the resolved tools for span metadata
_get_updated_model_settings_from_agent() has already resolved the exact tools and handoffs used for the model configuration, but this block calls get_all_tools() and _get_handoffs() again just to populate the span. For agents with dynamic is_enabled predicates or MCP tool listings, the second resolution can have side effects or return a different set, so span_data can report tools/handoffs that were not the ones sent to the model; populate the span from the resolved lists already in the model settings instead.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 0470b2d.
_get_updated_model_settings_from_agent() now returns a 3-tuple (settings, tools, handoffs). Callers unpack the tuple and reuse the already-resolved lists:
initial_settings, resolved_tools, enabled_handoffs = (
await self._get_updated_model_settings_from_agent(...)
)
# populate span from the same resolved lists, no second resolution
span_data.tools = [t.name for t in resolved_tools]
span_data.handoffs = [h.agent_name for h in enabled_handoffs] or NoneThe handoff handler and update_agent() also unpack the tuple, eliminating all duplicate get_all_tools() / _get_handoffs() calls. Test callers updated to unpack the tuple accordingly.
…_settings_from_agent Eliminates duplicate get_all_tools() / _get_handoffs() calls by returning already-computed results as a tuple. Callers in __aenter__, update_agent(), and handoff handler now unpack the tuple; test callers updated to match. Also guard _cleanup token reset against cross-task ValueError; defer span context reset to __aexit__ which is guaranteed to run in the same task as __aenter__. Add 5 new regression tests covering NoOpSpan poisoning, disabled handoff filtering, sibling span hierarchy, cross-task cleanup, and direct close() context reset.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0470b2d0e3
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| updated_settings["tracing"] = None | ||
|
|
||
| return updated_settings | ||
| return updated_settings, list(tools or []), list(handoffs or []) |
There was a problem hiding this comment.
Use final model settings for span metadata
When model_config["initial_model_settings"] overrides tools or handoffs, updated_settings is changed above and that final config is what is passed to model.connect, but this return still gives callers the pre-override agent-resolved tools/handoffs. In that supported setup the model can receive different or no tools while the emitted AgentSpanData reports the agent defaults, so the new tracing metadata is inaccurate; derive the returned metadata from the final settings after applying starting_settings.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 9408465.
_get_updated_model_settings_from_agent now derives the returned tools and handoffs after starting_settings.update() is applied, so callers always get the final values that were sent to the model:
# Apply starting_settings (from model config) — may override tools and handoffs.
if starting_settings:
updated_settings.update(starting_settings)
# Return the final tools/handoffs AFTER overrides so span metadata matches the model.
final_tools = list(updated_settings.get("tools") or [])
final_handoffs = list(updated_settings.get("handoffs") or [])
return updated_settings, final_tools, final_handoffsNew test test_span_tool_metadata_reflects_model_config_override verifies that when model_config["initial_model_settings"]["tools"] = [] overrides the agent's tools, the span shows the empty list (not the agent defaults).
| if self._current_agent_span is not None: | ||
| self._current_agent_span.finish(reset_current=False) |
There was a problem hiding this comment.
Clear finished agent span after handoff
With the default async_tool_calls=True, the handoff runs in a background task and this finishes the outgoing agent span while the task that entered the session still has that same span installed as its current span until __aexit__. Any SDK/custom span the application starts after the handoff but before closing the session is then parented under an already-ended previous agent because DefaultTraceProvider.create_span() reads Scope.get_current_span(), so handoff traces can mis-parent user spans to the wrong agent.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 9408465 with an architectural change.
The root cause is that asyncio tasks inherit a snapshot of the parent context — a background task cannot update the main task's ContextVar. So after a handoff runs in a bg task and finishes the outgoing span, the main task's ContextVar still holds the old finished span as current. There is no way to fix this while keeping the pattern of installing agent spans as the ContextVar current span.
The fix: never install agent spans as the ContextVar current span. Scope.set_current_span is removed from __aenter__, _cleanup, and the handoff handler. Agent spans are created, tracked, and finished via self._current_agent_span without touching the context var. Consequences:
- No stale span: the main task's context is always the enclosing trace root, before and after any handoff.
- Sibling relationship is now natural: the background task inherits trace-root as current at task-creation time, so the incoming agent span's parent is the trace root without needing the
Scope.set_current_span(None)trick. _initial_span_tokenand all associated cross-task ValueError handling are removed entirely.
…adata from final settings Resolves two root-cause issues: 1. Stale current span after handoff (#3353121454): installing the initial agent span as the ContextVar current span in the main task causes it to remain current even after a handoff finishes that span in a background task. asyncio tasks inherit a snapshot of the parent context — the background task cannot update the main task's ContextVar. Remove all Scope.set_current_span calls; agent spans are created and finished without being set as current. This also eliminates _initial_span_token and all related cross-task token- reset complexity. 2. Span metadata mismatch after model_config override (#3353121446): _get_updated_model_settings_from_agent returned the pre-override tools and handoffs even when starting_settings had overridden them. Now returns updated_settings["tools"/"handoffs"] after the update() call so span metadata matches what was actually sent to the model. Add test_span_tool_metadata_reflects_model_config_override to cover case 2. Update docstrings on no-longer-ContextVar-dependent tests to reflect new design.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 94084651fa
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| updated_settings, _, _ = await self._get_updated_model_settings_from_agent( | ||
| starting_settings=None, | ||
| agent=self._current_agent, | ||
| ) |
There was a problem hiding this comment.
Start a new span when update_agent switches agents
When an already-entered realtime session is switched via the public update_agent() API, this code updates _current_agent and the model settings but leaves _current_agent_span pointing at the previous agent until cleanup. In that scenario the trace records only the original agent span, so activity after the manual agent switch is attributed to the wrong agent and no span is emitted for the new agent; the handoff path handles this by finishing the old span and starting a new one, but this equivalent switch path does not.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed. update_agent() now mirrors the handoff path exactly: it finishes the outgoing agent span (with reset_current=False), creates and starts a new span for the incoming agent (also mark_as_current=False), and populates span_data.tools / span_data.handoffs from the tuple returned by _get_updated_model_settings_from_agent().
Added test_update_agent_finishes_old_span_and_starts_new_one which asserts two agent spans are emitted (one per agent, both finished) and that the original agent span is closed before the session ends.
Add test_aenter_failure_finishes_span to cover the except BaseException cleanup path in __aenter__ — the final missing entry from the tracing conventions test checklist (Rule 9). Update the _make_agent_span docstring to reflect the current design: callers now populate span_data from _get_updated_model_settings_from_agent tuple return rather than calling get_all_tools() / _get_handoffs() separately.
update_agent() now mirrors the handoff path: finishes the outgoing agent span (reset_current=False), creates and starts a new span for the incoming agent (mark_as_current=False), and populates span metadata from _get_updated_model_settings_from_agent() tuple. Adds test_update_agent_finishes_old_span_and_starts_new_one to verify both agent spans are emitted and the original span is closed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
RealtimeSessionnever created any SDK-level trace spans, so users who wrapped their session inwith trace("...")saw empty traces on the OpenAI dashboard — only the outer trace ID was exported, no agent or tool spans inside it.This PR fixes that by:
agent_spanfor the initial agent inside__aenter__.Runner._cleanup(called fromcloseand__aexit__).run_config["tracing_disabled"] = True— when set,agent_spanis created withdisabled=Trueand no spans are emitted.The span records:
name: the agent's nametools: list ofFunctionToolnames available to the agenthandoffs: names of handoff target agentsTest plan
New file
tests/realtime/test_session_spans.pywith 6 tests covering:tracing_disabled=Truesuppresses all spansTwo existing test fixtures (
mock_agentintest_session.py,fake_agentintest_session_exceptions.py) updated to setnameandtoolson their mock agents, which_make_agent_spannow reads.Issue number
Closes #1845
Checks
make lintandmake format