review (greptile): swallow metric-emission errors in except handler

x · x · commit 1935aa97e080 · 2026-05-08T00:27:49.000-04:00
If get_llm_metrics().requests.add() raises (misbehaving exporter, OTel
SDK bug, network blip mid-export), the original LLM exception would be
shadowed by the metric error. Callers — retry logic, circuit breakers,
the OpenAI Agents Temporal plugin's retryable/non-retryable
classifier — inspect the typed exception (RateLimitError,
APITimeoutError, etc.) and would silently break with an unexpected
OTel exception in its place.

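Wrap the .add() call in try/except Exception (not a bare except, so
KeyboardInterrupt and SystemExit still pass through) so the metric is
best-effort and the typed LLM exception always propagates via the
trailing bare raise.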
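
A minimal, self-contained sketch of the behavior described above. It does
not use the real agentex or OpenAI classes: RateLimitError, BrokenCounter,
call_llm and call_llm_unguarded are hypothetical stand-ins that only
illustrate how an error raised inside the except handler replaces the
original exception unless it is guarded.

```python
class RateLimitError(Exception):
    """Hypothetical stand-in for a typed LLM client error."""


class BrokenCounter:
    """Hypothetical counter whose add() always fails, like a bad exporter."""

    def add(self, amount, attributes):
        raise RuntimeError("exporter failure")


def call_llm_unguarded(counter):
    """Pre-fix shape: a metric failure escapes the handler."""
    try:
        raise RateLimitError("429")
    except Exception as e:
        counter.add(1, {"status": type(e).__name__})
        raise  # never reached when add() raises


def call_llm(counter):
    """Post-fix shape: the metric is best-effort."""
    try:
        raise RateLimitError("429")
    except Exception as e:
        try:
            counter.add(1, {"status": type(e).__name__})
        except Exception:
            pass  # drop the metric error, keep the LLM error
        raise  # re-raises the original RateLimitError


def raised_type(fn):
    """Return the type of the exception fn raises with a broken counter."""
    try:
        fn(BrokenCounter())
    except Exception as err:
        return type(err)
    return None
```

With these stand-ins, raised_type(call_llm_unguarded) is RuntimeError (the
RateLimitError survives only as __context__), while raised_type(call_llm)
is RateLimitError, which is what a retryable/non-retryable classifier
needs to see.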
diff --git a/src/agentex/lib/core/temporal/plugins/openai_agents/models/temporal_streaming_model.py b/src/agentex/lib/core/temporal/plugins/openai_agents/models/temporal_streaming_model.py
@@ -1072,10 +1072,16 @@ async def get_response(
                 logger.error(f"Error using Responses API: {e}")
                 # Emit a request-counter event so 429s, 5xxs, timeouts, etc. are
                 # observable on the SDK side. Status histograms / token counters
-                # only fire on successful completion above.
-                get_llm_metrics().requests.add(
-                    1, {"model": self.model_name, "status": classify_status(e)}
-                )
+                # only fire on successful completion above. Wrapped in
+                # try/except Exception so a misbehaving exporter can't shadow the
+                # original LLM exception: callers (retry logic, circuit breakers)
+                # need to see the typed RateLimitError / APITimeoutError / etc.
+                try:
+                    get_llm_metrics().requests.add(
+                        1, {"model": self.model_name, "status": classify_status(e)}
+                    )
+                except Exception:
+                    pass
                 raise
 
     # The _get_response_with_responses_api method has been merged into get_response above