docs(openai): document empty response from thinking-mode models#1062
Conversation
b37b587 to
601a224
Compare
|
The PR description has been updated. Please fill out the template for your PR to be reviewed. |
ajbozarth
left a comment
There was a problem hiding this comment.
A few notes inline — nothing on the error-path span cleanup before the raise since that's getting rewritten anyway.
42e83d9 to
5e01a8d
Compare
jakelorocco
left a comment
There was a problem hiding this comment.
I think I disagree with this change. Users should be able to enable thinking and we should handle that gracefully. If a model only produces thinking tokens, that is the reality of the response we received.
We can maybe log a warning or have better documentation that it's a possibility; but I think this is expected behavior.
| 'model_options={"extra_body": {"chat_template_kwargs": ' | ||
| '{"enable_thinking": False}}}.' |
There was a problem hiding this comment.
If we go this route, we should advertise the Mellea specific ModelOption.THINKING model option instead.
There was a problem hiding this comment.
While I was doing a re-review Claude made the following observation on this that's worth noting:
Jake's right that ModelOption.THINKING is the canonical Mellea-side knob, but there's a wrinkle worth flagging before we change the message.
The OpenAI backend currently maps ModelOption.THINKING to reasoning_effort (openai.py:672-686). For OpenAI proper / o-series that's correct, but for Qwen3 via vLLM the lever to disable thinking is chat_template_kwargs.enable_thinking, which is a different parameter — reasoning_effort=False would just get ignored (or rejected) by vLLM. So if we swap the error message to advertise model_options={ModelOption.THINKING: False} today, it'd send users at the exact failure mode this PR is trying to surface to a workaround that doesn't actually work for them.
Two options:
- Land this PR as-is with the raw extra_body form (which is what works on vLLM/Qwen3 today), and open a follow-up to broaden ModelOption.THINKING in the OpenAI backend so THINKING=False also emits chat_template_kwargs.enable_thinking=False for compatible providers. Then update the message.
- Do that broadening in this PR before advertising the abstraction.
I'd lean toward (1) to keep the fix scoped — the error message is already a big improvement over the silent empty string, and we shouldn't block it on a semantics change to THINKING.
There was a problem hiding this comment.
Now docs only — the runtime change is being scrapped (see my comment above). The advertised workaround moves into the integration docs rather than an error message, so the ModelOption.THINKING vs extra_body question does not apply here any more.
Closes generative-computing#1060. When a thinking-mode model (e.g. Qwen3 via vLLM with --reasoning-parser) emits only reasoning tokens, the OpenAI backend faithfully returns content=None — surfacing as result.value == "" with non-zero completion_tokens. The reasoning trace is preserved on ModelOutputThunk._thinking. This is expected behaviour, not a backend bug, so document the symptom in the OpenAI integration troubleshooting section: how to diagnose, where the reasoning content lives, and how to disable thinking via chat_template_kwargs for vLLM/Qwen3. Assisted-by: Claude Code
|
I'm thinking now we should not make this change. I hit it when trying qwen as I didn't appreciate the different behaviour. (We did make another change in the backend relating to the response field). I agree with Jake in that we can't change behaviour - if no real response is returned, well that's the response - and why we have validators etc. Even a warning is potentially noise so probably not right either. I therefore am scrapping the code change, and instead just offering a small docs tweak |
5e01a8d to
ae70f9a
Compare
jakelorocco
left a comment
There was a problem hiding this comment.
lgtm; opened #1093 to handle getting ModelOption.Thinking to work in these edge cases as well
801bbfd
Misc PR
Type of PR
Description
Document the empty-
valuesymptom that thinking-mode models exhibit on the OpenAI backend, rather than enforcing it with aRuntimeErrorinpost_processing().A model returning
content=Nonewithfinish_reason=stopand non-zerocompletion_tokensis the literal response from the API — the reasoning content is preserved onModelOutputThunk._thinking. Raising would break legitimate thinking-mode flows and bypass Mellea's sampling and validator machinery, so the right fix is discoverability, not enforcement.Adds an Empty
valuefrom a thinking-mode model subsection todocs/docs/integrations/openai.mdcovering:result.value,result.generation.usage,result._thinkingchat_template_kwargs.enable_thinking=Falseworkaround for callers who did not intend thinking moderesult._thinkingfor callers who didTesting
Docs-only change — pre-commit (including
markdownlint) passes locally.Attribution
Assisted-by: Claude Code