From 7a9bdc08e31899b1b0554ba7235a2c0a1fc5f5e1 Mon Sep 17 00:00:00 2001
From: Francesco Berardi
Date: Wed, 24 Jun 2026 18:45:03 +0200
Subject: [PATCH] [GSoC 2026] docs(chatbot): add measured latency numbers

---
 docs/IntelOwl/chatbot_tuning.md | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/docs/IntelOwl/chatbot_tuning.md b/docs/IntelOwl/chatbot_tuning.md
index 9e972728..cda027fc 100644
--- a/docs/IntelOwl/chatbot_tuning.md
+++ b/docs/IntelOwl/chatbot_tuning.md
@@ -25,10 +25,20 @@ Two consequences matter for tuning:

 ## Choosing a model

 The default is **`qwen2.5:3b`**. It is chosen on purpose: it is the smallest model that reliably
-picks the right tool and answers from the tool output with **usable latency on a CPU-only deploy**
-(for comparison, a 7B model such as `mistral` was markedly slower on CPU — minutes per agent round,
-often hitting the turn timeout). On stronger
-hardware you can switch to any larger tool-capable Ollama model for better answer quality.
+picks the right tool and answers from the tool output with **usable latency on a CPU-only deploy**.
+Measured end-to-end on a 14-thread CPU (no GPU), warm:
+
+- a plain answer that needs no tool returns in **~5–6 s** (first token in ~0.3 s);
+- a tool-backed answer returns in **~30–50 s** (first token in ~5–13 s, then it streams in).
+
+The first request after the model has been idle (Ollama unloads it after ~5 min) pays a one-time
+**~70 s** cold-load to read the model back into memory; later turns are warm again.
+
+Bigger is not automatically better: in the same measurement a 7B model (`mistral`) **did not emit any
+tool calls on this stack** and replied with invented data instead, so `qwen2.5:3b` is the default for
+**tool-calling reliability**, not only speed. On stronger hardware you can move to any larger
+tool-capable Ollama model — but confirm it actually calls tools (see
+[Validating a model](#validating-a-model-before-rollout)).

Requirements for a replacement model: