From 7a9bdc08e31899b1b0554ba7235a2c0a1fc5f5e1 Mon Sep 17 00:00:00 2001
From: Francesco Berardi <berryfra@gmail.com>
Date: Wed, 24 Jun 2026 18:45:03 +0200
Subject: [PATCH] [GSoC 2026] docs(chatbot): add measured latency numbers

---
 docs/IntelOwl/chatbot_tuning.md | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/docs/IntelOwl/chatbot_tuning.md b/docs/IntelOwl/chatbot_tuning.md
index 9e972728..cda027fc 100644
--- a/docs/IntelOwl/chatbot_tuning.md
+++ b/docs/IntelOwl/chatbot_tuning.md
@@ -25,10 +25,20 @@ Two consequences matter for tuning:
 ## Choosing a model
 
 The default is **`qwen2.5:3b`**. It is chosen on purpose: it is the smallest model that reliably
-picks the right tool and answers from the tool output with **usable latency on a CPU-only deploy**
-(for comparison, a 7B model such as `mistral` was markedly slower on CPU — minutes per agent round,
-often hitting the turn timeout). On stronger
-hardware you can switch to any larger tool-capable Ollama model for better answer quality.
+picks the right tool and answers from the tool output with **usable latency on a CPU-only deploy**.
+Measured end-to-end on a 14-thread CPU (no GPU), warm:
+
+- a plain answer that needs no tool returns in **~5–6 s** (first token in ~0.3 s);
+- a tool-backed answer returns in **~30–50 s** (first token in ~5–13 s, then it streams in).
+
+The first request after the model has been idle (Ollama unloads it after ~5 min) pays a one-time
+**~70 s** cold-load to read the model back into memory; later turns are warm again.
+
+Bigger is not automatically better: in the same measurement a 7B model (`mistral`) **did not emit any
+tool calls on this stack** and replied with invented data instead, so `qwen2.5:3b` is the default for
+**tool-calling reliability**, not only speed. On stronger hardware you can move to any larger
+tool-capable Ollama model — but confirm it actually calls tools (see
+[Validating a model](#validating-a-model-before-rollout)).
 
 Requirements for a replacement model: