Streams experts not working with draft model? #72

@hksfho

Description

Without a draft model, my Mac Mini M4 (16GB) gets 4 t/s, with 10GB of RAM occupied in total (including system processes). However, when I enable the draft model, even though it is only 4B (3.5GB), RAM usage hits 16GB and the system starts swapping.
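For reference, a naive accounting of the figures quoted above (a rough sketch; `baseline_gb` and `draft_gb` are simply the numbers reported in this issue, not measurements of SwiftLM internals):

```python
# Back-of-envelope check: resident draft weights alone should not
# be enough to push the machine from 10GB to the 16GB ceiling.
baseline_gb = 10.0  # observed total RAM use without the draft model
draft_gb = 3.5      # quoted size of the 4B 4-bit draft model

total_gb = baseline_gb + draft_gb
print(total_gb)  # 13.5
```

So even with the draft weights fully resident, the naive total is ~13.5GB, still under 16GB, which is why the jump to swapping is surprising.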

benluk@Bens-Mac-mini SwiftLM % ./SwiftLM --model ../Models/mlx-community/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-4bit --stream-experts --ssd-prefetch --turbo-kv --temp 0.6 --top-p 0.95 --top-k 20 --max-tokens 65536
[SwiftLM] Loading model: ../Models/mlx-community/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-4bit
[SwiftLM] Loading from local directory: ../Models/mlx-community/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-4bit
[SwiftLM] Enabled Async SSD Streaming on directory: Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-4bit
[SwiftLM] 💾 Memory strategy: SSD STREAMING (page-cache managed, 9GB RAM budget, no swap)
[SwiftLM] SSD Streaming active: Bypassing CPU auto-partitioning (forcing all layers to GPU)
[SwiftLM] Loading LLM (large language model)...
[SwiftLM] Loaded model configuration. Inferred tool call format: Optional(MLXLMCommon.ToolCallFormat.xmlFunction)
[SwiftLM] 💾 SSD Expert Streaming enabled (lazy load + layer-sync)
[SwiftLM] 🚀 PAPPS 16-Worker Thread Pool prefetcher enabled!
[SwiftLM] 🧠 Auto-calibration (Wisdom) bypassed for SSD Streaming
[SwiftLM] Model loaded. Starting HTTP server on 127.0.0.1:5413
[SwiftLM] Config: ctx_size=model_default, temp=0.6, top_p=0.95, top_k=20, min_p=disabled, repeat_penalty=disabled, parallel=1, cors=disabled, mem_limit=system_default, auth=disabled, thinking=disabled, ssd_stream=enabled, turbo_kv=enabled
[SwiftLM] ✅ Ready. Listening on http://127.0.0.1:5413
{"engine":"mlx","vision":false,"partition":{"total_layers":40,"total_required_gb":24.800000000000001,"model_weight_gb":20.399999999999999,"ssd_stream":true,"gpu_layers":40,"kv_cache_gb":0.29999999999999999,"cpu_layers":0,"system_ram_gb":17.199999999999999,"estimated_tok_s":6.5,"strategy":"ssd_streaming","overcommit_ratio":1.8799999999999999},"model":"..\/Models\/mlx-community\/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-4bit","event":"ready","port":5413}
2026-04-23T00:00:19+0800 info Hummingbird: [HummingbirdCore] Server started and listening on 127.0.0.1:5413
[Server Debug] Created UserInput with 0 images and 0 audio inputs.
srv  slot_launch: id 0 | prompt=17t | thinking=false | prefilling...
srv  slot update: id 0 | prefill done | n_tokens=17, t=4.34s, 3.9t/s | OS_RAM=3.9GB | MEM_DEMAND=4.1GB | GPU_MEM=3.7GB
srv  generate: id 0 | Thinking Process:

1.  **Analyze the Request:** The user is asking a simple arithmetic question: "What is 2^C
[SwiftLM] Received SIGINT, shutting down gracefully...
zsh: segmentation fault  ./SwiftLM --model  --stream-experts --ssd-prefetch --turbo-kv --temp 0.6  0.9
benluk@Bens-Mac-mini SwiftLM % 
benluk@Bens-Mac-mini SwiftLM % ./SwiftLM --model ../Models/mlx-community/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-4bit --draft-model ../Models/mlx-community/Huihui-Qwen3.5-4B-Claude-4.6-Opus-abliterated-4bit --num-draft-tokens 4 --stream-experts --ssd-prefetch --turbo-kv --temp 0.6 --top-p 0.95 --top-k 20 --max-tokens 65536
[SwiftLM] Loading model: ../Models/mlx-community/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-4bit
[SwiftLM] Loading from local directory: ../Models/mlx-community/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-4bit
[SwiftLM] Enabled Async SSD Streaming on directory: Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-4bit
[SwiftLM] 💾 Memory strategy: SSD STREAMING (page-cache managed, 9GB RAM budget, no swap)
[SwiftLM] SSD Streaming active: Bypassing CPU auto-partitioning (forcing all layers to GPU)
[SwiftLM] Loading LLM (large language model)...
[SwiftLM] Loaded model configuration. Inferred tool call format: Optional(MLXLMCommon.ToolCallFormat.xmlFunction)
[SwiftLM] Loading draft model for speculative decoding: ../Models/mlx-community/Huihui-Qwen3.5-4B-Claude-4.6-Opus-abliterated-4bit
[SwiftLM] Draft model loaded successfully (4 tokens/round)
[SwiftLM] 💾 SSD Expert Streaming enabled (lazy load + layer-sync)
[SwiftLM] 🚀 PAPPS 16-Worker Thread Pool prefetcher enabled!
[SwiftLM] 🧠 Auto-calibration (Wisdom) bypassed for SSD Streaming
[SwiftLM] Model loaded. Starting HTTP server on 127.0.0.1:5413
[SwiftLM] Config: ctx_size=model_default, temp=0.6, top_p=0.95, top_k=20, min_p=disabled, repeat_penalty=disabled, parallel=1, cors=disabled, mem_limit=system_default, auth=disabled, thinking=disabled, ssd_stream=enabled, turbo_kv=enabled
[SwiftLM] ✅ Ready. Listening on http://127.0.0.1:5413
{"engine":"mlx","vision":false,"port":5413,"event":"ready","model":"..\/Models\/mlx-community\/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-4bit","partition":{"overcommit_ratio":1.8799999999999999,"ssd_stream":true,"estimated_tok_s":6.5,"model_weight_gb":20.399999999999999,"gpu_layers":40,"strategy":"ssd_streaming","cpu_layers":0,"total_required_gb":24.800000000000001,"total_layers":40,"kv_cache_gb":0.29999999999999999,"system_ram_gb":17.199999999999999}}
2026-04-23T00:00:40+0800 info Hummingbird: [HummingbirdCore] Server started and listening on 127.0.0.1:5413
[Server Debug] Created UserInput with 0 images and 0 audio inputs.
srv  slot_launch: id 0 | prompt=17t | thinking=false | prefilling...
[SwiftLM] Using speculative decoding (4 draft tokens/round)
^C
