Streams experts not working with draft model? #72

@hksfho

Description

Without a draft model, my Mac Mini M4 (16GB) gets 4 t/s, with 10GB of RAM occupied in total (including system processes). However, when I enable the draft model, even though it is only 4B (3.5GB), RAM usage hits 16GB and the system starts swapping.
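For reference, a naive accounting of the figures quoted above (a rough sketch; `baseline_gb` and `draft_gb` are simply the numbers reported in this issue, not measurements of SwiftLM internals):

```python
# Back-of-envelope check: resident draft weights alone should not
# be enough to push the machine from 10GB to the 16GB ceiling.
baseline_gb = 10.0  # observed total RAM use without the draft model
draft_gb = 3.5      # quoted size of the 4B 4-bit draft model

total_gb = baseline_gb + draft_gb
print(total_gb)  # 13.5
```

So even with the draft weights fully resident, the naive total is ~13.5GB, still under 16GB, which is why the jump to swapping is surprising.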

benluk@Bens-Mac-mini SwiftLM % ./SwiftLM --model ../Models/mlx-community/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-4bit --stream-experts --ssd-prefetch --turbo-kv --temp 0.6 --top-p 0.95 --top-k 20 --max-tokens 65536
[SwiftLM] Loading model: ../Models/mlx-community/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-4bit
[SwiftLM] Loading from local directory: ../Models/mlx-community/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-4bit
[SwiftLM] Enabled Async SSD Streaming on directory: Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-4bit
[SwiftLM] 💾 Memory strategy: SSD STREAMING (page-cache managed, 9GB RAM budget, no swap)
[SwiftLM] SSD Streaming active: Bypassing CPU auto-partitioning (forcing all layers to GPU)
[SwiftLM] Loading LLM (large language model)...
[SwiftLM] Loaded model configuration. Inferred tool call format: Optional(MLXLMCommon.ToolCallFormat.xmlFunction)
[SwiftLM] 💾 SSD Expert Streaming enabled (lazy load + layer-sync)
[SwiftLM] 🚀 PAPPS 16-Worker Thread Pool prefetcher enabled!
[SwiftLM] 🧠 Auto-calibration (Wisdom) bypassed for SSD Streaming
[SwiftLM] Model loaded. Starting HTTP server on 127.0.0.1:5413
[SwiftLM] Config: ctx_size=model_default, temp=0.6, top_p=0.95, top_k=20, min_p=disabled, repeat_penalty=disabled, parallel=1, cors=disabled, mem_limit=system_default, auth=disabled, thinking=disabled, ssd_stream=enabled, turbo_kv=enabled
[SwiftLM] ✅ Ready. Listening on http://127.0.0.1:5413
{"engine":"mlx","vision":false,"partition":{"total_layers":40,"total_required_gb":24.800000000000001,"model_weight_gb":20.399999999999999,"ssd_stream":true,"gpu_layers":40,"kv_cache_gb":0.29999999999999999,"cpu_layers":0,"system_ram_gb":17.199999999999999,"estimated_tok_s":6.5,"strategy":"ssd_streaming","overcommit_ratio":1.8799999999999999},"model":"..\/Models\/mlx-community\/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-4bit","event":"ready","port":5413}
2026-04-23T00:00:19+0800 info Hummingbird: [HummingbirdCore] Server started and listening on 127.0.0.1:5413
[Server Debug] Created UserInput with 0 images and 0 audio inputs.
srv  slot_launch: id 0 | prompt=17t | thinking=false | prefilling...
srv  slot update: id 0 | prefill done | n_tokens=17, t=4.34s, 3.9t/s | OS_RAM=3.9GB | MEM_DEMAND=4.1GB | GPU_MEM=3.7GB
srv  generate: id 0 | Thinking Process:

1.  **Analyze the Request:** The user is asking a simple arithmetic question: "What is 2^C
[SwiftLM] Received SIGINT, shutting down gracefully...
zsh: segmentation fault  ./SwiftLM --model  --stream-experts --ssd-prefetch --turbo-kv --temp 0.6  0.9
benluk@Bens-Mac-mini SwiftLM % 
benluk@Bens-Mac-mini SwiftLM % ./SwiftLM --model ../Models/mlx-community/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-4bit --draft-model ../Models/mlx-community/Huihui-Qwen3.5-4B-Claude-4.6-Opus-abliterated-4bit --num-draft-tokens 4 --stream-experts --ssd-prefetch --turbo-kv --temp 0.6 --top-p 0.95 --top-k 20 --max-tokens 65536
[SwiftLM] Loading model: ../Models/mlx-community/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-4bit
[SwiftLM] Loading from local directory: ../Models/mlx-community/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-4bit
[SwiftLM] Enabled Async SSD Streaming on directory: Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-4bit
[SwiftLM] 💾 Memory strategy: SSD STREAMING (page-cache managed, 9GB RAM budget, no swap)
[SwiftLM] SSD Streaming active: Bypassing CPU auto-partitioning (forcing all layers to GPU)
[SwiftLM] Loading LLM (large language model)...
[SwiftLM] Loaded model configuration. Inferred tool call format: Optional(MLXLMCommon.ToolCallFormat.xmlFunction)
[SwiftLM] Loading draft model for speculative decoding: ../Models/mlx-community/Huihui-Qwen3.5-4B-Claude-4.6-Opus-abliterated-4bit
[SwiftLM] Draft model loaded successfully (4 tokens/round)
[SwiftLM] 💾 SSD Expert Streaming enabled (lazy load + layer-sync)
[SwiftLM] 🚀 PAPPS 16-Worker Thread Pool prefetcher enabled!
[SwiftLM] 🧠 Auto-calibration (Wisdom) bypassed for SSD Streaming
[SwiftLM] Model loaded. Starting HTTP server on 127.0.0.1:5413
[SwiftLM] Config: ctx_size=model_default, temp=0.6, top_p=0.95, top_k=20, min_p=disabled, repeat_penalty=disabled, parallel=1, cors=disabled, mem_limit=system_default, auth=disabled, thinking=disabled, ssd_stream=enabled, turbo_kv=enabled
[SwiftLM] ✅ Ready. Listening on http://127.0.0.1:5413
{"engine":"mlx","vision":false,"port":5413,"event":"ready","model":"..\/Models\/mlx-community\/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-4bit","partition":{"overcommit_ratio":1.8799999999999999,"ssd_stream":true,"estimated_tok_s":6.5,"model_weight_gb":20.399999999999999,"gpu_layers":40,"strategy":"ssd_streaming","cpu_layers":0,"total_required_gb":24.800000000000001,"total_layers":40,"kv_cache_gb":0.29999999999999999,"system_ram_gb":17.199999999999999}}
2026-04-23T00:00:40+0800 info Hummingbird: [HummingbirdCore] Server started and listening on 127.0.0.1:5413
[Server Debug] Created UserInput with 0 images and 0 audio inputs.
srv  slot_launch: id 0 | prompt=17t | thinking=false | prefilling...
[SwiftLM] Using speculative decoding (4 draft tokens/round)
^C
