
fix(ssd-stream): prevent RAM explosion when --draft-model + --stream-experts combined (#72)#76

Merged
solderzzc merged 5 commits into main from fix/issue-72-draft-model-ssd-ram
Apr 23, 2026
Conversation

@solderzzc
Member

Problem

Closes #72.

On a 16 GB Mac Mini M4, combining --draft-model with --stream-experts causes RAM to spike from ~10 GB → 16 GB and trigger swap, despite the draft model being only 4B (~3.5 GB). The main model alone runs fine within the SSD streaming budget.

Root Causes

🔴 Bug 1 — draftConfig.lazyLoad was never set

Line 296 correctly sets modelConfig.lazyLoad = true for the main model under --stream-experts so its weights are mmap'd rather than eagerly paged into unified RAM. draftConfig never received the same flag, so the 4B draft model always loaded all 3.5 GB eagerly into Metal buffers.

🔴 Bug 2 — Memory.cacheLimit / Memory.memoryLimit were applied after both model loads

The SSD page-cache budget was set inside the partitionPlan switch, which executes after LLMModelFactory.loadContainer() for both the main and draft models. Neither model loaded under a cache cap, allowing MLX to eagerly fill RAM.

🟡 Bug 3 — Draft model footprint not subtracted from physicalBudget

The formula totalRAM × 0.85 − 4 GB did not account for the draft model's resident pages (~3.5 GB), leaving the cap 3–4 GB too high even after the ordering was fixed.

Fix

  1. Early SSD memory cap — immediately after ExpertStreamingConfig.shared.activate() (before any loadContainer call), profile the draft model directory, compute its footprint, and apply the Memory.cacheLimit / Memory.memoryLimit caps. Both the main and draft models now load under the correct page-cache budget.

  2. draftConfig.lazyLoad = true — mirrored from the main model config when --stream-experts is active, so draft weights are mmap'd and only demanded pages enter RAM.

  3. Draft-aware physicalBudget — the same footprint subtraction is applied in the swapAssisted and layerPartitioned strategy branches so the reported budget in logs and /health accurately reflects available headroom.

Expected result on 16 GB M4

  • Draft model weights are mmap'd — only accessed pages resident
  • Effective budget ≈ 9.6 GB − 3.5 GB = ~6.1 GB for main-model page cache
  • Total RAM stays within the SSD streaming envelope; no swap
  • Reported log line: 📦 Draft model footprint: 3.5GB reserved from SSD budget
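The budget arithmetic above can be sketched as a stand-alone function (the name, signature, and 2 GB floor follow the PR and commit text; the exact implementation in Server.swift may differ):

```swift
// Hedged sketch of the SSD page-cache budget described in this PR.
func computeSSDMemoryBudget(totalRAMBytes: Int, draftWeightBytes: Int = 0) -> Int {
    let gib = 1 << 30
    // totalRAM × 0.85, minus a 4 GB system reserve, minus the draft footprint
    let raw = Int(Double(totalRAMBytes) * 0.85) - 4 * gib - draftWeightBytes
    // 2 GB floor guard so the budget never goes negative on small machines
    return max(raw, 2 * gib)
}

// 16 GB machine with a ~3.5 GB draft model reserved from the budget:
let budget = computeSSDMemoryBudget(totalRAMBytes: 16 * (1 << 30),
                                    draftWeightBytes: 3_500_000_000)
print(Double(budget) / 1e9)  // ≈ 6.8 decimal GB of main-model page cache
```

Note that the headline "≈ 6.1 GB" figure subtracts 3.5 from 9.6 directly; the exact number depends on whether decimal-GB or binary-GiB multipliers are used at each step — the same distinction the review comments flag below.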

…experts are combined

Fixes #72: on a 16GB Mac Mini M4, adding --draft-model alongside --stream-experts
caused RAM to spike to the physical limit and trigger swap, even though the draft
model is only a 4B (~3.5GB) model.

Root causes and fixes:
1. [Bug] draftConfig.lazyLoad was never set — draft weights were eagerly paged into
   unified RAM. Fix: set draftConfig.lazyLoad = true when --stream-experts is active,
   mirroring what already happens for the main model config.

2. [Bug] Memory.cacheLimit / Memory.memoryLimit were applied after both model loads,
   so neither the main nor draft model loaded under a cache budget. Fix: apply the
   SSD memory cap immediately after ExpertStreamingConfig.shared.activate() — before
   any LLMModelFactory.loadContainer() calls — so both models respect the page-cache
   limit throughout loading.

3. [Bug] physicalBudget did not account for the draft model's resident footprint,
   leaving the cap 3–4 GB too high. Fix: profile the draft model directory before
   loading and subtract its weightMemoryGB from physicalBudget in all three affected
   strategy branches (swapAssisted, layerPartitioned, early cap). A 2 GB floor guard
   prevents the budget going negative on very constrained machines.

Expected result on 16GB M4:
- Draft model weights are mmap'd (lazy) — only accessed pages in RAM
- Both models load under the ~6GB effective page-cache budget (9.6GB - 3.5GB draft)
- No swap; total RAM stays within the SSD streaming budget
Copilot AI review requested due to automatic review settings April 23, 2026 05:06
…draft model

- Extract computeSSDMemoryBudget() from inline formula so it can be unit tested
  without loading a real model or touching Memory.cacheLimit
- Wire all three budget call sites to use the extracted function (no behaviour change)
- Add SSDMemoryBudgetTests.swift with 8 tests covering:
    * Baseline 16 GB / no draft (formula correctness)
    * Issue #72 regression: 16 GB + 3.5 GB draft → budget reduced by exact footprint
    * Floor guard: deeply negative raw result clamped to 2 GB
    * Floor value: confirmed at exactly 2 GB
    * Default-arg == 0 (no silent reduction without a draft model)
    * Monotonicity: larger draft → smaller or equal budget
    * Typical fleet: 24 GB and 64 GB with 3.5 GB draft
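As an illustration, the monotonicity case might look like the following (the budget formula is restated locally as an assumption so the sketch runs without the package; the real tests live in SSDMemoryBudgetTests.swift):

```swift
// Local restatement of the budget formula (assumption, mirrors the PR text).
func budget(totalRAMBytes: Int, draftWeightBytes: Int) -> Int {
    let gib = 1 << 30
    return max(Int(Double(totalRAMBytes) * 0.85) - 4 * gib - draftWeightBytes, 2 * gib)
}

// A larger draft footprint must never yield a larger budget.
let ram = 16 * (1 << 30)
var previous = Int.max
for draft in [0, 1 << 30, 2 << 30, 4 << 30, 8 << 30] {
    let b = budget(totalRAMBytes: ram, draftWeightBytes: draft)
    assert(b <= previous, "budget must be non-increasing in draft size")
    previous = b
}
print("monotonicity holds")
```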
Contributor

Copilot AI left a comment


Pull request overview

Fixes excessive RAM growth when combining --draft-model with --stream-experts by ensuring both main and draft models load under an SSD/page-cache budget and by enabling mmap-style lazy loading for the draft model.

Changes:

  • Apply an SSD memory cap earlier in startup (before any loadContainer() calls), reserving estimated draft-model footprint from the budget.
  • Subtract draft-model footprint from the reported/selected physicalBudget in SSD streaming strategy branches.
  • Mirror lazyLoad = true onto the draft model configuration when --stream-experts is active.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
Sources/SwiftLM/Server.swift Adds early MLX cache/memory limit setup, draft-footprint reservation, and draft lazyLoad mirroring for SSD streaming.
.agents/workflows/review-github-pr.md Adds a contributor/agent workflow doc for reviewing GitHub issues/PRs in this repo.
Comments suppressed due to low confidence (2)

Sources/SwiftLM/Server.swift:391

  • Same unit mismatch as above: draftProf.weightMemoryGB is decimal-GB based, but it's converted to bytes using a GiB multiplier (1_073_741_824). Use draftProf.weightFileSizeBytes or a 1e9 multiplier to avoid budget drift.
                    print("[SwiftLM] 💾 Memory strategy: SSD STREAMING (page-cache managed, \(physicalBudget / (1024*1024*1024))GB RAM budget, no swap)")
                } else {

Sources/SwiftLM/Server.swift:337

  • The early SSD memory cap is only applied inside if self.streamExperts, let modelDir = modelDirectory { ... }. If the main model ID isn't already downloaded (so resolveModelDirectory returns nil), neither Memory.cacheLimit nor the sentinel Memory.memoryLimit will be set before loadContainer(), which reintroduces the RAM spike risk on first-run downloads. Consider applying the cap whenever --stream-experts is set (even if modelDirectory is nil), and only guard the streaming activation/env-var setup on modelDir.
            // ── Fix #72: Apply SSD memory cap EARLY (before any model loads) ──
            // Both the main model and draft model must load under the budget.
            // The sentinel memoryLimit bypasses MLX eval_impl's spin-wait loop.
            let system = ModelProfiler.systemProfile()
            // Estimate draft model footprint to reserve headroom in the budget.
            let draftFootprintBytes: Int
            if let draftPath = self.draftModel,
               let draftDir = resolveModelDirectory(modelId: draftPath),
               let draftProfile = ModelProfiler.profile(modelDirectory: draftDir, modelId: draftPath) {
                draftFootprintBytes = Int(draftProfile.weightMemoryGB * 1_073_741_824)
                print("[SwiftLM] 📦 Draft model footprint: \(String(format: "%.1f", draftProfile.weightMemoryGB))GB reserved from SSD budget")
            } else {
                draftFootprintBytes = 0
            }
            Memory.cacheLimit = computeSSDMemoryBudget(totalRAMBytes: system.totalRAMBytes, draftWeightBytes: draftFootprintBytes)
            Memory.memoryLimit = 200 * 1024 * 1024 * 1024 // 200 GB sentinel
        }
        
        var partitionPlan: PartitionPlan?
        if let modelDir = modelDirectory,


Comment thread Sources/SwiftLM/Server.swift Outdated
if let draftPath = self.draftModel,
let draftDir = resolveModelDirectory(modelId: draftPath),
let draftProfile = ModelProfiler.profile(modelDirectory: draftDir, modelId: draftPath) {
draftFootprintBytes = Int(draftProfile.weightMemoryGB * 1_073_741_824)

Copilot AI Apr 23, 2026


ModelProfile.weightMemoryGB is computed using decimal GB (weightFileSizeBytes / 1e9), but this converts back to bytes using a GiB multiplier (1_073_741_824). This skews the reserved budget (~7%) and can under/over-cap memory. Prefer using draftProfile.weightFileSizeBytes directly, or multiply weightMemoryGB by 1e9 to stay consistent.

Suggested change
draftFootprintBytes = Int(draftProfile.weightMemoryGB * 1_073_741_824)
draftFootprintBytes = draftProfile.weightFileSizeBytes

Comment thread Sources/SwiftLM/Server.swift Outdated
Comment on lines +323 to +331
let draftFootprintBytes: Int
if let draftPath = self.draftModel,
let draftDir = resolveModelDirectory(modelId: draftPath),
let draftProfile = ModelProfiler.profile(modelDirectory: draftDir, modelId: draftPath) {
draftFootprintBytes = Int(draftProfile.weightMemoryGB * 1_073_741_824)
print("[SwiftLM] 📦 Draft model footprint: \(String(format: "%.1f", draftProfile.weightMemoryGB))GB reserved from SSD budget")
} else {
draftFootprintBytes = 0
}

Copilot AI Apr 23, 2026


ModelProfiler.profile(...) walks the model directory to sum weight sizes. This PR calls it multiple times for the draft model (early cap + each strategy branch), which repeats filesystem enumeration and can slow startup. Consider computing draftFootprintBytes once (and perhaps the resolved draftDir) and reusing it for the early cap and later physicalBudget calculations/logging.

Comment thread Sources/SwiftLM/Server.swift Outdated
Comment on lines +365 to +366
draftReserveBytes = 0
}

Copilot AI Apr 23, 2026


Same unit mismatch as above: draftProf.weightMemoryGB is decimal-GB based, but it's converted to bytes using a GiB multiplier (1_073_741_824). Use draftProf.weightFileSizeBytes or a 1e9 multiplier to avoid budget drift.

Comment on lines +15 to +21
- `gh` CLI path on macOS: **`/opt/homebrew/bin/gh`**
```bash
export PATH="/opt/homebrew/bin:$PATH"
which gh # → /opt/homebrew/bin/gh
```
- `gh` must be authenticated (`gh auth status`)
- Working directory: `/Users/simba/workspace/mlx-server`

Copilot AI Apr 23, 2026


This workflow hard-codes machine-specific paths (e.g. /opt/homebrew/bin/gh, /Users/simba/workspace/mlx-server). That makes the workflow non-portable for other contributors and CI environments. Consider replacing these with placeholders and/or environment-variable based guidance (e.g. "ensure gh is on PATH" and "run from your local repo checkout").

Suggested change

Before:

- `gh` CLI path on macOS: **`/opt/homebrew/bin/gh`**
```bash
export PATH="/opt/homebrew/bin:$PATH"
which gh # → /opt/homebrew/bin/gh
```
- `gh` must be authenticated (`gh auth status`)
- Working directory: `/Users/simba/workspace/mlx-server`

After:

- Ensure the `gh` CLI is installed and available on your `PATH`
```bash
which gh
gh --version
```
- `gh` must be authenticated (`gh auth status`)
- Run this workflow from your local checkout of the SharpAI/SwiftLM repository

Two correctness issues flagged in inline review:

1. GiB/GB unit mismatch — weightMemoryGB is computed as bytes/1e9 (decimal GB),
   but was multiplied back to bytes using 1_073_741_824 (GiB), causing ~7% budget
   drift. Fix: use draftProfile.weightFileSizeBytes directly (exact bytes, no
   conversion needed).

2. Repeated ModelProfiler.profile() filesystem walks — the draft model directory
   was enumerated once in the early cap block and again in each strategy branch
   (swapAssisted, layerPartitioned). Fix: compute draftFootprintBytes once before
   the streamExperts block and reuse it everywhere.

Also addresses a third Copilot comment: the early SSD cap was only applied when
modelDirectory != nil, so first-run downloads were unprotected. Now the cap is
applied whenever --stream-experts is set, even if the model isn't cached yet
(handled via an else-if branch).

All 8 SSDMemoryBudgetTests still pass.
@solderzzc
Member Author

Addressed all Copilot inline review comments in commit 9b0a31c:

🔴 GiB/GB unit mismatch (lines 327, 366)
weightMemoryGB is bytes / 1e9 (decimal GB), but we were multiplying back by 1_073_741_824 (GiB multiplier) — ~7% budget drift. Fixed by using draftProfile.weightFileSizeBytes directly (exact bytes, no conversion).
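The size of that drift is easy to check in isolation (plain arithmetic, no project code involved):

```swift
// Round-tripping decimal GB through the GiB constant inflates the value
// by 2^30 / 1e9 − 1 ≈ 7.4%, the budget drift flagged in review.
let weightFileSizeBytes = 3_500_000_000               // ~3.5 GB draft weights
let weightMemoryGB = Double(weightFileSizeBytes) / 1e9
let wrongBytes = Int(weightMemoryGB * 1_073_741_824)  // the buggy conversion
let drift = Double(wrongBytes - weightFileSizeBytes) / Double(weightFileSizeBytes)
print(wrongBytes, drift)  // 3758096384, ≈ 0.0737
```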

🟡 Repeated ModelProfiler.profile() filesystem scans (line 331)
The draft model directory was enumerated separately in the early-cap block and again in each strategy branch. Fixed by computing draftFootprintBytes once before the streamExperts block and reusing it in all three call sites.

🟡 Early cap not applied on first-run downloads (line 337)
The SSD cap was inside if self.streamExperts, let modelDir = modelDirectory, so it only fired when the model was already cached. Fixed by adding an else if self.streamExperts branch that applies the cap even when the model directory is nil (download scenario).
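The resulting branch structure can be sketched with stand-ins (an illustration only — `startupActions`, `"activateStreaming"`, and `"applyCap"` are hypothetical placeholders, not SwiftLM symbols):

```swift
// Returns which startup actions run for a given flag/cache combination.
func startupActions(streamExperts: Bool, modelDirCached: Bool) -> [String] {
    if streamExperts, modelDirCached {
        // cached model: activate expert streaming for the directory, then cap
        return ["activateStreaming", "applyCap"]
    } else if streamExperts {
        // first-run download: directory unknown, but the cap must still be
        // set before loadContainer() pages weights into RAM
        return ["applyCap"]
    }
    return []  // no --stream-experts: no SSD cap needed
}

print(startupActions(streamExperts: true, modelDirCached: false))  // ["applyCap"]
```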

🟢 Workflow hardcoded paths — deferred as agreed (contributor docs, not a runtime bug).

@solderzzc solderzzc merged commit 336c8a8 into main Apr 23, 2026
9 checks passed
@solderzzc solderzzc deleted the fix/issue-72-draft-model-ssd-ram branch April 23, 2026 07:10


Development

Successfully merging this pull request may close these issues.

Streams experts not working with draft model?

2 participants