Conversation
…experts are combined

Fixes #72: on a 16GB Mac Mini M4, adding --draft-model alongside --stream-experts caused RAM to spike to the physical limit and trigger swap, even though the draft model is only a 4B (~3.5GB) model.

Root causes and fixes:

1. [Bug] draftConfig.lazyLoad was never set — draft weights were eagerly paged into unified RAM. Fix: set draftConfig.lazyLoad = true when --stream-experts is active, mirroring what already happens for the main model config.
2. [Bug] Memory.cacheLimit / Memory.memoryLimit were applied after both model loads, so neither the main nor the draft model loaded under a cache budget. Fix: apply the SSD memory cap immediately after ExpertStreamingConfig.shared.activate() — before any LLMModelFactory.loadContainer() calls — so both models respect the page-cache limit throughout loading.
3. [Bug] physicalBudget did not account for the draft model's resident footprint, leaving the cap 3–4 GB too high. Fix: profile the draft model directory before loading and subtract its weightMemoryGB from physicalBudget in all three affected strategy branches (swapAssisted, layerPartitioned, early cap). A 2 GB floor guard prevents the budget from going negative on very constrained machines.

Expected result on 16GB M4:
- Draft model weights are mmap'd (lazy) — only accessed pages stay in RAM
- Both models load under the ~6GB effective page-cache budget (9.6GB - 3.5GB draft)
- No swap; total RAM stays within the SSD streaming budget
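A minimal sketch of Fix 1, assuming the property names quoted in the commit message (`streamExperts`, `modelConfig`, `draftConfig`); the surrounding code in the repo may differ:

```swift
// Sketch only — the guard and config types are assumptions from the PR text.
if streamExperts {
    modelConfig.lazyLoad = true   // existing behaviour: main model weights are mmap'd
    draftConfig.lazyLoad = true   // Fix 1: mirror the flag so draft weights are mmap'd too
}
```

With the flag mirrored, only the draft pages actually touched during speculation are resident, instead of the full ~3.5 GB.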
…draft model
- Extract computeSSDMemoryBudget() from inline formula so it can be unit tested
without loading a real model or touching Memory.cacheLimit
- Wire all three budget call sites to use the extracted function (no behaviour change)
- Add SSDMemoryBudgetTests.swift with 8 tests covering:
* Baseline 16 GB / no draft (formula correctness)
* Issue #72 regression: 16 GB + 3.5 GB draft → budget reduced by exact footprint
* Floor guard: deeply negative raw result clamped to 2 GB
* Floor value: confirmed at exactly 2 GB
* Default-arg == 0 (no silent reduction without a draft model)
* Monotonicity: larger draft → smaller or equal budget
* Typical fleet: 24 GB and 64 GB with 3.5 GB draft
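A sketch of what the extracted helper could look like, assuming the `totalRAM × 0.85 − 4 GB` formula described elsewhere in this PR and the 2 GB floor covered by the tests; the exact constants and units (GB vs GiB) in the repo may differ:

```swift
/// Hypothetical shape of the extracted budget function. Pure function of its
/// inputs, so it is unit-testable without loading a model or touching
/// Memory.cacheLimit.
func computeSSDMemoryBudget(totalRAMBytes: Int, draftWeightBytes: Int = 0) -> Int {
    let gib = 1_073_741_824
    // Base page-cache budget: 85% of RAM minus a 4 GB reserve,
    // then reserve the draft model's resident footprint.
    let raw = Int(Double(totalRAMBytes) * 0.85) - 4 * gib - draftWeightBytes
    // Floor guard: never let the budget go below 2 GB on constrained machines.
    return max(raw, 2 * gib)
}
```

The default `draftWeightBytes = 0` matches the "no silent reduction without a draft model" test, and `max` gives the monotonicity property directly.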
Pull request overview
Fixes excessive RAM growth when combining --draft-model with --stream-experts by ensuring both main and draft models load under an SSD/page-cache budget and by enabling mmap-style lazy loading for the draft model.
Changes:
- Apply an SSD memory cap earlier in startup (before any `loadContainer()` calls), reserving the estimated draft-model footprint from the budget.
- Subtract the draft-model footprint from the reported/selected `physicalBudget` in SSD streaming strategy branches.
- Mirror `lazyLoad = true` onto the draft model configuration when `--stream-experts` is active.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| Sources/SwiftLM/Server.swift | Adds early MLX cache/memory limit setup, draft-footprint reservation, and draft lazyLoad mirroring for SSD streaming. |
| .agents/workflows/review-github-pr.md | Adds a contributor/agent workflow doc for reviewing GitHub issues/PRs in this repo. |
Comments suppressed due to low confidence (2)
Sources/SwiftLM/Server.swift:391
- Same unit mismatch as above: `draftProf.weightMemoryGB` is decimal-GB based, but it's converted to bytes using a GiB multiplier (1_073_741_824). Use `draftProf.weightFileSizeBytes` or a 1e9 multiplier to avoid budget drift.
```swift
    print("[SwiftLM] 💾 Memory strategy: SSD STREAMING (page-cache managed, \(physicalBudget / (1024*1024*1024))GB RAM budget, no swap)")
} else {
```
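The size of the drift the reviewer flags can be illustrated with the numbers from this PR; this is a hypothetical standalone calculation, not code from the repo:

```swift
// Illustration of the GB/GiB mismatch for a ~3.5 GB draft model.
let weightFileSizeBytes = 3_500_000_000                 // exact size on disk
let weightMemoryGB = Double(weightFileSizeBytes) / 1e9  // 3.5 (decimal GB)
// Converting back with the GiB multiplier inflates the reservation ~7.4%:
let inflated = Int(weightMemoryGB * 1_073_741_824)      // 3_758_096_384 bytes
let exact = Int(weightMemoryGB * 1e9)                   // 3_500_000_000 bytes
```

Using `weightFileSizeBytes` directly sidesteps the round trip entirely.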
Sources/SwiftLM/Server.swift:337
- The early SSD memory cap is only applied inside `if self.streamExperts, let modelDir = modelDirectory { ... }`. If the main model ID isn't already downloaded (so `resolveModelDirectory` returns nil), neither `Memory.cacheLimit` nor the sentinel `Memory.memoryLimit` will be set before `loadContainer()`, which reintroduces the RAM spike risk on first-run downloads. Consider applying the cap whenever `--stream-experts` is set (even if `modelDirectory` is nil), and only guard the streaming activation/env-var setup on `modelDir`.
```swift
    // ── Fix #72: Apply SSD memory cap EARLY (before any model loads) ──
    // Both the main model and draft model must load under the budget.
    // The sentinel memoryLimit bypasses MLX eval_impl's spin-wait loop.
    let system = ModelProfiler.systemProfile()
    // Estimate draft model footprint to reserve headroom in the budget.
    let draftFootprintBytes: Int
    if let draftPath = self.draftModel,
       let draftDir = resolveModelDirectory(modelId: draftPath),
       let draftProfile = ModelProfiler.profile(modelDirectory: draftDir, modelId: draftPath) {
        draftFootprintBytes = Int(draftProfile.weightMemoryGB * 1_073_741_824)
        print("[SwiftLM] 📦 Draft model footprint: \(String(format: "%.1f", draftProfile.weightMemoryGB))GB reserved from SSD budget")
    } else {
        draftFootprintBytes = 0
    }
    Memory.cacheLimit = computeSSDMemoryBudget(totalRAMBytes: system.totalRAMBytes, draftWeightBytes: draftFootprintBytes)
    Memory.memoryLimit = 200 * 1024 * 1024 * 1024 // 200 GB sentinel
}
var partitionPlan: PartitionPlan?
if let modelDir = modelDirectory,
```
```swift
if let draftPath = self.draftModel,
   let draftDir = resolveModelDirectory(modelId: draftPath),
   let draftProfile = ModelProfiler.profile(modelDirectory: draftDir, modelId: draftPath) {
    draftFootprintBytes = Int(draftProfile.weightMemoryGB * 1_073_741_824)
```
ModelProfile.weightMemoryGB is computed using decimal GB (weightFileSizeBytes / 1e9), but this converts back to bytes using a GiB multiplier (1_073_741_824). This skews the reserved budget (~7%) and can under/over-cap memory. Prefer using draftProfile.weightFileSizeBytes directly, or multiply weightMemoryGB by 1e9 to stay consistent.
Suggested change:
```diff
-    draftFootprintBytes = Int(draftProfile.weightMemoryGB * 1_073_741_824)
+    draftFootprintBytes = draftProfile.weightFileSizeBytes
```
```swift
let draftFootprintBytes: Int
if let draftPath = self.draftModel,
   let draftDir = resolveModelDirectory(modelId: draftPath),
   let draftProfile = ModelProfiler.profile(modelDirectory: draftDir, modelId: draftPath) {
    draftFootprintBytes = Int(draftProfile.weightMemoryGB * 1_073_741_824)
    print("[SwiftLM] 📦 Draft model footprint: \(String(format: "%.1f", draftProfile.weightMemoryGB))GB reserved from SSD budget")
} else {
    draftFootprintBytes = 0
}
```
ModelProfiler.profile(...) walks the model directory to sum weight sizes. This PR calls it multiple times for the draft model (early cap + each strategy branch), which repeats filesystem enumeration and can slow startup. Consider computing draftFootprintBytes once (and perhaps the resolved draftDir) and reusing it for the early cap and later physicalBudget calculations/logging.
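The hoisting the reviewer suggests could be sketched as follows; the closure shape and `weightFileSizeBytes` property are taken from the review comments, but the exact integration point in `Server.swift` is an assumption:

```swift
// Sketch: resolve and profile the draft model directory exactly once,
// then reuse the cached footprint in the early cap and every strategy branch.
let draftFootprintBytes: Int = {
    guard let draftPath = self.draftModel,
          let draftDir = resolveModelDirectory(modelId: draftPath),
          let profile = ModelProfiler.profile(modelDirectory: draftDir, modelId: draftPath)
    else { return 0 }
    // Exact bytes from disk; also avoids the GB/GiB conversion drift.
    return profile.weightFileSizeBytes
}()
// Early cap, swapAssisted, and layerPartitioned all read draftFootprintBytes
// instead of re-profiling the directory.
```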
```swift
    draftReserveBytes = 0
}
```
Same unit mismatch as above: draftProf.weightMemoryGB is decimal-GB based, but it's converted to bytes using a GiB multiplier (1_073_741_824). Use draftProf.weightFileSizeBytes or a 1e9 multiplier to avoid budget drift.
- `gh` CLI path on macOS: **`/opt/homebrew/bin/gh`**
```bash
export PATH="/opt/homebrew/bin:$PATH"
which gh  # → /opt/homebrew/bin/gh
```
- `gh` must be authenticated (`gh auth status`)
- Working directory: `/Users/simba/workspace/mlx-server`
This workflow hard-codes machine-specific paths (e.g. /opt/homebrew/bin/gh, /Users/simba/workspace/mlx-server). That makes the workflow non-portable for other contributors and CI environments. Consider replacing these with placeholders and/or environment-variable based guidance (e.g. "ensure gh is on PATH" and "run from your local repo checkout").
Suggested change:

- Ensure the `gh` CLI is installed and available on your `PATH`
```bash
which gh
gh --version
```
- `gh` must be authenticated (`gh auth status`)
- Run this workflow from your local checkout of the `SharpAI/SwiftLM` repository
Two correctness issues flagged in inline review:

1. GiB/GB unit mismatch — weightMemoryGB is computed as bytes/1e9 (decimal GB), but was multiplied back to bytes using 1_073_741_824 (GiB), causing ~7% budget drift. Fix: use draftProfile.weightFileSizeBytes directly (exact bytes, no conversion needed).
2. Repeated ModelProfiler.profile() filesystem walks — the draft model directory was enumerated once in the early cap block and again in each strategy branch (swapAssisted, layerPartitioned). Fix: compute draftFootprintBytes once before the streamExperts block and reuse it everywhere.

Also addresses a third Copilot comment: the early SSD cap was only applied when modelDirectory != nil, so first-run downloads were unprotected. Now the cap is applied whenever --stream-experts is set, even if the model isn't cached yet (handled via the else-if branch).

All 8 SSDMemoryBudgetTests still pass.
Addressed all Copilot inline review comments in commit:

- 🔴 GiB/GB unit mismatch (lines 327, 366)
- 🟡 Repeated `ModelProfiler.profile()` filesystem walks
- 🟡 Early cap not applied on first-run downloads (line 337)
- 🟢 Workflow hardcoded paths — deferred as agreed (contributor docs, not a runtime bug).
Problem
Closes #72.
On a 16 GB Mac Mini M4, combining `--draft-model` with `--stream-experts` causes RAM to spike from ~10 GB → 16 GB and trigger swap, despite the draft model being only 4B (~3.5 GB). The main model alone runs fine within the SSD streaming budget.

Root Causes

🔴 Bug 1 — `draftConfig.lazyLoad` was never set

Line 296 correctly sets `modelConfig.lazyLoad = true` for the main model under `--stream-experts` so its weights are mmap'd rather than eagerly paged into unified RAM. `draftConfig` never received the same flag, so the 4B draft model always loaded all 3.5 GB eagerly into Metal buffers.

🔴 Bug 2 — `Memory.cacheLimit` / `Memory.memoryLimit` were applied after both model loads

The SSD page-cache budget was set inside the `partitionPlan` switch, which executes after `LLMModelFactory.loadContainer()` for both the main and draft models. Neither model loaded under a cache cap, allowing MLX to eagerly fill RAM.

🟡 Bug 3 — Draft model footprint not subtracted from `physicalBudget`

The formula `totalRAM × 0.85 − 4 GB` did not account for the draft model's resident pages (~3.5 GB), leaving the cap 3–4 GB too high even after the ordering was fixed.

Fix

1. Early SSD memory cap — immediately after `ExpertStreamingConfig.shared.activate()` (before any `loadContainer` call), profile the draft model directory, compute its footprint, and apply the `Memory.cacheLimit` / `Memory.memoryLimit` budget. Both the main and draft models now load under the correct page-cache budget.
2. `draftConfig.lazyLoad = true` — mirrored from the main model config when `--stream-experts` is active, so draft weights are mmap'd and only demanded pages enter RAM.
3. Draft-aware `physicalBudget` — the same footprint subtraction is applied in the `swapAssisted` and `layerPartitioned` strategy branches so the reported budget in logs and `/health` accurately reflects available headroom.

Expected result on 16 GB M4
📦 Draft model footprint: 3.5GB reserved from SSD budget
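As a sanity check, the expected headroom on a 16 GB machine works out as follows; this is a hypothetical hand calculation assuming the `totalRAM × 0.85 − 4 GB` formula quoted under Bug 3:

```swift
// Worked example of the effective budget on a 16 GB M4.
let totalGB = 16.0
let baseBudget = totalGB * 0.85 - 4.0            // 9.6 GB page-cache budget
let draftGB = 3.5                                // profiled draft footprint
let effective = max(baseBudget - draftGB, 2.0)   // ~6.1 GB, with the 2 GB floor
```

That ~6 GB figure matches the "effective page-cache budget" the commit message predicts for this configuration.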