Conversation
…experts are combined

Fixes #72: on a 16GB Mac Mini M4, adding --draft-model alongside --stream-experts caused RAM to spike to the physical limit and trigger swap, even though the draft model is only a 4B (~3.5GB) model.

Root causes and fixes:

1. [Bug] draftConfig.lazyLoad was never set — draft weights were eagerly paged into unified RAM. Fix: set draftConfig.lazyLoad = true when --stream-experts is active, mirroring what already happens for the main model config.
2. [Bug] Memory.cacheLimit / Memory.memoryLimit were applied after both model loads, so neither the main nor the draft model loaded under a cache budget. Fix: apply the SSD memory cap immediately after ExpertStreamingConfig.shared.activate() — before any LLMModelFactory.loadContainer() calls — so both models respect the page-cache limit throughout loading.
3. [Bug] physicalBudget did not account for the draft model's resident footprint, leaving the cap 3–4 GB too high. Fix: profile the draft model directory before loading and subtract its weightMemoryGB from physicalBudget in all three affected strategy branches (swapAssisted, layerPartitioned, early cap). A 2 GB floor guard prevents the budget from going negative on very constrained machines.

Expected result on 16GB M4:
- Draft model weights are mmap'd (lazy) — only accessed pages stay in RAM
- Both models load under the ~6GB effective page-cache budget (9.6GB - 3.5GB draft)
- No swap; total RAM stays within the SSD streaming budget
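A minimal sketch of Fix 1, assuming the property names quoted in the commit message (`streamExperts`, `modelConfig`, `draftConfig`); the surrounding code in the repo may differ:

```swift
// Sketch only — the guard and config types are assumptions from the PR text.
if streamExperts {
    modelConfig.lazyLoad = true   // existing behaviour: main model weights are mmap'd
    draftConfig.lazyLoad = true   // Fix 1: mirror the flag so draft weights are mmap'd too
}
```

With the flag mirrored, only the draft pages actually touched during speculation are resident, instead of the full ~3.5 GB.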
…draft model
- Extract computeSSDMemoryBudget() from inline formula so it can be unit tested
without loading a real model or touching Memory.cacheLimit
- Wire all three budget call sites to use the extracted function (no behaviour change)
- Add SSDMemoryBudgetTests.swift with 8 tests covering:
* Baseline 16 GB / no draft (formula correctness)
* Issue #72 regression: 16 GB + 3.5 GB draft → budget reduced by exact footprint
* Floor guard: deeply negative raw result clamped to 2 GB
* Floor value: confirmed at exactly 2 GB
* Default-arg == 0 (no silent reduction without a draft model)
* Monotonicity: larger draft → smaller or equal budget
* Typical fleet: 24 GB and 64 GB with 3.5 GB draft
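A sketch of what the extracted helper could look like, assuming the `totalRAM × 0.85 − 4 GB` formula described elsewhere in this PR and the 2 GB floor covered by the tests; the exact constants and units (GB vs GiB) in the repo may differ:

```swift
/// Hypothetical shape of the extracted budget function. Pure function of its
/// inputs, so it is unit-testable without loading a model or touching
/// Memory.cacheLimit.
func computeSSDMemoryBudget(totalRAMBytes: Int, draftWeightBytes: Int = 0) -> Int {
    let gib = 1_073_741_824
    // Base page-cache budget: 85% of RAM minus a 4 GB reserve,
    // then reserve the draft model's resident footprint.
    let raw = Int(Double(totalRAMBytes) * 0.85) - 4 * gib - draftWeightBytes
    // Floor guard: never let the budget go below 2 GB on constrained machines.
    return max(raw, 2 * gib)
}
```

The default `draftWeightBytes = 0` matches the "no silent reduction without a draft model" test, and `max` gives the monotonicity property directly.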
Pull request overview
Fixes excessive RAM growth when combining --draft-model with --stream-experts by ensuring both main and draft models load under an SSD/page-cache budget and by enabling mmap-style lazy loading for the draft model.
Changes:
- Apply an SSD memory cap earlier in startup (before any `loadContainer()` calls), reserving the estimated draft-model footprint from the budget.
- Subtract the draft-model footprint from the reported/selected `physicalBudget` in SSD streaming strategy branches.
- Mirror `lazyLoad = true` onto the draft model configuration when `--stream-experts` is active.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| Sources/SwiftLM/Server.swift | Adds early MLX cache/memory limit setup, draft-footprint reservation, and draft lazyLoad mirroring for SSD streaming. |
| .agents/workflows/review-github-pr.md | Adds a contributor/agent workflow doc for reviewing GitHub issues/PRs in this repo. |
Comments suppressed due to low confidence (2)
Sources/SwiftLM/Server.swift:391
- Same unit mismatch as above: `draftProf.weightMemoryGB` is decimal-GB based, but it's converted to bytes using a GiB multiplier (1_073_741_824). Use `draftProf.weightFileSizeBytes` or a 1e9 multiplier to avoid budget drift.
```swift
    print("[SwiftLM] 💾 Memory strategy: SSD STREAMING (page-cache managed, \(physicalBudget / (1024*1024*1024))GB RAM budget, no swap)")
} else {
```
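The size of the drift the reviewer flags can be illustrated with the numbers from this PR; this is a hypothetical standalone calculation, not code from the repo:

```swift
// Illustration of the GB/GiB mismatch for a ~3.5 GB draft model.
let weightFileSizeBytes = 3_500_000_000                 // exact size on disk
let weightMemoryGB = Double(weightFileSizeBytes) / 1e9  // 3.5 (decimal GB)
// Converting back with the GiB multiplier inflates the reservation ~7.4%:
let inflated = Int(weightMemoryGB * 1_073_741_824)      // 3_758_096_384 bytes
let exact = Int(weightMemoryGB * 1e9)                   // 3_500_000_000 bytes
```

Using `weightFileSizeBytes` directly sidesteps the round trip entirely.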
Sources/SwiftLM/Server.swift:337
- The early SSD memory cap is only applied inside `if self.streamExperts, let modelDir = modelDirectory { ... }`. If the main model ID isn't already downloaded (so `resolveModelDirectory` returns nil), neither `Memory.cacheLimit` nor the sentinel `Memory.memoryLimit` will be set before `loadContainer()`, which reintroduces the RAM spike risk on first-run downloads. Consider applying the cap whenever `--stream-experts` is set (even if `modelDirectory` is nil), and only guard the streaming activation/env-var setup on `modelDir`.
```swift
    // ── Fix #72: Apply SSD memory cap EARLY (before any model loads) ──
    // Both the main model and draft model must load under the budget.
    // The sentinel memoryLimit bypasses MLX eval_impl's spin-wait loop.
    let system = ModelProfiler.systemProfile()
    // Estimate draft model footprint to reserve headroom in the budget.
    let draftFootprintBytes: Int
    if let draftPath = self.draftModel,
       let draftDir = resolveModelDirectory(modelId: draftPath),
       let draftProfile = ModelProfiler.profile(modelDirectory: draftDir, modelId: draftPath) {
        draftFootprintBytes = Int(draftProfile.weightMemoryGB * 1_073_741_824)
        print("[SwiftLM] 📦 Draft model footprint: \(String(format: "%.1f", draftProfile.weightMemoryGB))GB reserved from SSD budget")
    } else {
        draftFootprintBytes = 0
    }
    Memory.cacheLimit = computeSSDMemoryBudget(totalRAMBytes: system.totalRAMBytes, draftWeightBytes: draftFootprintBytes)
    Memory.memoryLimit = 200 * 1024 * 1024 * 1024 // 200 GB sentinel
}
var partitionPlan: PartitionPlan?
if let modelDir = modelDirectory,
```
```swift
if let draftPath = self.draftModel,
   let draftDir = resolveModelDirectory(modelId: draftPath),
   let draftProfile = ModelProfiler.profile(modelDirectory: draftDir, modelId: draftPath) {
    draftFootprintBytes = Int(draftProfile.weightMemoryGB * 1_073_741_824)
```
ModelProfile.weightMemoryGB is computed using decimal GB (weightFileSizeBytes / 1e9), but this converts back to bytes using a GiB multiplier (1_073_741_824). This skews the reserved budget (~7%) and can under/over-cap memory. Prefer using draftProfile.weightFileSizeBytes directly, or multiply weightMemoryGB by 1e9 to stay consistent.
Suggested change:
```diff
-    draftFootprintBytes = Int(draftProfile.weightMemoryGB * 1_073_741_824)
+    draftFootprintBytes = draftProfile.weightFileSizeBytes
```
```swift
let draftFootprintBytes: Int
if let draftPath = self.draftModel,
   let draftDir = resolveModelDirectory(modelId: draftPath),
   let draftProfile = ModelProfiler.profile(modelDirectory: draftDir, modelId: draftPath) {
    draftFootprintBytes = Int(draftProfile.weightMemoryGB * 1_073_741_824)
    print("[SwiftLM] 📦 Draft model footprint: \(String(format: "%.1f", draftProfile.weightMemoryGB))GB reserved from SSD budget")
} else {
    draftFootprintBytes = 0
}
```
ModelProfiler.profile(...) walks the model directory to sum weight sizes. This PR calls it multiple times for the draft model (early cap + each strategy branch), which repeats filesystem enumeration and can slow startup. Consider computing draftFootprintBytes once (and perhaps the resolved draftDir) and reusing it for the early cap and later physicalBudget calculations/logging.
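The hoisting the reviewer suggests could be sketched as follows; the closure shape and `weightFileSizeBytes` property are taken from the review comments, but the exact integration point in `Server.swift` is an assumption:

```swift
// Sketch: resolve and profile the draft model directory exactly once,
// then reuse the cached footprint in the early cap and every strategy branch.
let draftFootprintBytes: Int = {
    guard let draftPath = self.draftModel,
          let draftDir = resolveModelDirectory(modelId: draftPath),
          let profile = ModelProfiler.profile(modelDirectory: draftDir, modelId: draftPath)
    else { return 0 }
    // Exact bytes from disk; also avoids the GB/GiB conversion drift.
    return profile.weightFileSizeBytes
}()
// Early cap, swapAssisted, and layerPartitioned all read draftFootprintBytes
// instead of re-profiling the directory.
```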
```swift
    draftReserveBytes = 0
}
```
Same unit mismatch as above: draftProf.weightMemoryGB is decimal-GB based, but it's converted to bytes using a GiB multiplier (1_073_741_824). Use draftProf.weightFileSizeBytes or a 1e9 multiplier to avoid budget drift.
- `gh` CLI path on macOS: **`/opt/homebrew/bin/gh`**
```bash
export PATH="/opt/homebrew/bin:$PATH"
which gh  # → /opt/homebrew/bin/gh
```
- `gh` must be authenticated (`gh auth status`)
- Working directory: `/Users/simba/workspace/mlx-server`
This workflow hard-codes machine-specific paths (e.g. /opt/homebrew/bin/gh, /Users/simba/workspace/mlx-server). That makes the workflow non-portable for other contributors and CI environments. Consider replacing these with placeholders and/or environment-variable based guidance (e.g. "ensure gh is on PATH" and "run from your local repo checkout").
Suggested change:

- Ensure the `gh` CLI is installed and available on your `PATH`
```bash
which gh
gh --version
```
- `gh` must be authenticated (`gh auth status`)
- Run this workflow from your local checkout of the `SharpAI/SwiftLM` repository
Two correctness issues flagged in inline review:

1. GiB/GB unit mismatch — weightMemoryGB is computed as bytes/1e9 (decimal GB), but was multiplied back to bytes using 1_073_741_824 (GiB), causing ~7% budget drift. Fix: use draftProfile.weightFileSizeBytes directly (exact bytes, no conversion needed).
2. Repeated ModelProfiler.profile() filesystem walks — the draft model directory was enumerated once in the early cap block and again in each strategy branch (swapAssisted, layerPartitioned). Fix: compute draftFootprintBytes once before the streamExperts block and reuse it everywhere.

Also addresses a third Copilot comment: the early SSD cap was only applied when modelDirectory != nil, so first-run downloads were unprotected. Now the cap is applied whenever --stream-experts is set, even if the model isn't cached yet (handled via the else-if branch).

All 8 SSDMemoryBudgetTests still pass.
Addressed all Copilot inline review comments in commit:

- 🔴 GiB/GB unit mismatch (lines 327, 366)
- 🟡 Repeated `ModelProfiler.profile()` filesystem walks
- 🟡 Early cap not applied on first-run downloads (line 337)
- 🟢 Workflow hardcoded paths — deferred as agreed (contributor docs, not a runtime bug).
Problem
Closes #72.
On a 16 GB Mac Mini M4, combining `--draft-model` with `--stream-experts` causes RAM to spike from ~10 GB → 16 GB and trigger swap, despite the draft model being only 4B (~3.5 GB). The main model alone runs fine within the SSD streaming budget.

Root Causes

🔴 Bug 1 — `draftConfig.lazyLoad` was never set

Line 296 correctly sets `modelConfig.lazyLoad = true` for the main model under `--stream-experts` so its weights are mmap'd rather than eagerly paged into unified RAM. `draftConfig` never received the same flag, so the 4B draft model always loaded all 3.5 GB eagerly into Metal buffers.

🔴 Bug 2 — `Memory.cacheLimit` / `Memory.memoryLimit` were applied after both model loads

The SSD page-cache budget was set inside the `partitionPlan` switch, which executes after `LLMModelFactory.loadContainer()` for both the main and draft models. Neither model loaded under a cache cap, allowing MLX to eagerly fill RAM.

🟡 Bug 3 — Draft model footprint not subtracted from `physicalBudget`

The formula `totalRAM × 0.85 − 4 GB` did not account for the draft model's resident pages (~3.5 GB), leaving the cap 3–4 GB too high even after the ordering was fixed.

Fix

1. Early SSD memory cap — immediately after `ExpertStreamingConfig.shared.activate()` (before any `loadContainer` call), profile the draft model directory, compute its footprint, and apply the `Memory.cacheLimit` / `Memory.memoryLimit` budget. Both the main and draft models now load under the correct page-cache budget.
2. `draftConfig.lazyLoad = true` — mirrored from the main model config when `--stream-experts` is active, so draft weights are mmap'd and only demanded pages enter RAM.
3. Draft-aware `physicalBudget` — the same footprint subtraction is applied in the `swapAssisted` and `layerPartitioned` strategy branches so the reported budget in logs and `/health` accurately reflects available headroom.

Expected result on 16 GB M4
📦 Draft model footprint: 3.5GB reserved from SSD budget
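As a sanity check, the expected headroom on a 16 GB machine works out as follows; this is a hypothetical hand calculation assuming the `totalRAM × 0.85 − 4 GB` formula quoted under Bug 3:

```swift
// Worked example of the effective budget on a 16 GB M4.
let totalGB = 16.0
let baseBudget = totalGB * 0.85 - 4.0            // 9.6 GB page-cache budget
let draftGB = 3.5                                // profiled draft footprint
let effective = max(baseBudget - draftGB, 2.0)   // ~6.1 GB, with the 2 GB floor
```

That ~6 GB figure matches the "effective page-cache budget" the commit message predicts for this configuration.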