Skip to content

feat(persistent): in-place solver updates (Solver framework + HiGHS/Gurobi/Xpress/Mosek)#718

Open
FabianHofmann wants to merge 22 commits into
masterfrom
feat/solver-update
Open

feat(persistent): in-place solver updates (Solver framework + HiGHS/Gurobi/Xpress/Mosek)#718
FabianHofmann wants to merge 22 commits into
masterfrom
feat/solver-update

Conversation

@FabianHofmann
Copy link
Copy Markdown
Collaborator

Changes proposed in this Pull Request

Adds an opt-in persistent-update framework so a built solver can be re-solved against a mutated Model without a full rebuild.

Core

  • linopy.persistent: ModelSnapshot, ModelDiff, StructuralKey, VarKind, RebuildReason. ModelDiff stores changes in flat-native arrays (bounds / var-types / RHS / signs / COO coefs / linear objective / sense) plus per-container VarSlice / ConSlice views.
  • ModelDiff.from_snapshot(snap, model) and ModelDiff.from_models(m1, m2) — snapshot-based and snapshot-free diffs.
  • _coef_dirty flag on constraints with RHS-setter short-circuit so RHS-only edits skip the coefficient re-walk.

Solver orchestration

  • Solver gains track_updates, lazy-build (first solve(model, …) builds), apply_update, update, disallow_rebuild, and structured rebuild reasons. Backends without persistent-update support short-circuit to rebuild.
  • Per-backend apply_update:
    • HiGHSchangeColsBounds / changeColsIntegrality / changeRowBounds / changeCoeff / changeColsCost / changeObjectiveSense. Sign change → rebuild.
    • GurobisetAttr(LB/UB/VType/RHS/Sense/Obj), chgCoeff, ModelSense. In-place sign change.
    • Xpresschgbounds / chgcoltype / chgrhs / chgrowtype / chgmcoef / chgobj / chgobjsense. In-place sign change.
    • Mosekchgvarbound / chgconbound / putvartypelist / putaijlist / putclist / putobjsense. Sign change → rebuild.

Tests

  • New test_persistent_snapshot_diff.py covering all ModelDiff semantics.
  • New parametrized test_persistent_apply_update.py running 9 cases × 4 backends (skipped per backend when license/installation is unavailable).
  • Cross-instance, pickle, and threading coverage in test_persistent_solver_extras.py.

Checklist

  • Code changes are sufficiently documented; i.e. new functions contain docstrings and further explanations may be given in doc.
  • Unit tests for new features were added (if applicable).
  • A note for the release notes doc/release_notes.rst of the upcoming release is included.
  • I consent to the release of this PR's code under the MIT license.

FabianHofmann and others added 19 commits May 19, 2026 15:19
Tracks per-Constraint coefficient mutation via a single boolean slot,
flipped in coeffs/vars/lhs setters. Pure-constant rhs writes now
short-circuit and leave coeffs/vars buffers untouched (by identity),
so rhs-only updates don't trigger expensive coefficient recompare on
the persistent-solver fast path.
Pure-Python snapshot primitives for the persistent-solver Phase 1.
Deep-copies value-side fields (var_lb/ub, con_rhs/sign, obj_linear),
holds vlabels/clabels by reference, stores canonical CSR
(indptr, indices) per constraint container. No Solver import.
Pure-function diff for the persistent-solver Phase 1. Detects
structural, coord, sparsity, quadratic-objective, value-only var/con,
and objective-linear/sense changes. Supports same_model fast path
via _coef_dirty and cross-model full re-scan. Includes a focused
test suite covering capture, mutation paths, deep-copy invariant,
and the same_model toggle.
- supports_persistent_update class flag (default False)
- snapshot/_rebuilds/_in_place_updates/_last_rebuild_reason fields
- snapshot capture at end of direct _build, _clear_coef_dirty helper
- apply_update stub raising UnsupportedUpdate
- solve(model, assign) dispatcher with diff-or-rebuild path
- update(model, apply=True) primitive returning ModelDiff
- threading.Lock around diff+apply+resnapshot
- __getstate__/__setstate__ drop native handle and snapshot
…date support

Skip diff computation entirely when supports_persistent_update is False
on apply, per plan: 'dispatcher checks flag before calling — if False,
skips diffing entirely and goes to rebuild.'
Replace xarray-based snapshot and CSR pattern compare with per-row
canonicalised numpy buffers; new ContainerVarUpdate / ContainerRowUpdate
payloads. Gurobi/HiGHS apply_update rewritten around batched setAttr /
changeColsBounds / changeColsCost / changeColsIntegrality; coefficient
writes touch only changed cells. Cross-model diff now ~matches same-model
cost for bound/rhs/coef-value sweeps.
compute_diff/Solver.solve/Solver.update grow an ignore_dims kwarg.
None (default) keeps the current no-coord-check behaviour;
any iterable opts into per-container coord-equality on every dim
not in the set, supporting rolling-horizon workflows where e.g.
the snapshot dim is expected to drift.
…_rebuild

- Solver.from_name now accepts model=None; the first solve(m, ...) builds.
- compute_diff folded into ModelDiff.from_snapshot classmethod; new
  ModelDiff.from_models diffs two linopy models directly.
- Solver.solve grows disallow_rebuild=True, which raises
  RebuildRequiredError instead of falling back to a rebuild.
…m_models

- Add `track_updates` flag (default False) to Solver; skip ModelSnapshot
  capture when disabled. Raise UpdatesDisabledError on solve(model)/update()
  if a built solver was constructed without tracking.
- Rewrite ModelDiff.from_models to build directly from two models without
  capturing snapshots; share helpers with from_snapshot.
- Update persistent tests to opt into track_updates=True; add coverage
  for the disabled path.
Cross-instance resolves now diff via from_models against the previously
built model, with no snapshot. Same-instance mutation still raises
UpdatesDisabledError. Snapshot recapture is skipped in this mode.
Add cross-instance solve/update tests for the no-snapshot path.
Collapse _diff_objective QUAD_OBJ branches; cache n_coef_updates;
short-circuit _canonicalize_rows when rows already sorted; tighten
buffer extraction. Introduce VarKind enum used across snapshot/diff
and HiGHS/Gurobi apply_update; reuse linopy.constants sign tokens.
Move _clear_coef_dirty into ModelSnapshot.capture.
Source con buffers from Constraint.to_matrix_with_rhs, replacing the
dense (n_rows, max_n_term) arrays with CSR (indptr, indices, data).
Sign dtype adopts 'U1' across the persistent layer and apply_update
in HiGHS/Gurobi consumes CSR-slice payloads instead of -1 masks.
Deletes _canonicalize_rows and the _INT64_MAX sentinel.
Replace per-container ContainerVarUpdate/ContainerRowUpdate dicts with
flat arrays (var_bounds_*, var_type_*, con_coef_* COO, con_rhs_*,
con_sign_*) plus VarSlice/ConSlice per-container offsets for
diagnostics. Add con_rhs_as_bounds() for ranged-row solvers. Backend
apply_update bodies collapse to flat-array calls; remove duplicated
label->position resolution.
Implement in-place model updates for Xpress (chgbounds/chgrhs/chgmcoef/
chgrowtype/chgobj/chgobjsense/chgcoltype) and Mosek (chgvarbound/
chgconbound/putaijlist/putclist/putvartypelist/putobjsense). Mosek
rejects constraint sign change to trigger rebuild. Consolidate
gurobi/highs apply_update tests into a single parametrized file that
also covers xpress and mosek.
@FabianHofmann FabianHofmann requested review from FBumann and coroa May 21, 2026 15:21
@FabianHofmann
Copy link
Copy Markdown
Collaborator Author

@FBumann here we go, if you want to take a first look. docs to come

@FBumann
Copy link
Copy Markdown
Collaborator

FBumann commented May 22, 2026

@FBumann here we go, if you want to take a first look. docs to come

Ill have a look ltoday

* hold solver lock through _run_direct so two threads calling
  solve(model) on the same Solver no longer race on the native handle
  (HiGHS returned 0.0 from the second concurrent solve).
* narrow Optional ndarrays in persistent.diff.push_var / push_con and
  in HiGHS/Gurobi/Xpress/Mosek apply_update objective paths.
* widen Constraint.rhs setter to ExpressionLike | VariableLike |
  ConstantLike to match the as_expression call in the body.
* widen Constraints.__getitem__(str) return type to Constraint (the
  dominant case) so tests can set .rhs/.coeffs/.sign without ignores.
* add docs for in-place solver updates.
@FBumann
Copy link
Copy Markdown
Collaborator

FBumann commented May 22, 2026

@FabianHofmann Sorry, I wont be able to review this today.

@FabianHofmann
Copy link
Copy Markdown
Collaborator Author

@FabianHofmann Sorry, I wont be able to review this today.

take your time, there is no hurry. I'll do some integration tests anyway

@FBumann
Copy link
Copy Markdown
Collaborator

FBumann commented May 22, 2026

Are there any conflicts with #717? Are we sure we want to merge and publish this before we go v1?

Just making sure...

@MaykThewessen
Copy link
Copy Markdown
Contributor

MaykThewessen commented May 23, 2026

Benched PR #718 ModelDiff vs hand-rolled changeRowsBounds path on a synthetic 69K-var / 160K-row DC-OPF, 8 weekly-style chunks (RHS + objective mutations only, identical structure across chunks). Per-chunk averages, chunk 0 excluded:

route t_build t_diff t_apply t_solve t_total in_place rebuilds
blueprint + changeRowsBounds (hand-rolled) 0.000 0.001 0.006 0.759 0.766 0 0
ModelDiff sweep (fresh Model per chunk) 0.222 0.012 0.039 12.365 12.652 7 0
ModelDiff in-place (mutate model.constraints[name].rhs.values[...]) 0.003 0.010 0.041 4.919 4.982 7 0
ModelDiff in-place + manual solver.solver_model.clearSolver() 0.003 0.010 0.041 0.572 0.637 7 0

Diff compute + apply_update sum to ~50 ms — those are not the bottleneck. The regression is entirely in t_solve: after in-place RHS mutation the persistent path keeps HiGHS's prior basis, and HiGHS then skips presolve ("Solving LP with useful basis so presolve not used"). For LPs with strong presolve reduction (89 % row reduction on our production model, similar on this synthetic) the skipped presolve costs more than the warm basis saves.

Manually calling solver.solver_model.clearSolver() between solver.update() and solver.solve() recovers presolve and brings the persistent path slightly under the hand-rolled baseline. But that defeats the encapsulation — the user reaches past the Solver API to touch the native HiGHS handle.

Suggestion: expose an opt-in on the persistent API so users with presolve-heavy LPs can drop the warm basis without bypassing the encapsulation. Either:

  • a keep_basis: bool = True kwarg on Solver.solve() / Solver.update(), or
  • a Solver.clear_basis() method on the base class with backend-specific implementations (Highs.clearSolver(), Gurobi reset(), etc.).

Zero structural rebuilds observed across all 7 re-solves in both ModelDiff routes (RebuildReason.NONE). The diff correctly classifies these as RHS + coefficient-only changes — the persistent API is doing its job, the missing piece is just the basis-clear hook.

Caveats:

@MaykThewessen
Copy link
Copy Markdown
Contributor

Production-scale follow-up: PSA-MEE Validation_w2025 LP measures 1.14 M cols × 1.97 M rows per weekly chunk (DC-OPF + ramp-penalty aux vars). The persistent-update phase — changeRowsBounds + changeColsCost on the time-varying rows (~15 % of the matrix), the equivalent of solver.update(model, apply=True) in the PR #718 API — completes in 0.16 s at this scale. Linear scaling from the synthetic bench's ~50 ms, so ModelDiff's diff + apply overhead stays negligible at production size.

Solve-phase comparison (B vs SQ vs B+clear) at production scale couldn't be captured this round — the reduced-pipeline harness wasn't LP-feasible past chunk 1, so HiGHS timings came from an infeasibility path and would mislead. Will retry once the harness uses the full PSA-MEE chunk loop with storage + N-1 enabled. The presolve-skip mechanism that drove the synthetic regression (HiGHS logging "Solving LP with useful basis so presolve not used" after warm in-place RHS updates) is shape-independent, so I expect the gap to widen with the production LP (89 % presolve reduction on this model — more lost when skipped) — but not posting that as a result until it's measured.

Bottom line for the PR scope: the solver.update(...) / solver.solve(model) mechanism is the right primitive and the linear scaling holds. The remaining ask is still the opt-in basis-clear hook — without it, every backend that gates presolve on basis presence (HiGHS does, others may) leaves the persistent path slower than solve_problem_from_model on warm-state-hostile LPs.

@FBumann
Copy link
Copy Markdown
Collaborator

FBumann commented May 24, 2026

@MaykThewessen @FabianHofmann
I would argue that we should (for now) not provide methods like clear_basis()(or clear_mip_state(), clear_pdlp_iterate(), clear_scaling() etc.).
Those are properly exposed by the Native solver instances and support and naming differs between solvers.
Users should instead do Model.solver.solver_model.*. This also saves us from differing in default behaviour from the respective solver.

We might be able to add such later. But I'd definitely defer it from this first PR. We should put a note into the docs about users maybe profiting from calling such methods on the solver directly.

One downside: We propose suage of Model.solver.solver_model, which should therefore stay stable.

@FBumann
Copy link
Copy Markdown
Collaborator

FBumann commented May 24, 2026

I read through the code and overall api. I would propose the following api guide lines:

1. Mutation API: Variable.update() / Constraint.update() — and only that

m.variables["x"].update(lower=new_lb, upper=new_ub)
m.constraints["demand"].update(rhs=new_rhs, sign="<=")
  • One canonical mutation path. Routes through validation&alignment; flips _coef_dirty in one place.
  • Drop the existing .lower / .upper / .rhs / .sign / .lhs / .coeffs / .vars setters. The read-side properties stay; mutation via = becomes AttributeError with a pointer to .update(). Easy to add back if users push back — much harder to remove later once two APIs coexist.
  • The Constraint.rhs.setter magic (rhs=Variable/Expression silently rewriting lhs) becomes an explicit method.
  • .values[...] stays possible (xarray exposes it) but not supported: snapshot diff stays array-based as a safety net, (needed to compare different models anyway), but no contract that direct mutation propagates.

2. No clear_basis() — use solver.solver_model directly

solver.apply(model_new)
solver.solver_model.clearSolver()    # backend-specific
solver.solve(assign=True)
  • Each backend has one obvious method; no shared logic to abstract.
  • Adding clear_basis() opens a cascade (clear_mip_state, clear_pdlp_iterate, clear_scaling, …) where every method needs per-backend semantics that don't unify cleanly (Mosek IPM has no "basis" in HiGHS's sense).
  • Linopy's value is build + diff + apply. Native state management is the user's.
  • Document the escape hatch in the persistent-solver guide with HiGHS + Gurobi recipes.

3. Split Solver.update(model, apply=...) into diff() / apply()

call reads model mutates handle runs solver
solver.diff(model)
solver.apply(model)
solver.solve(model=...)
  • update(model, apply=True|False) does two structurally different things behind one bool; default apply=True silently mutates.
  • Name collides with gurobipy.Model.update() (lazy-flush attribute changes) — different semantic.
  • Split makes the basis-clear flow legible (apply separately from solve).

Good news: No conflicts with the new v1 semantics convention

@FBumann
Copy link
Copy Markdown
Collaborator

FBumann commented May 24, 2026

Besides the v1 semantics, im also working on making add_variables() and the underlying as_dataarray more predictable and stictly follow the passed coords. THis should simplify some validation here i think.
See #722 #725 #726

@MaykThewessen
Copy link
Copy Markdown
Contributor

@FBumann Agreed: drop clear_basis() (and siblings) from this PR.

Native handle via Model.solver.solver_model.* already covers the power-user case, and any wrapper would have to flatten genuinely divergent semantics:

solver basis/state reset
HiGHS h.clearSolver() (drops basis + dual ray)
Gurobi m.reset(0) solution only, m.reset(1) incl. basis
Xpress p.postsolve() / p.clearmipfilter() (MIP)
Mosek task.putintparam(iparam.bi_clean_optimizer, ...)
CPLEX c.cleanup(epsilon)
PDLP iterate is not user-exposed at all

A uniform clear_basis() would either lie about what it cleared (HiGHS clears more than basis; Gurobi reset(1) clears less than HiGHS clearSolver) or force per-solver branching back into linopy. Both worse than solver_model.*.

My benchmark comment's "B+clear" route was exploratory, not a request for the wrapper. Happy to leave basis-management outside this PR and revisit only if a clear cross-solver use case emerges.

Suggested docs note for doc/release_notes.rst (or wherever the persistent-solver section lands):

Resetting solver state. Solver.update() preserves the warm-started basis / MIP state by default. Solvers expose basis-, MIP-, scaling-, and iterate-reset under different names; linopy does not unify them. If a re-solve benefits from discarding internal state, call the native solver directly via Model.solver.solver_model. Examples: HiGHS m.solver.solver_model.clearSolver(); Gurobi m.solver.solver_model.reset(1); Xpress m.solver.solver_model.postsolve(). See each solver's API docs for the full menu.

LGTM to defer the unified API: additive later, non-breaking.

@FBumann
Copy link
Copy Markdown
Collaborator

FBumann commented May 25, 2026

@FabianHofmann first round of my Review:

Architecture

A1. _coef_dirty is a correctness footgun glued to a perf shortcut.
linopy/persistent/diff.py:401-402 enables skip_coef_compare = same_model and not coef_dirty, and from_snapshot(same_model=True) is the default (diff.py:346). So the flag is trusted: if it says nothing changed, the CSR walk is skipped.

Through any supported mutation path — the existing setters at constraints.py:1122,1136,1180, and the new Constraint.update(...) in #727 (which sets the flag from one place; setters become shims) — this is precise: True iff coefs were touched. Through c.coeffs.values[...] = ... (or any in-place numpy mutation) the flag is stale, and same_model=True means the diff silently misses it. Backend keeps stale coefs. Silent wrong result.

Two coherent positions:

  • A. Keep same_model=True default. .values[...] is documented as unsupported; users who hold up their end get correctness and the optimization. Risk: silent wrong diff on any unsupported mutation.
  • B. Default same_model=False. The flag becomes an opt-in perf hint. Cost: extra CSR walk per same-model warm update. Benefit: footgun closed regardless of user discipline.

Recommendation: B is the safer default. Cross-model paths already use False; in the same-model path the perf cost is small. Solver-aware callers who promise to use .update() exclusively can opt back into the optimization with same_model=True.

A2. Read-only ModelSnapshot.capture mutates the model.
snapshot.py:159-161 walks every Constraint and resets _coef_dirty=False. capture reads as a pure observation, but it isn't — anyone calling it for inspection (debug, tests) silently changes the model. The dirty-clear belongs in the solver-side bookkeeping (Solver._update_locked after a successful apply), not in the snapshot constructor.

A3. from_snapshot vs from_models is two parallel walkers.
diff.py:341-525 — the two classmethods share ~80% of their body (structural checks, per-container var/con/objective walks, builder finalization). They differ only in whether the baseline buffers come from a stored ModelSnapshot or are freshly _extract_*'d from model_a. Factor out a single _diff(get_var_base, get_con_base, get_obj_base, same_model) core; the two public entry points become 10-line adapters.

A4. Per-backend apply_update reinvents the dispatch frame 4×.
solvers.py:1404, 1891, 2455, 3183 — each backend has the same seven blocks in the same order (var_bounds, var_type (+binary bound re-clamp), con_rhs, con_sign, con_coef, obj_c, obj_sense), each guarded by if diff.<...>.size:. The only thing that varies is the API call. A template method in Solver iterating sections plus per-backend hooks (_apply_var_bounds, _apply_coefs, …) collapses ~600 lines and removes drift risk.

A5. Lock is too coarse.
solvers.py:704 holds self._lock for the entire solve(...) body, including the actual native solve. This serializes solves on the same Solver instance for the full solver runtime. Protecting solver_model/snapshot mutation is correct; holding through _run_direct for minutes is not. Narrow to the build/diff/apply window, then drop the lock around the run.

A6. Snapshot is the wrong unit of caching.
_update_locked re-captures the entire snapshot after every successful in-place update (solvers.py:811). If you're tracking updates so each iteration is fast, paying an O(nnz_total) re-capture per iteration defeats the purpose. Either advance the snapshot incrementally from the applied diff (cheap — the diff already contains the new values + indices), or recompute lazily only when a diff is actually requested.


DRY

D1. from_snapshotfrom_models (see A3). The biggest DRY hit in the new code.

D2. _HIGHS_VTYPE_MAP lazy-init pattern (solvers.py:1402, 1469-1477). cls._HIGHS_VTYPE_MAP or self._init_highs_vtype_map() runs on every apply_update. A @functools.cache classmethod returning the map is one method, no class-state mutation, no per-call truthiness check.

D3. Xpress .astype(np.int64, copy=False).tolist() repeated 7× in apply_update (lines 2496-2517). A _to_int_list(arr) helper or pre-converting once would clean it up.

D4. "Set binary integrality and then re-clamp bounds to [0,1]" is hand-rolled in HiGHS (1433-1442), Xpress (2485-2492), MOSEK (3221-3225). Lift to a Solver utility — the rule is solver-independent (it's linopy semantics, not a backend quirk).

D5. _extract_var_buffers / _extract_con_buffers are imported privately from snapshot.py by diff.py (diff.py:12-21). When two modules share private helpers, the helpers belong in a shared internal module, not the one that happened to introduce them.

@FBumann
Copy link
Copy Markdown
Collaborator

FBumann commented May 25, 2026

Test coverage gaps — RebuildReason audit

Reason Covered?
NONE
STRUCTURAL_CONTAINERS ⚠️ only in in {LABELS, CONTAINERS} sets
STRUCTURAL_LABELS ⚠️ same; per-container branch (diff.py:557,606) unreachable from public API
COORD_REINDEX
SPARSITY
QUAD_OBJ (diff.py:702) ❌ no tests
BACKEND_REJECTED

Gaps worth fixing:

  1. QUAD_OBJ has zero tests — both directions (linear→quad, quad→linear) are reachable. Without coverage, QP users risk silent miscompare.
  2. Loose in {…} assertions in test_sparsity_change_triggers_rebuild, test_cross_model_sparsity_change_rebuilds, test_cross_model_structural_mismatch_rebuilds — tighten to is so future refactors of check ordering get caught.
  3. disallow_rebuild=True only tested for one reason (STRUCTURAL_CONTAINERS). The contract is "every rebuild trigger raises" — add tests for SPARSITY, COORD_REINDEX, QUAD_OBJ, BACKEND_REJECTED.
  4. from_models negative paths uncovered — only the happy path is tested. Mirror the from_snapshot cases (structural/sparsity/quad/coord) since the buffer-extraction path is independent.
  5. Dead code or missing test at diff.py:557 and :606 — the per-container STRUCTURAL_LABELS returns can't fire after the global label check at :372-377 short-circuits. Either delete (replace with assert) or write the test that justifies them.

@MaykThewessen
Copy link
Copy Markdown
Contributor

@FBumann Strong review. Adding bench-grounded color on A1 / A5 / A6 since my numbers are the ones being cited.

A1: _coef_dirty default

Agree with Option B (default same_model=False, flag as opt-in perf hint). One framing nit: the "~50-200 ms at Mayk-scale, small relative to solve time" line under-sells the cost. At the 1.14 M × 1.97 M production size, the apply phase is 0.16 s; a 50-200 ms CSR walk would be 30-125% of apply, not 30-125% of solve. Still small absolute, and still the right call. But if the trade-off is sold as "free", reviewers will be surprised when the apply-phase profile shows the walk dominating once a real solver loop runs hundreds of iterations.

Concrete: I'd take B plus a same_model=True opt-in documented as "caller asserts only .update() / typed setters were used; .values[...] invalidates this guarantee." Solver-aware orchestrators (the use case driving #718) will set it; ad-hoc users get correctness.

A2: ModelSnapshot.capture mutating _coef_dirty

Agree. Moving the clear into Solver._update_locked after a successful apply is correct on two grounds: (1) snapshot becomes a pure observation; (2) the clear becomes coupled to "the solver actually saw these coefs", which is what the flag is supposed to mean.

A3 / A4 / D-series

No bench-side input. Agree the DRY hits are worth taking. A4 in particular: four parallel 7-block dispatch bodies will drift the moment one backend grows an 8th section.

A5: lock around full solve

Confirmed real at production scale. My weekly-chunk run holds the solve phase for minutes; with the current solvers.py:704 lock width, any concurrent solve() on the same Solver blocks for the full run. Persistent-update orchestrators (the use case here) are exactly the ones likely to want concurrent solves on sibling models or pipelined iterations. Narrowing the lock to build/diff/apply is the right scope.

A6: re-capture after every successful update

This one bites at production size. _update_locked re-capturing the full snapshot at solvers.py:811 is O(nnz_total) per iteration. At 1.14 M × 1.97 M with ~6× nnz/row, that's tens of millions of cells re-read per chunk just to re-establish a baseline the diff already knew how to advance. The diff carries (indices, new_values) for every mutated buffer; applying that delta to the held snapshot is O(|diff|), which is what the persistent path is for. Strong +1 to "advance the snapshot incrementally from the applied diff." Lazy recompute on next diff() is a fine fallback if incremental advance is awkward for some backend.

Test gaps

All five gaps look right. QUAD_OBJ is the one I'd treat as blocker: QP users are exactly the audience for #718 (warm-starting a barrier solve from a previous iterate), and silent miscompare on a quad to linear flip would be hard to diagnose downstream.

The disallow_rebuild=True coverage gap is structural: the contract is "every rebuild trigger raises", but the test only proves one of six does. A parametrized test over RebuildReason enum members closes it in five lines.

Happy to take A6 (incremental snapshot advance) as a follow-up PR after #718 lands if Fabian wants to keep this one focused; the API surface doesn't change, only the body of _update_locked.

@MaykThewessen
Copy link
Copy Markdown
Contributor

Correction to my A1 framing: I overshot.

Microbench of the actual compare path (np.array_equal on labels/indptr/indices + new_buf.data != base_buf.data + the row_value_changed build at diff.py:624-629), numpy 2.4.6, M1, median of 7:

scale n_rows nnz extra cost (same_model=False vs True)
synth-69k 69 k 0.2 M 0.5 ms
mid-300k 300 k 1.5 M 3.2 ms
prod-1.97M @ 6 1.97 M 11.8 M 34.9 ms
prod-1.97M @ 10 1.97 M 19.7 M 54.6 ms

So at the realistic production size (6 nnz/row), the extra walk is ~35 ms, not the 50-200 ms my last comment implied as the floor. That's ~22 % of a 160 ms apply phase, or ~0.3 % of a 12 s solve. Felix's original "small relative to solve time" framing was right; my "30-125 % of apply" was the worst-case interpretation, not the typical one.

Doesn't change the recommendation (still Option B: default same_model=False, flag as opt-in perf hint). Just wanted the numbers on the record so reviewers don't price the trade-off at 3-5× what it actually costs.

Bench script available on request, or I can drop it in benchmark/ if useful.

FBumann added a commit that referenced this pull request May 25, 2026
Closes the A1 residual from the #718 review. The flag-trust path
(`skip_coef_compare = same_model and not coef_dirty`) is correct
through Constraint.update() (set in one place, shims forward), but
`c.coeffs.values[...] = ...` still bypasses _coef_dirty. With
same_model=True as the default, that bypass silently produces wrong
diffs.

Flip the default to False. Cross-model paths (the only production
caller, Solver._update_locked, passes explicitly) are unaffected.
Same-model warm-update paths now value-diff the CSR data — small
perf hit (50-200ms at Mayk-scale per Mayk's bench), correct by
default. Solver-aware callers who own the mutation contract can
opt back into the optimization with `same_model=True`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coroa
Copy link
Copy Markdown
Member

coroa commented May 26, 2026

Have not reviewed the code yet, just read the comments during some spare time, but my position is always to try to have perf by default:

A. Keep same_model=True default. .values[...] is documented as unsupported; users who hold up their end get correctness and the optimization. Risk: silent wrong diff on any unsupported mutation.
B. Default same_model=False. The flag becomes an opt-in perf hint. Cost: extra CSR walk per same-model warm update. Benefit: footgun closed regardless of user discipline.

So, i'd argue for the footgun A, but put the security mechanism on by flagging the numpy arrays as read only:

a = np.array([1, 2, 3])
a.flags.writeable = False
a[0] = 99  # ValueError: assignment destination is read-only

@FBumann
Copy link
Copy Markdown
Collaborator

FBumann commented May 26, 2026

@coroa I wrote some comments about the user api here: #727

In short:

We could skip the user parameter entirely by storing a weakref to the source model on ModelSnapshot.

Together with your suggestion about the writable flag this would be very safe, convenient and uses the performant branch whenever possible.

@FBumann
Copy link
Copy Markdown
Collaborator

FBumann commented May 26, 2026

@coroa I asked Claude code** about potential gaps of the writeable flag:

  1. Dask.flags.writeable is numpy-only; dask-backed buffers ignore it. Inconsistent enforcement on mixed-backend models.
  2. .view() escapev = arr.view(); v.flags.writeable = True; v[...] = 99 re-arms the protection. Stops casual mutation, not deliberate.
  3. Lifecycle — needs an explicit thaw on every rebuild / close / pickle round-trip path. Easy to miss one.
  4. Internal paths — every linopy internal that touches .values[...] during a frozen window breaks (sanitize_zeros, SOS reformulation, …). Needs an audit.
  5. Error UXValueError: assignment destination is read-only doesn't explain why or how to fix.
  6. Doesn't cover cross-model — orthogonal to the from_snapshot(snap_from_m1, m2) footgun.

Feels 80% safe, but might produce rough edges in our internals and many places where we have to reassign this flag.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants