Draft
558 commits
30171f8
feat(bench): integrate NDIter into run_benchmark.py (Option B); +10M …
Nucs Jun 13, 2026
62c2ef1
bench(nditer): refresh canonical sheet with 10M tier; AV->NA proven o…
Nucs Jun 13, 2026
3e23a18
docs(website): add Benchmarks vs NumPy page driven by the auto-commit…
Nucs Jun 13, 2026
829f645
docs(bench): richer two-card story + render full reports into the Doc…
Nucs Jun 13, 2026
81622de
fix(bench): op-matrix report ranked measurement artifacts as the "Top…
Nucs Jun 13, 2026
4dc6f13
docs(pr): amend PR #611 changelog with the post-changelog wave (Waves…
Nucs Jun 13, 2026
bc96f6c
docs(bench): prototype a dense, numbers-first NumPy-vs-NumSharp dashb…
Nucs Jun 13, 2026
f233a50
fix(net8.0): NumPy-correct complex abs and axis min/max NaN propagation
Nucs Jun 13, 2026
fbf254f
fix(bench): align the dashboard to the house NS/NP convention (<1× = …
Nucs Jun 13, 2026
10c27b2
docs(pr): represent final state — drop roadmap-wave/phase framing fro…
Nucs Jun 13, 2026
0a252df
fix(bench): dashboard back to NP/NS speedup + add 🕐 %NumPy time-share…
Nucs Jun 13, 2026
df6dcfa
bench: one convention everywhere — NP/NS speedup + 🕐 %NumPy time-share
Nucs Jun 13, 2026
8de714f
bench: stick the 🕐 after the % (NN%🕐), drop the leading-gap "🕐 NN%"
Nucs Jun 13, 2026
866fdeb
bench: %NumPy is ALWAYS a percentage — drop the 880×NP / 880× compact…
Nucs Jun 13, 2026
5f29b21
bench(dashboard): read the merge's canonical ratio/pct — stop driftin…
Nucs Jun 13, 2026
bdf4fbf
bench(merge): canonicalize 3 op-name aliases — recover 10 falsely-⚪ "…
Nucs Jun 13, 2026
fe986ec
feat(complex): implement sinh/cosh/tanh/arcsin/arccos/arctan for Comp…
Nucs Jun 13, 2026
31f26b3
perf(complex): inline-friendly hot/cold split for the complex transce…
Nucs Jun 13, 2026
8084e8f
fix(memory): dispose owned intermediates in np.isclose and np.random.…
Nucs Jun 13, 2026
d799b11
fix(complex): port NumPy's own algorithms for complex unary math (par…
Nucs Jun 13, 2026
87f83fc
bench: close the op-matrix coverage gaps — add missing defs + full re…
Nucs Jun 13, 2026
044255e
fix(complex): reject narrowing dtype= on complex float-ufuncs (was a …
Nucs Jun 14, 2026
a202ae7
perf(creation): np.zeros via calloc + Windows VirtualAlloc demand-zer…
Nucs Jun 14, 2026
ddb0fa9
fix(exp2): correct malformed float32-output IL kernel (was InvalidPro…
Nucs Jun 17, 2026
bfcb4a4
fix(power): Half exponent no longer throws InvalidCastException (W1-B)
Nucs Jun 17, 2026
99d500d
fix(maximum/minimum/clip): propagate NaN through the clip SIMD kernel…
Nucs Jun 17, 2026
8df1046
fix(maximum/minimum/clip): correct F-contiguous/strided element pairi…
Nucs Jun 17, 2026
6cb9b80
perf(clip): aggressively inline/optimize the per-element extrema helpers
Nucs Jun 17, 2026
b967b4e
perf(kernels): aggressively inline/optimize the remaining per-element…
Nucs Jun 17, 2026
bc78f43
perf(kernels): complete AggressiveInlining|AggressiveOptimization on …
Nucs Jun 17, 2026
369ea36
perf: extend AggressiveInlining|AggressiveOptimization to all small h…
Nucs Jun 17, 2026
d1a2bc2
bench(complex-reduce): POC suite diagnosing + benchmarking complex128…
Nucs Jun 17, 2026
94c68fb
docs(reduce): NDIter reduction parity + fusion execution plan, with d…
Nucs Jun 17, 2026
a91f32c
perf(complex): NDIter axis reductions — fix complex mean (15–45×→pari…
Nucs Jun 17, 2026
1736d4c
perf(half/decimal): NDIter axis reductions — Decimal all-ops 5–13×, H…
Nucs Jun 17, 2026
782cdac
perf(reduce): transposed/strided axis reductions to parity-or-better …
Nucs Jun 17, 2026
cac7867
feat(evaluate): axis-aware fused reductions — evaluate(Sum(a*b, axis:…
Nucs Jun 17, 2026
8bf7c3a
docs(reduce): record Phase 5b/6 as skipped — premise-invalidated by m…
Nucs Jun 17, 2026
4a43ebb
perf(reduce): Phase 6 step 1 — migrate Double Sum/Mean to per-chunk S…
Nucs Jun 17, 2026
c82b857
docs(reduce): record parallel-reduce proof (2-6x vs NumPy) + decision…
Nucs Jun 18, 2026
e7bfc08
fix(complex): flat min/max return the NaN-bearing element verbatim (N…
Nucs Jun 18, 2026
443e162
feat(reduce): pairwise summation on NDIter — bit-exact NumPy sum/mean…
Nucs Jun 18, 2026
8e90ca2
perf(reduce): IL-emit SIMD pairwise sum per dtype — bit-exact NumPy, …
Nucs Jun 18, 2026
2281640
perf(reduce): IL-emit Vector128 pairwise sum for Complex128 — bit-exa…
Nucs Jun 18, 2026
07756b0
fix(reduce): float16 axis sum accumulates in float32, not float16 (Nu…
Nucs Jun 18, 2026
e14f1db
fix(reduce): bool axis min/max identity + NDIter reduce Shape.offset …
Nucs Jun 18, 2026
6907a6b
bench(reduce): add reduction × layout × dtype × op parity matrix (NDI…
Nucs Jun 18, 2026
0ce905b
perf(reduce): SIMD-route non-C-contiguous int-widening axis sum (kill…
Nucs Jun 19, 2026
20897e4
perf(reduce): materialize broadcast views in flat reduction (bcast-re…
Nucs Jun 19, 2026
b892d0c
bench(elementwise): add op × layout × dtype matrix to probe non-conti…
Nucs Jun 19, 2026
6678e27
bench(reduce): refresh layout×dtype matrix after the reduction perf f…
Nucs Jun 19, 2026
504a813
bench(copy): NDIter copy path vs NumPy across all 15 dtypes × 7 layouts
Nucs Jun 19, 2026
6cb21e8
docs(claude): canonical Performance Convention — NPY/NS, >1 = NumShar…
Nucs Jun 19, 2026
60a5b5a
perf(reduce): bool/char/half min/max-along-axis — kill the scalar dou…
Nucs Jun 19, 2026
bda8d9d
perf(reduce): fold broadcast axes in place instead of materializing (…
Nucs Jun 19, 2026
a99b074
perf(copy): strided/broadcast same-type clone for the Vector-less dty…
Nucs Jun 19, 2026
edb329f
perf(reduce): flat char/half min/max — reuse the per-dtype trick on t…
Nucs Jun 19, 2026
cde7b09
fix(reduce)+perf: half flat sum/prod/mean — boxing-free contiguous sc…
Nucs Jun 19, 2026
5aef842
test(reduce): adversarial broadcast-reduce sweep — fold verified bug-…
Nucs Jun 19, 2026
a08d663
perf(cast): typed strided cross-dtype cast for the Vector-less dtypes…
Nucs Jun 19, 2026
eeae1d7
perf(reduce): migrate non-contig Half/Complex reduction fallbacks fro…
Nucs Jun 19, 2026
25f92fa
perf(unary): float16 negate via sign-bit flip, not the BCL float roun…
Nucs Jun 19, 2026
962b893
fix(reduce): complex nansum axis reduction read uninitialized memory …
Nucs Jun 19, 2026
10f5c91
perf(cast): IL-emitted scalar cast kernel — direct `call Converts.ToX…
Nucs Jun 20, 2026
b379e32
docs(plan): final cast-optimization plan — beat NumPy at every execution
Nucs Jun 20, 2026
117ce34
docs(plan): cast entry-point unification — route buffered + assignmen…
Nucs Jun 20, 2026
39b094b
bench(cast): Phase 0 full-matrix discovery — astype 15×8×15 sweep rep…
Nucs Jun 20, 2026
44cfbaa
docs(cast-plan): PROVE float→narrow-int kernel by benchmark — correct…
Nucs Jun 20, 2026
0fca7e1
perf(reduce)+fix: Half/Complex/bool flat reductions via struct-generi…
Nucs Jun 20, 2026
9d043fe
perf(cast): SIMD float->narrow-int (cvtt+truncating-Narrow) — kills P…
Nucs Jun 20, 2026
ddf8080
docs(cast-plan): PROVE all 5 remaining cast cliff families by benchma…
Nucs Jun 20, 2026
8ae34d1
perf(cast): SIMD {int,float}->bool (!=0 compare) — Phase-0's worst ds…
Nucs Jun 20, 2026
4bfad24
perf(broadcast): make np.broadcast(...).iters lazy — no eager iterato…
Nucs Jun 20, 2026
d4d75e4
docs(cast-plan): ROOT-CAUSE + prove the last routing cliff (same-type…
Nucs Jun 20, 2026
284de6f
perf(cast): vectorized f16->{bool,i8,u8,i16,u16,char,i32} via bit-fid…
Nucs Jun 20, 2026
6c7cbb9
revert(cast): remove this session's cast SIMD kernels — superseded by…
Nucs Jun 20, 2026
1d9f076
refactor(iterators): retire legacy NDIterator — [Obsolete] tombstones…
Nucs Jun 20, 2026
b063454
test(broadcast): NumPy-parity coverage for np.broadcast(...).iters ac…
Nucs Jun 20, 2026
4fd1b2d
feat(broadcast): np.broadcast accepts 0..64 operands (NumPy parity), …
Nucs Jun 20, 2026
73890d2
feat(broadcast): live index cursor + iteration + reset() — align np.b…
Nucs Jun 20, 2026
668f5e5
perf(cast): SIMD float->narrow-int + complex->int (Waves 1-2) — kill …
Nucs Jun 20, 2026
4f0d5ee
perf(cast): SIMD Half->int via Giesen bit-fiddle widen (Wave 3)
Nucs Jun 20, 2026
7e479b5
perf(cast): SIMD {int,float,half,char}->bool via !=0 compare (Wave 4)
Nucs Jun 20, 2026
724ce61
perf(broadcast): scalar-broadcast same-type clone via fast fill, not …
Nucs Jun 20, 2026
d9ecb75
bench(cast): post-waves full matrix re-run — 716 lagging cells -> 461…
Nucs Jun 20, 2026
f113660
feat(broadcast): drop the 64-operand cap — match NumSharp's unlimited…
Nucs Jun 20, 2026
ab76957
test(broadcast): prove np.broadcast scales to N operands like NDIter …
Nucs Jun 20, 2026
18770db
perf(cast): fused VPGATHER whole-array kernels for f32/f64->narrow st…
Nucs Jun 20, 2026
b2fe47e
perf(cast): fused VPGATHER whole-array kernels for {f32,f64,i32,u32,i…
Nucs Jun 20, 2026
096c1ca
bench(cast): refresh cast-matrix scoreboard after Wave 7 fused-gather…
Nucs Jun 20, 2026
3cd5ca5
perf(cast): fused VPGATHER whole-array kernel for f32->i32 strided (W…
Nucs Jun 20, 2026
05c890a
bench(cast): refresh scoreboard after Wave 7c (f32->i32 strided)
Nucs Jun 20, 2026
5064aa0
refactor(iterators): delete NDIterator entirely — it was fully dead code
Nucs Jun 20, 2026
2b401e1
perf(cast): fused VPGATHER whole-array kernels for int->narrow stride…
Nucs Jun 20, 2026
87a921b
perf(cast): drop stale signed->UInt64 SIMD-widen rejection (Wave 9)
Nucs Jun 20, 2026
2a6d540
docs(website): refocus iterator docs on NDIter; drop deleted NDIterat…
Nucs Jun 20, 2026
bdf964a
perf(cast): gather-real deinterleave for c128->narrow inner-strided (…
Nucs Jun 20, 2026
4b48416
perf(cast): SIMD Giesen float->f16 narrow for {bool,u8,i8,i16,u16,cha…
Nucs Jun 20, 2026
130665b
perf(cast): SIMD u32/f64/c128 -> f16 — finish the f16 column except i…
Nucs Jun 20, 2026
9158447
docs: remove 10 stale/superseded planning & handover docs + fix dangl…
Nucs Jun 20, 2026
79c8430
perf(cast): single-pass KEEPORDER same-type copy — kill the F-contig …
Nucs Jun 20, 2026
1ef0854
docs(NDIter): correct stale bug status, add memory-overlap + adoption…
Nucs Jun 20, 2026
2ef24b5
docs(CLAUDE.md): retarget to current-state snapshot, drop migration n…
Nucs Jun 20, 2026
b4cb76d
perf(cast): SIMD char->{i8,u8} contiguous narrow — close the generic …
Nucs Jun 20, 2026
ddf0921
docs(NDIter): make the page a final-state reference (drop bug list + …
Nucs Jun 20, 2026
e8c80a5
perf(cast): SIMD complex128 -> bool via deinterleave + nonzero compar…
Nucs Jun 20, 2026
09e978b
bench(cast): refresh cast-matrix scoreboard at HEAD (Waves 8-14)
Nucs Jun 20, 2026
ec3386c
docs(CLAUDE.md): second current-state pass — deleted class, random su…
Nucs Jun 20, 2026
9c325d9
docs(NDIter): second final-state pass — neutral wording + GROWINNER a…
Nucs Jun 20, 2026
f34bfca
docs(CLAUDE.md): third pass — close API-list completeness gaps, fix t…
Nucs Jun 20, 2026
f8de515
perf(cast): bit-exact AVX2 f32/f64 -> u32 (Wave 15) — 16 cells 0.46-0…
Nucs Jun 20, 2026
5010263
perf(cast): route f16->u32 and c128->u32 through the AVX2 f64/f32->u3…
Nucs Jun 20, 2026
b551ddc
perf(cast): bit-exact vectorized f16 -> f32 widen (Wave 15c) — 8 cell…
Nucs Jun 20, 2026
fc3ccac
feat(bench): promote layout/cast/fusion harnesses from poc into run_b…
Nucs Jun 20, 2026
81411b6
perf(cast): bit-exact AVX2 {f16,f32,f64,c128} -> u64 (Wave 15d)
Nucs Jun 20, 2026
d8e4e32
perf(cast): bit-exact vectorized f16 -> i64 (Wave 15e) — 8 cells 0.73…
Nucs Jun 20, 2026
08af8d4
bench(cast): fix UTF-8 piping in bench_common + refresh scoreboard at…
Nucs Jun 20, 2026
2214ad9
docs(cast): record Wave 15 cast kernels — float/c128->u32/u64, f16->f…
Nucs Jun 20, 2026
3c5af96
docs(cast): continuation plan for the remaining 133 <0.9 cells
Nucs Jun 20, 2026
c8bf260
bench(layout): POC operand/extra layout classes + harmonize subsystem…
Nucs Jun 20, 2026
0e6e437
bench(operand): promote operand/broadcast layout POC into a run_bench…
Nucs Jun 20, 2026
9e1ad2c
docs(NDIter): third final-state pass — accuracy fixes in Iteration Me…
Nucs Jun 20, 2026
e7ac0fd
bench(fusion): add operand-layout sweep to the fusion gate (C/F/T/str…
Nucs Jun 20, 2026
32353c3
perf(cast): SIMD deinterleave/reverse for same-type sub-word strided …
Nucs Jun 20, 2026
7bfbc15
perf(cast): extend SubwordCopy to same-size cross-type bit reinterpre…
Nucs Jun 20, 2026
c95a752
perf(evaluate): fix np.evaluate F-order/transpose cliff (~15x) via F-…
Nucs Jun 20, 2026
1f42246
bench: review fixes — stale operand labels + fusion out= apples-to-or…
Nucs Jun 20, 2026
4a77991
perf(cast): SIMD 2-byte-int -> 1-byte/bool strided narrowing (Wave 16c)
Nucs Jun 20, 2026
ba97ce4
docs(cast): record Wave 16 sub-word strided SIMD shuffles (CLAUDE.md …
Nucs Jun 20, 2026
395bd15
bench(cast): refresh scoreboard at HEAD (clean foreground run, Wave 1…
Nucs Jun 20, 2026
f21876c
perf(cast): SIMD 1-byte-int -> 2-byte strided widening (Wave 16d)
Nucs Jun 20, 2026
7697ca2
perf(reduce): collapse broadcast axes algebraically in flat reduction…
Nucs Jun 20, 2026
f9b8c6b
feat(sort): implement np.sort + np.argsort + ndarray.sort on NDIter w…
Nucs Jun 20, 2026
db7ae65
bench(cast): refresh scoreboard at HEAD incl. Wave 16d widening (repr…
Nucs Jun 20, 2026
4134bd6
docs(cast): record Wave 16d 1B->2B widening (CLAUDE.md + plan docs)
Nucs Jun 20, 2026
91a628b
perf+fix(any/all): SIMD bool/char reductions + fix byte/sbyte any() A…
Nucs Jun 20, 2026
9df5231
fix(reduce): root-fix NDIter strided sub-word accumulator drift; drop…
Nucs Jun 20, 2026
9ee4b46
perf(cast): SIMD i64/u64 -> f16 via clamp-sentinel + low-32 pack (Wav…
Nucs Jun 20, 2026
3e9ff65
perf(cast): kill the -1/2-stride gather in float/c128 -> u64 strided …
Nucs Jun 20, 2026
0cf43c4
perf(cast): SIMD f16 -> bool strided via deinterleave/reverse + magni…
Nucs Jun 20, 2026
49cdcd0
docs(cast): Wave 17 scoreboard refresh + bucket A-E plan writeup
Nucs Jun 20, 2026
a3149ab
perf(cast): SIMD c128 -> bool negcol via deinterleave-reverse of real…
Nucs Jun 20, 2026
73a927d
fix(sort): eliminate O(N^2) blowup on 1-D / axis=None sort & argsort
Nucs Jun 21, 2026
133cbc8
perf(engine): route 7 strided ops through the kernel instead of mater…
Nucs Jun 21, 2026
cbaaa57
fix(math): NumPy-exact integer reciprocal 1/0 (+ bool->int8) and clip…
Nucs Jun 21, 2026
f74d9f8
fix(nditer): 0-dim op_axes iterates once, not N times (root of sort O…
Nucs Jun 21, 2026
34fc20b
perf(binaryop): fix scalar-broadcast fast-path miss for weak-scalar l…
Nucs Jun 21, 2026
56ccf40
docs: remove shipped plan/audit/POC scratch from the nditer branch
Nucs Jun 23, 2026
405a9f1
docs(pr): refresh PR #611 changelog to current branch state (HEAD d19…
Nucs Jun 23, 2026
283e997
refactor(kernels): delete the 24 [Obsolete(error:true)] dead methods …
Nucs Jun 23, 2026
02b14d0
refactor: remove 10 confirmed-dead private/protected methods
Nucs Jun 23, 2026
bf9764f
bench: regenerate full NumSharp-vs-NumPy report (op-matrix + all 5 su…
Nucs Jun 23, 2026
8b33b0f
bench(history): archive the 2026-06-23 full-run snapshot (e3b7c268)
Nucs Jun 24, 2026
ca3e674
bench(history): committable history/<date>_<sha> snapshots + `latest`…
Nucs Jun 24, 2026
d331dda
Add NumPy-NumSharp string compatibility matrix
Nucs Jun 24, 2026
0a69a45
fix(complex): log1p(nan ± inf·i) → (+inf, nan) — port NumPy nc_log1p …
Nucs Jun 24, 2026
d8fbe91
chore: remove 2 loose dotnet-run scripts + 6 dev-path docs, scrub dan…
Nucs Jun 24, 2026
4948439
chore: remove benchmark/poc/ scratch tree (146 files) + scrub all pat…
Nucs Jun 24, 2026
ce31d99
refactor(fuzz): move oracle/ -> test/oracle/ (differential-fuzz corpu…
Nucs Jun 25, 2026
7865f9b
docs(claude): document the differential-fuzz pipeline + test/oracle/ …
Nucs Jun 26, 2026
f798c5a
chore(csproj): remove dead <Compile Remove> leftovers from the Regen-…
Nucs Jun 26, 2026
e276be5
refactor(regen): purge the legacy Regen template engine — inline temp…
Nucs Jun 26, 2026
38f1e80
fix(kernels): macOS/ARM64 signed-zero + integer-widening reduction pa…
Nucs Jun 26, 2026
3427729
perf(cumsum): reimplement axis cumsum as NDIter-driven add.accumulate…
Nucs Jun 26, 2026
39455b7
test(cumsum): extensive NumPy 2.4.2 parity coverage + fix 4 edge-case…
Nucs Jun 26, 2026
5b094ca
perf(boolean-mask): unify get+set onto one NDIter gather/scatter (kil…
Nucs Jun 26, 2026
21e6ea5
fix(boolean-mask): 1-D length-1 mask is a normal mask, not 0-D; add e…
Nucs Jun 26, 2026
2932886
chore(cleanup): remove repo trash — tracked .bak, empty test stub, de…
Nucs Jun 26, 2026
ad0fb95
chore(docs): drop committed DocFX build output + update stale iterato…
Nucs Jun 26, 2026
126d819
feat(print): NumPy 2.4.2 byte-exact array printing for NDArray.ToStri…
Nucs Jun 26, 2026
982a2cd
test(open-bugs): un-flag 177 fixed OpenBugs reproductions into regula…
Nucs Jun 26, 2026
5e7b57d
chore(cleanup): delete stale OpenBugs.md registry (orphaned, referenc…
Nucs Jun 26, 2026
5abf04e
fix(abs): exact NumPy-2.4.2 dtype= parity for absolute (complex magni…
Nucs Jun 26, 2026
076ff3f
test(print): harden NumPy-parity coverage with edge cases (subnormals…
Nucs Jun 26, 2026
6cbbcab
test(print): add integer sign-mode, bool summarization, threshold-dis…
Nucs Jun 26, 2026
230c034
fix(indexing): combined boolean + advanced indexing (mask mixed with …
Nucs Jun 26, 2026
1027ed1
test(edgecases): fix stale int8-overflow OpenBugs assertions + drop t…
Nucs Jun 26, 2026
648322d
fix(simd): probe Vector{N}.Round/Truncate/Floor/Ceiling at runtime; u…
Nucs Jun 26, 2026
e55395e
test(abs): expand absolute parity coverage 14 -> 33 (NumPy 2.4.2-prob…
Nucs Jun 26, 2026
6e4356e
fix(simd): runtime+type-aware rounding SIMD gate for the fused np.eva…
Nucs Jun 26, 2026
9c2d5b2
feat(indexing): leading k-D mask + basic indexing; exhaustive combine…
Nucs Jun 26, 2026
d34cb68
docs(releases): backfill v0.4.0-alpha1 release notes
Nucs Jun 26, 2026
02f2e5e
test(indexing): commit boolean-mask + combined-index probe matrices a…
Nucs Jun 26, 2026
7b7ae47
docs(indexing): handover for the multiple-advanced-indices + slice ax…
Nucs Jun 26, 2026
bb220e3
docs(perf): handover for the remaining performance/pathing gaps
Nucs Jun 26, 2026
409aef9
feat(indexing): NumPy-correct axis placement for >=2 advanced indices…
Nucs Jun 26, 2026
610c618
perf(shift): route left_shift/right_shift through the unified binary …
Nucs Jun 26, 2026
6db4a91
perf(sort): insertion-sort fast path for short lines + dtype-aware sc…
Nucs Jun 26, 2026
81d80b5
refactor(cast): unify copy/retype/cast on NDIter, remove dead CastTo …
Nucs Jun 26, 2026
27e9a32
perf(shift): SIMD variable-shift inner loop via NDIter Tier-3B (strid…
Nucs Jun 26, 2026
017fc45
perf(sort): single-pass multi-histogram radix (read keys once, build …
Nucs Jun 26, 2026
efc518d
test(indexing): fix stale assertion + un-mark 3 broadcasted-source [O…
Nucs Jun 26, 2026
438b247
test(indexing): fix corrupted Masking_2D_over_3D (used a stray field …
Nucs Jun 26, 2026
2543f03
perf(reduce): SIMD f64/f32 min/max on broadcast/negative-stride axes …
Nucs Jun 26, 2026
5c802e9
test(indexing): strengthen no-op assertion in IndexNDArray_Get_Case4_…
Nucs Jun 26, 2026
dd3e7e7
refactor(kernels): centralize cache counters into public GeneratedDel…
Nucs Jun 26, 2026
74af66a
docs(sort): handover for IL-generated radix (#1) + AVX-512 vectorized…
Nucs Jun 26, 2026
735a7c1
fix(indexing): 0-D advanced index reduces its axis like a scalar int …
Nucs Jun 26, 2026
f40f689
refactor(kernels): extend GeneratedDelegates to every generated-kerne…
Nucs Jun 26, 2026
6ae9786
fix(indexing): recognize raw bool[] / bool[,] as a boolean mask (NumP…
Nucs Jun 26, 2026
0c20e2d
refactor(indexing): generalize bool-mask normalization via interfaces…
Nucs Jun 26, 2026
3d9e766
docs(indexing): handover plan to refactor get[]/set[] to 1-to-1 NumPy…
Nucs Jun 26, 2026
7ac3a1c
perf(reduce): raw Avx.Min/Max for contiguous f64/f32 flat min/max — 0…
Nucs Jun 26, 2026
cad6b5b
docs(indexing): lock int[] decision to (B) full-NumPy fancy; record b…
Nucs Jun 27, 2026
3f925c7
docs(perf): handover — raw Avx.Min/Max vs Vector256.Min/Max NaN-fixup…
Nucs Jun 27, 2026
b4c7158
refactor(indexing): raw int[]/long[] sole-index is FANCY (NumPy parit…
Nucs Jun 27, 2026
25ba412
fix(ufunc): direct binary maximum/minimum/fmax/fmin kernel; fix fmax/…
Nucs Jun 27, 2026
6479c10
perf(clip): lean scalar-bound dispatch + 4x-unrolled SIMD kernel
Nucs Jun 27, 2026
4c72ae5
test(indexing): extensive getter/setter NumPy-parity matrices + proof…
Nucs Jun 27, 2026
ba13eb8
perf(reduce): raw Avx.Min/Max in SimdMinMaxSameType (drop redundant N…
Nucs Jun 27, 2026
e25f038
docs(perf): record T1-T4 outcomes + the allocation-is-the-real-lever …
Nucs Jun 27, 2026
290017b
fix(indexing): accept int8 (SByte) fancy-index dtype — NumPy parity
Nucs Jun 27, 2026
eb105f7
fix(indexing): accept uint64 (ulong) scalar index — NumPy parity
Nucs Jun 27, 2026
d92a676
feat(indexing): support ITuple and IEnumerable index inputs — NumPy/C…
Nucs Jun 27, 2026
ddcfb69
fix(indexing): validate per-axis integer OOB in coordinate access (Nu…
Nucs Jun 27, 2026
956e256
fix(indexing): validate fancy negative-OOB index (memory-safety, NumP…
Nucs Jun 27, 2026
2ebe451
fix(indexing): broadcast value into >=2-D fancy subspace on assignmen…
Nucs Jun 27, 2026
f54ec3c
fix(indexing): single fancy index on a non-contiguous view now gather…
Nucs Jun 27, 2026
ad7c41a
fix(indexing): broadcast multi-fancy index arrays by shape, not size …
Nucs Jun 27, 2026
47c1a1d
fix(indexing): 0-d boolean combined with basic indices keeps source a…
Nucs Jun 27, 2026
065d0c5
fix(indexing): raise on over-indexing with slices (too many indices, …
Nucs Jun 27, 2026
f27a17b
fix(indexing): validate broadcast on assignment, reject partial write…
Nucs Jun 27, 2026
f3f145c
test(indexing): un-mark [OpenBugs] Compare (boolean-mask scalar set n…
Nucs Jun 27, 2026
8be4c83
fix(indexing): SetData(NDArray, long[]) broadcasts + validates like t…
Nucs Jun 27, 2026
dbbbd52
fix(indexing): reject too-many advanced indices instead of OOB (memor…
Nucs Jun 27, 2026
359e21f
docs(indexing): handover for full combinatorial advanced-indexing parity
Nucs Jun 27, 2026
a9399f5
test(indexing): commit differential index oracle as a FuzzMatrix gate…
Nucs Jun 27, 2026
77a154e
fix(indexing): bounds-guard the fancy gather/scatter block copies + o…
Nucs Jun 27, 2026
0f76b7e
feat(indexing): port NumPy prepare_index as the up-front validation g…
Nucs Jun 27, 2026
1a294dc
fix(slice): negative-step out-of-range start clamps to -1 (empty), not 0
Nucs Jun 27, 2026
3c803d2
fix(indexing): unify single+multi advanced gather through the grid; n…
Nucs Jun 27, 2026
8aa0aee
fix(indexing): empty index array of any dtype is an empty integer fan…
Nucs Jun 27, 2026
16b5488
fix(indexing): empty value into a non-empty whole-array region raises…
Nucs Jun 27, 2026
61e9657
feat(indexing): 0-d bool joins the advanced block in the grid (HAS_0D…
Nucs Jun 27, 2026
be40609
fix(indexing): setter pure-basic slice path was missing a return (fel…
Nucs Jun 27, 2026
d4b500f
fix(indexing): ellipsis must not count a 0-d bool as an axis-consumin…
Nucs Jun 27, 2026
b0e8414
fix(indexing): non-subshaped fancy assignment validates value broadca…
Nucs Jun 27, 2026
6540d69
fix(indexing): negative scalar-index assignment wrote out of bounds (…
Nucs Jun 27, 2026
61623af
docs(indexing): record handover execution progress (697 -> 64 diverge…
Nucs Jun 27, 2026
ad4a307
fix(indexing): route ALL advanced tuples through the grid, incl. pure…
Nucs Jun 27, 2026
05337cb
docs(indexing): expand handover remaining-work into an actionable con…
Nucs Jun 27, 2026
17c0f21
fix(indexing): value must broadcast to EMPTY/scalar selection on assi…
Nucs Jun 27, 2026
4af9860
fix(indexing): a 0-d (scalar) array rejects axis-consuming single ind…
Nucs Jun 27, 2026
3720ba2
fix(indexing): empty advanced indices gather an empty result, not throw
Nucs Jun 27, 2026
835b23c
fix(indexing): basic assignment into an EMPTY slice selection is a no…
Nucs Jun 27, 2026
612723a
fix(indexing): non-consecutive 0-d-bool placement + a Shape.Broadcast…
Nucs Jun 27, 2026
d4a07fb
test(indexing): pin the combinatorial-parity fixes; fix multi-dim emp…
Nucs Jun 27, 2026
2591e63
docs(indexing): record 697->0 divergences; R3 teardown OOB is the sol…
Nucs Jun 27, 2026
301229b
refactor(iterators): rename the Npy* stack to ND* (NDIter/NDExpr/…)
Nucs Jun 27, 2026
d08c296
chore(blame): make .git-blame-ignore-revs functional + ignore the Npy…
Nucs Jun 27, 2026
1 change: 1 addition & 0 deletions .agents/skills/np-function/SKILL.md
1 change: 1 addition & 0 deletions .agents/skills/np-tests/SKILL.md
256 changes: 199 additions & 57 deletions .claude/CLAUDE.md

Large diffs are not rendered by default.

191 changes: 191 additions & 0 deletions .claude/commands/np-function.md
@@ -0,0 +1,191 @@
---
name: np-function
description: Implement a NumPy np.* function in NumSharp with full API parity, optimizations, and variation coverage (NumPy 2.4.2 source of truth).
argument-hint: <np.function_name or description>
---

When the user requests /np-function, follow these instructions carefully:

# np-function command

We aim to support NumPy's np.* API to the fullest. We align with NumPy 2.4.2 as the source of truth and must provide the exact same API (np.* overloads) as NumPy does.
This session we are focusing on: """$ARGUMENTS"""
Your job is to work on np.* functions, no more than one at a time unless they are closely related.

The high-level development cycle for an np.* function is defined as follows:

## 1. Read, investigate, learn and experiment
Read how NumPy (src\NumPy\) implemented the np functions you are about to implement - noting all parameters and overloads.
NumPy is the source of truth and if NumPy does A, we do A but in NumSharp's C# way.

### Definition of Done:
- At the end of step 1 you fully understand:
- How the np function works internally in NumPy and how it reacts to inputs and parameters.
- What parameters the np function accepts and what modes it works in.
- What optimizations NumPy uses and what optimizations we can use.
- How best to integrate with our existing infrastructure.
- Whether to use ILKernelGenerator or NDIter to implement the loop.
- Do not implement a struct kernel.
## 2. Implement np method/s
- Implement np methods to the fullest, integrating into our existing infrastructure and patterns.
- Our implementation might differ from NumPy's because NumPy uses C++ macros while we generate IL methods at runtime to achieve peak performance and CPU acceleration. Even so, any input given to NumPy must produce the same output in NumSharp, with complete parity.
- Our implementation must provide the same parameters as the NumPy function and support all dtypes NumSharp currently supports.
- Do not create a function per dtype/NPTypeCode, or an if-else/switch-case per dtype/NPTypeCode, to call a specialized path.
- Do not use the struct kernel pattern.
- Do utilize IL generation (ILKernelGenerator) and/or NDIter to implement the function, including fast paths.
- Any loops must be implemented via NDIter or via ILKernelGenerator.

## Tools:
### Asserting, Validating, Comparing, Experimenting and Probing
Both "dotnet run <<'EOFDOTNET'" and "python <<'EOFPYTHON'" can be used to assert, validate, compare, test and confirm how behaviors, edge cases, parameter variations, happy flows and unhappy flows respond to given inputs.
These CLI commands allow rapid development and experimentation.
Paths given to '#:project' and other '#' directives must be absolute.

### Benchmarking
Use "dotnet run <<'EOFDOTNET'" and "python <<'EOFPYTHON'" to produce professional benchmarks.

#### Benchmarking Rules of Thumb
- We must be at least 1.5x as fast as NumPy across all execution variations and modes (all dtypes, all parameter combinations; see "Variations for Asserting, Validating, Comparing and Experimenting").
- There is a reason towards why NumPy does

## Optimizations and Implementation
Our codebase uses and follows the following techniques:

### A. Specialization & code generation

- Runtime IL emission per cache key — DynamicMethod generates a kernel once per (op, dtypes, layout) and the JIT compiles it to native; subsequent calls hit a ConcurrentDictionary lookup.
- Per-startup SIMD width baking — VectorBits resolved once via IsHardwareAccelerated; the emitted IL targets exactly one of V128/V256/V512 with no runtime width branch.
- Layout-specialized kernel paths — Generate distinct kernels for SimdFull / SimdScalarLeft / SimdScalarRight / SimdChunk / General instead of one kernel with runtime layout branches; layout becomes part of
the cache key.
- Signature collapse for fast paths — Contig kernels drop stride/shape args; scalar-broadcast kernels take T scalar not T*; cuts indirection and shrinks the IL body.
- Helper-call vs inline-IL choice — When an op has a tidy generic-constrained C# helper (e.g. CumSumHelperSameType<T>), the kernel emits a single Call and lets the JIT inline; only complex bodies inline the
IL loop themselves.
- Negative cache for unsupported combos — _castUnsupported/_maskedCastUnsupported record dtype pairs that fail IL gen so retries are O(1) instead of re-attempting emission.

### B. Loop shaping

- 4x-8x unrolling with independent accumulators — Body processes 4-8 vectors per iter into 4-8 separate accumulators; breaks the carried dependency so the CPU dispatches 4-8 SIMD ops/cycle.
- Three-stage loop — Unrolled SIMD body + 1-vector remainder + scalar tail; handles any count without padding.
- Inner-contig runtime dispatch — Inside strided kernels, compare each operand's stride to its element size; branch into the SIMD inner body when all match, else strided.
- Cache-friendly loop ordering — IKJ in MatMul so the inner SIMD walk is over sequential B[k,:] memory; A[i,k] is broadcast once and reused across all j.
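
As a rough illustration of the three-stage loop with independent accumulators, the sketch below simulates vectors as fixed-length Python lists. `LANES`, `UNROLL`, `vadd` and `unrolled_sum` are names made up for this example; real kernels would use hardware vectors rather than lists.

```python
LANES = 4   # assumed vector width in elements (illustrative)
UNROLL = 4  # vectors consumed per unrolled iteration (illustrative)


def vadd(acc, data, start):
    """Lane-wise add of one LANES-wide slice of data into an accumulator."""
    return [acc[j] + data[start + j] for j in range(LANES)]


def unrolled_sum(data):
    n = len(data)
    accs = [[0] * LANES for _ in range(UNROLL)]  # independent accumulators
    i = 0
    step = LANES * UNROLL

    # Stage 1: unrolled body, each accumulator gets its own vector.
    while i + step <= n:
        for u in range(UNROLL):
            accs[u] = vadd(accs[u], data, i + u * LANES)
        i += step

    # Stage 2: one-vector remainder loop.
    while i + LANES <= n:
        accs[0] = vadd(accs[0], data, i)
        i += LANES

    # Tree merge of the accumulators, then a horizontal reduce across lanes.
    a01 = [x + y for x, y in zip(accs[0], accs[1])]
    a23 = [x + y for x, y in zip(accs[2], accs[3])]
    total = sum(x + y for x, y in zip(a01, a23))

    # Stage 3: scalar tail for the last (n mod LANES) elements.
    while i < n:
        total += data[i]
        i += 1
    return total
```

With integer inputs the result equals a plain sequential sum for any length; with floats the changed summation order can alter rounding, which is one reason parity work has to pin the exact accumulation order.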

### C. SIMD primitives

- Mask→uint via ExtractMostSignificantBits — Convert a Vector mask to packed bits in a uint — the universal building block for All/Any/NonZero/CountTrue/CopyMasked.
- Bit-scan loop (TrailingZeroCount + bits &= bits-1) — Materialize lane indices from a packed mask one-at-a-time without a per-lane branch; standard idiom for sparse-extract.
- Self-equality NaN mask — Equals(v, v) produces lanes that are true for non-NaN (NaN ≠ NaN); used to zero/count out NaNs in NaN-aware reductions.
- Branchless ConditionalSelect — Per-lane gating without a branch; used by Where and masked cross-dtype copy.
- Scalar pre-broadcast — Vector.Create(scalar) hoisted into a local before the loop so the body re-uses it instead of reloading; used by scalar-broadcast variants of binary/where/clip.
- Op-specific identity seeding — Reduction accumulators are pre-loaded with 0 (Sum), 1 (Prod), MinValue (Max), MaxValue (Min) — also defines the empty-array result.
- Tree merge + horizontal halving — Multi-accumulator finalization: acc0 op= acc1; acc2 op= acc3; acc0 op= acc2, then horizontal reduce across the lanes.
- Early-exit on mask state — All/Any/IsAllZero return immediately when the packed bits hit the terminal pattern, skipping the rest of the array.
- Vectorized index discovery, scalar scatter — Even when the data store can't be vectorized (gather/scatter limits), the mask scan that finds the indices is fully SIMD.
- AVX2 gather for strided float/double — Strided axis reductions use intrinsic gather when the dtype is gather-capable.
- Width-adaptive emit via GetVectorContainerType() — One emission function picks Vector{128|256|512} methods through a cache; the same source code path covers all widths.
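
The mask-to-bits and bit-scan idioms above can be tried out with plain integers. The following is a minimal sketch, assuming a packed mask where bit i corresponds to lane i; `pack_mask` and `lane_indices` are illustrative names, and `bit_length` on the isolated lowest bit stands in for a hardware trailing-zero-count instruction.

```python
def pack_mask(lanes) -> int:
    """Pack per-lane booleans into an integer, lane i -> bit i."""
    bits = 0
    for i, flag in enumerate(lanes):
        if flag:
            bits |= 1 << i
    return bits


def lane_indices(bits: int) -> list:
    """Return the indices of set bits in ascending order."""
    indices = []
    while bits:
        lowest = bits & -bits                    # isolate the lowest set bit
        indices.append(lowest.bit_length() - 1)  # its trailing-zero count
        bits &= bits - 1                         # clear that bit
    return indices
```

The loop runs once per set bit rather than once per lane, which is why it suits sparse masks such as those seen in nonzero-style extraction.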

### D. Memory & pointer

- Cpblk IL intrinsic — Same-type contiguous copy emits the CLR block-memcpy opcode directly instead of a loop.
- Incremental coord advance — Outer-dim walks update offsets by adding strides rather than recomputing via flat → div/mod per element.
- Pre-computed dim strides in stack array — Axis kernels pre-build output-dim strides on the stack so each output index → input offset is O(ndim) muladds, no divmods.
- Pointer/stride prologue hoisting — Inner-loop factory snapshots dataptrs[i] and strides[i] into locals once at the top so the loop body works against locals, not memory loads.
- Pre-size-then-fill — np.nonzero runs an IL-emitted popcount first to size the output buffer, then a second IL-emitted bit-scan kernel writes indices; avoids the "alloc max-size temp" pathology.
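
To make the "incremental coord advance" item concrete, here is a minimal sketch that walks every element offset of a strided array twice: once by recomputing from the flat index with div/mod, and once by bumping coordinates and adding strides. Function names are hypothetical and strides are in elements, not bytes; the real kernels do this in emitted IL over pointers.

```python
def offsets_divmod(shape, strides, offset=0):
    """Reference walk: derive each offset from the flat index via div/mod."""
    total = 1
    for dim in shape:
        total *= dim
    out = []
    for flat in range(total):
        rem, off = flat, offset
        for dim, stride in zip(reversed(shape), reversed(strides)):
            rem, coord = divmod(rem, dim)
            off += coord * stride
        out.append(off)
    return out


def offsets_incremental(shape, strides, offset=0):
    """Same walk, but each step only adds or rewinds strides."""
    if 0 in shape:
        return []
    ndim = len(shape)
    coords = [0] * ndim
    off = offset
    out = []
    while True:
        out.append(off)
        axis = ndim - 1
        while axis >= 0:
            coords[axis] += 1
            off += strides[axis]
            if coords[axis] < shape[axis]:
                break
            # Axis wrapped: rewind it to coordinate 0 and carry outward.
            off -= strides[axis] * shape[axis]
            coords[axis] = 0
            axis -= 1
        if axis < 0:
            return out


# A transposed 2x3 view (strides permuted) visits offsets out of memory order.
print(offsets_incremental((2, 3), (1, 2)))
```

Both functions return the same sequence; the incremental version avoids a division per dimension per element.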

### E. Algorithmic

- Two-pass algorithms — ArgMax (find value → find index), Var/Std (mean → squared diffs), masked-copy (count → place). First pass enables vectorization; second pass exploits the known result.
- Monotonic-bound carry — searchsorted carries the lower bound L from the previous iteration when consecutive keys ascend, mirroring NumPy's binsearch.cpp.
- Short-circuit prescan — Quick SIMD all-zero check on a boolean mask short-circuits the whole np.where(cond) pipeline when the condition is fully false.
- Type-promotion-aware path skip — The SIMD reduction is skipped when the input dtype differs from the accumulator dtype (e.g. sum(int32)→int64) because Vector<T> can't widen lanes; falls back to scalar IL.
- Two-tier inner-loop API — Callers choose between Tier 3A (raw IL body) for full control or Tier 3B (scalar/vector body lambdas wrapped in the standard 4×-unrolled shell) for boilerplate elimination.
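
The monotonic-bound carry can be illustrated with a small left-sided binary search that reuses the previous answer as a bound. This is a sketch of the idea under assumed names, not a transcription of NumPy's binsearch.cpp or of the NumSharp kernel, and it ignores NaN ordering.

```python
def searchsorted_left(sorted_values, keys):
    """Left insertion index for each key, carrying bounds between keys."""
    n = len(sorted_values)
    lo, hi = 0, n
    last_key = None
    result = []
    for key in keys:
        if last_key is None or key >= last_key:
            # Keys did not decrease: the previous answer is still a valid
            # lower bound, so keep `lo` and only reset the upper bound.
            hi = n
        else:
            # Keys decreased: the previous answer becomes the upper bound.
            lo, hi = 0, lo
        last_key = key
        while lo < hi:
            mid = (lo + hi) // 2
            if sorted_values[mid] < key:
                lo = mid + 1
            else:
                hi = mid
        result.append(lo)
    return result


print(searchsorted_left([1, 3, 3, 5, 8], [0, 3, 4, 4, 9, 2]))
```

When the keys arrive sorted, each search starts from where the last one ended, so the searched range shrinks as the keys advance.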

### F. Cross-type bridging

- Decimal-via-double bridge — All transcendental decimal ops emit decimal→double→Math.*→decimal inline IL.
- Bool-mask lane expansion — 1-byte mask is widened through WidenLower chain to match the 1/2/4/8-byte data lane width before ConditionalSelect.
- Magnitude comparison for Complex — ArgMax/ArgMin on Complex compares |z|, since Complex has no native ordering.

### G. NumPy semantic compliance

- NumPy-overflow shift semantics — Branch on shift >= bitWidth returns 0 (or -1 for signed-negative right shift) instead of C# x << (n & 31) masking.
- Sign-preserving zero in Modf — Explicit fixup so modf(-0.0) = (-0.0, -0.0) and modf(+inf) = (+0.0, +inf) per C standard.
- Vacuous truth for empty reductions — all([])=True, any([])=False, identity-valued Sum/Prod/Max/Min for empty arrays.
- NEP50-aligned accumulator types — Reduction kernels promote int32→int64 for Sum/Prod/CumSum, dropping out of SIMD when needed.

### H. Reflection & caching

- MethodInfo cache (fail-fast at type load) — Math.*, Vector*.*, Decimal.* reflection resolved in static initializers with ?? throw; emission never pays GetMethod cost.
- Width-resolved generic method cache — VectorMethodCache.V(VectorBits, clrType) returns the right Vector{W}<T> type and Generic(VectorBits, name, clrType, paramCount) returns the right method handle.
- ConcurrentDictionary.GetOrAdd keyed by structural value — All kernel caches use struct keys with stable Equals/GetHashCode; thread-safe lazy init via GetOrAdd.


## Variations for Asserting, Validating, Comparing and Experimenting
These variations span the range of possible inputs for which our output must match NumPy's to reach complete parity.
Total: 51 distinct variations (25 single-array layouts, 6 pairwise paths, 8 per-operand flags, 8 iteration flags, 4 composite execution paths).

### A. Single-array layouts

- C-contiguous — Row-major, stride[-1]==1 and stride[i]==shape[i+1]*stride[i+1]; baseline fast path via IsContiguous.
- F-contiguous — Column-major, stride[0]==1; 1-D arrays are both. Detected via IsFContiguous.
- Strided / non-contiguous — Arbitrary strides, neither C nor F; built via step slicing or axis swap.
- Transposed — Strides permuted by .T / swapaxes / moveaxis; usually non-contig.
- Negative-stride view — Reversed slicing ([::-1]); strides are signed-negative.
- Simple slice — offset!=0, not broadcast; fast GetOffsetSimple path (IsSimpleSlice).
- Sliced + composed — a[1:5].T, a[1:3][:,None,:]; offset combined with permutation or broadcast.
- Broadcasted — stride=0 with dim>1 (BROADCASTED flag); read-only per NumPy.
- Scalar-broadcast — All strides zero (IsScalarBroadcast); load value once and reuse.
- Partial broadcast — Some axes stride=0, others not; common (1,N)→(M,N) case.
- Scalar (0-d) — ndim==0, size==1, no strides.
- 0-D view from integer indexing — a[0,0,0] shares storage; distinct from np.array(5.0) which owns.
- 1-element 1-D — ndim==1, size==1; ambiguous against 0-d in some paths.
- Empty — size==0 (e.g. np.zeros((0,3))); reductions must return identity.
- Empty + composed — np.zeros((0,3))[::2,:]; rare but must not crash.
- NewAxis-inserted dim — a[None,:] adds dim=1, stride=0; not flagged broadcast since dim=1.
- Singleton dim (dim=1) — Stride is moot; NumPy treats as contig.
- Higher-rank (5+D) — Stack-allocated coord/stride arrays in kernels may have a fixed capacity that high-rank inputs can exceed.
- Stride > bufferSize — Negative-stride views can have offset + stride*(dim-1) >= bufferSize.
- Reshape view vs copy — Reshape returns a view when contig allows, materializes otherwise.
- Fancy-indexed result — Always a fresh C-contig owning array, never a view.
- Boolean-mask result — Always a contig owning copy.
- Read-only / non-writeable — IsWriteable==false (set on broadcast views); writes throw.
- Non-owning view — OwnsData==false; writes alias the parent.
- Aligned — ALIGNED flag; always true for managed allocs but a real NumPy axis.
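
The C- and F-contiguity conditions at the top of this list can be checked from shape and strides alone. The sketch below uses element strides and hypothetical function names; it follows NumPy's convention that singleton dimensions and empty arrays do not break contiguity, and it is not the body of IsContiguous or IsFContiguous.

```python
def is_c_contiguous(shape, strides):
    """Row-major check: stride[-1] == 1 and stride[i] == shape[i+1] * stride[i+1]."""
    if 0 in shape:
        return True  # empty arrays count as contiguous
    expected = 1
    for dim, stride in zip(reversed(shape), reversed(strides)):
        if dim != 1 and stride != expected:  # the stride of a dim=1 axis is moot
            return False
        expected *= dim
    return True


def is_f_contiguous(shape, strides):
    """Column-major check: the same test with the axes walked first to last."""
    if 0 in shape:
        return True
    expected = 1
    for dim, stride in zip(shape, strides):
        if dim != 1 and stride != expected:
            return False
        expected *= dim
    return True


print(is_c_contiguous((2, 3), (3, 1)), is_f_contiguous((2, 3), (3, 1)))
print(is_c_contiguous((2, 3), (1, 2)), is_f_contiguous((2, 3), (1, 2)))
```

A broadcast axis such as shape (3, 4) with strides (0, 1) fails both checks, which is what routes it away from the contiguous fast path.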

### B. Pairwise (binary-op) paths — MixedTypeKernelKey.Path

- SimdFull — Both operands C-contig same dtype; SIMD baseline.
- SimdScalarRight — RHS is 0-d / scalar-broadcast, LHS is array.
- SimdScalarLeft — LHS is 0-d / scalar-broadcast, RHS is array.
- SimdChunk — Inner dim contig for both, outer strided.
- General — Arbitrary strides on either side; coordinate iteration.
- Mixed dtypes — Orthogonal axis: same layout, different LHS/RHS/result dtypes (NEP50 promotion).
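
One way to read the five layout paths is as a decision order over the two operands' strides. The classifier below is purely illustrative: the function names and the order of the checks are assumptions, dtype is ignored, and both operands are assumed to be already broadcast to a common shape (so a scalar operand shows up as all-zero strides). The actual logic behind MixedTypeKernelKey.Path is not shown in this document.

```python
def _c_contiguous(shape, strides):
    """Row-major contiguity on element strides, ignoring dim=1 axes."""
    expected = 1
    for dim, stride in zip(reversed(shape), reversed(strides)):
        if dim != 1 and stride != expected:
            return False
        expected *= dim
    return True


def _scalar_broadcast(strides):
    """True when every stride is zero, i.e. one value is reused everywhere."""
    return all(s == 0 for s in strides)


def classify_path(lhs, rhs):
    """Pick a path name for two (shape, strides) operands (hypothetical rules)."""
    (l_shape, l_strides), (r_shape, r_strides) = lhs, rhs
    l_contig = _c_contiguous(l_shape, l_strides)
    r_contig = _c_contiguous(r_shape, r_strides)
    if l_contig and r_contig:
        return "SimdFull"
    if l_contig and _scalar_broadcast(r_strides):
        return "SimdScalarRight"
    if r_contig and _scalar_broadcast(l_strides):
        return "SimdScalarLeft"
    if l_strides and r_strides and l_strides[-1] == 1 and r_strides[-1] == 1:
        return "SimdChunk"  # inner axis contiguous on both sides, outer axes strided
    return "General"


print(classify_path(((3, 4), (8, 1)), ((3, 4), (4, 1))))
```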

### C. Per-operand variations — NDIterOpFlags

- Aliased operands — Same buffer on both sides (a + a, out=a); no non-aliasing assumption.
- Overlapping views — Two views with partial overlap (a[1:] and a[:-1]); writes can clobber elements that have not been read yet.
- In-place output (out=) — Output aliases an input; loop order must respect read-before-write.
- Reduction operand — Output has stride=0 along the reduction axis (REDUCE flag).
- Write-masked operand — WRITEMASKED: write only where mask (ARRAYMASK) is true. Enforced ONLY at buffered copy-back (NumPy parity); unbuffered = kernel contract.
- Virtual operand — VIRTUAL: null operand, allocate-equivalent in NumPy 2.x (real backing array, dtype request discarded → common dtype).
- Buffered / casting operand — CAST / FORCECOPY / HAS_WRITEBACK: type conversion needs a temp.
- Read-only operand — READ without WRITE; matters for output selection.

### D. Iteration-level variations — NDIterFlags

- Coalesced dimensions — Consecutive axes with matching strides collapsed; ndim=4 may arrive as ndim=1.
- IDENTPERM vs NEGPERM — Axis iteration order: identity vs flipped (negative stride on some axis).
- External loop (EXLOOP) — Kernel sees only the inner axis; outer loop driven by iterator.
- Ranged iteration (RANGE) — Traverses only a sub-range of the full iteration index.
- GROWINNER — Inner-loop length varies across outer iterations.
- GATHER_ELIGIBLE — Strided inner axis but dtype supports AVX2 gather.
- Early exit — short-circuit (All/Any/IsAllZero) is a KERNEL property (`SupportsEarlyExit`/`ShouldExit`), not an iterator flag.
- PARALLEL_SAFE — iteration range splittable across workers: no REDUCE operand, ≤1 WRITE operand with COPY_IF_OVERLAP-resolved overlap (`IsParallelSafe`).
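
Dimension coalescing can be sketched for a single operand: two adjacent axes merge when stepping once along the outer axis lands exactly where the inner axis would end, that is, when `outer_stride == inner_dim * inner_stride`. The function below is a simplified, hypothetical version; a real iterator needs the condition to hold for every operand at once, and singleton or zero-size axes need extra care.

```python
def coalesce(shape, strides):
    """Merge adjacent axes that together form one regularly strided run."""
    if not shape:
        return (), ()
    out_shape = [shape[0]]
    out_strides = [strides[0]]
    for dim, stride in zip(shape[1:], strides[1:]):
        if out_strides[-1] == dim * stride:
            # The outer step equals the full extent of this axis: fold it in.
            out_shape[-1] *= dim
            out_strides[-1] = stride
        else:
            out_shape.append(dim)
            out_strides.append(stride)
    return tuple(out_shape), tuple(out_strides)


print(coalesce((2, 3, 4), (12, 4, 1)))   # fully contiguous 3-D array
print(coalesce((2, 3, 4), (24, 4, 1)))   # outer axis padded, inner two merge
```

A fully contiguous array of any rank collapses to one axis, which is how a 4-D input can reach the kernel as a 1-D loop.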

### E. NDIter composite execution paths

- Source-broadcast + dest-contig — Common reduction shape.
- Source-contig + dest-strided — Writing into a sliced output.
- Buffer-required path — Dtype mismatch or alignment forces NDIter to insert a temp; kernel sees contig but indirect.
- Reused reduce loops — REUSE_REDUCE_LOOPS: inner-loop kernel runs against successive output positions without re-derivation.

37 changes: 0 additions & 37 deletions .claude/skills/np-function/SKILL.md

This file was deleted.

11 changes: 7 additions & 4 deletions .git-blame-ignore-revs
@@ -1,11 +1,14 @@
# TUnit to MSTest v3 attribute migration
ac020336
ac020336afa4d1b9bb36f3d114c04d79f739159e

# TUnit assertions to AwesomeAssertions migration
b1d1d543
b1d1d54333c555dd29f7182e9317da8cae1cccd8

# Add [TestClass] attributes for MSTest discovery
e0db3c3e
e0db3c3ed2bac06c255d4b1b7a327efc4e05447b

# Convert async Task to void for sync test methods
4eea9644
4eea9644c8eaade8d2db1322bc6579023133898b

# Rename Npy* iterator/expression stack to ND* (mechanical, no logic change)
301229b6100e23aa3d8ab308de3bb543d50507dd