WIP: PERF: OOC-optimized algorithm variants for 30+ filters #1575
Draft
joeykleingers wants to merge 29 commits into
Adds the core out-of-core (OOC) architecture to simplnx: a plugin-
driven hook system that lets the SimplnxOoc plugin own OOC storage
policy while simplnx core supplies the mechanism. Core gains bulk I/O
primitives that work identically for in-core and OOC stores, a
DataIOCollection callback API for the plugin to drive format
selection / post-import finalization / recovery-file write
interception, a strided N-dimensional Extent type, an EmptyStringStore
placeholder, a MemoryBudgetManager singleton that cache subsystems
register against, a unified DispatchAlgorithm template, and a
redesigned Dream3dIO public API (LoadDataStructure family plus a
WriteRecoveryFile with optional UserDataFilePath redirect). The
chunk-method skeleton previously exposed on AbstractDataStore is
removed in favor of the bulk I/O primitives. 92 files changed,
+5954 / -2707 lines.
================================================================================
1. AbstractDataStore bulk I/O API
================================================================================
src/simplnx/DataStructure/AbstractDataStore.hpp,
src/simplnx/DataStructure/DataStore.hpp,
src/simplnx/DataStructure/EmptyDataStore.hpp,
src/simplnx/DataStructure/IDataStore.hpp
Two bulk I/O primitives on AbstractDataStore<T>:
Result<> copyIntoBuffer(usize startIndex, std::span<T>) const
Result<> copyFromBuffer(usize startIndex, std::span<const T>)
Plus a strided N-dimensional access pair driven by Extent:
virtual Result<> readExtent(const Extent&, std::span<T>) const = 0
virtual Result<> writeExtent(const Extent&, std::span<const T>) = 0
DataStore<T> implements the 3D fast path via std::memcpy on contiguous
X rows (ZYX tuple order). The memcpy branch is guarded behind
!std::is_same_v<T, bool> because std::vector<bool> has no .data()
accessor; bool falls back to per-tuple copy. 1D extents use a flat
strided walk. EmptyDataStore returns empty / no-ops.
StoreType is reduced to three values (InMemory, OutOfCore, Empty).
IsOutOfCore() checks StoreType directly. getRecoveryMetadata() is
pure virtual on IDataStore — DataStore returns an empty map (in-
memory stores have no backing file); EmptyDataStore throws, matching
the fail-fast pattern of every other data-access method on this
metadata-only placeholder.
The previously exposed chunk-method skeleton on AbstractDataStore
(loadChunk, getNumberOfChunks, getChunkLowerBounds,
getChunkUpperBounds, getChunkShape, getChunkSize, getChunkTupleShape,
getChunkExtents, convertChunkToDataStore) is removed; the bulk I/O
primitives subsume it. Callers across the codebase are migrated.
================================================================================
2. Plugin hook system (DataIOCollection / IDataIOManager)
================================================================================
src/simplnx/DataStructure/IO/Generic/DataIOCollection.{hpp,cpp},
src/simplnx/DataStructure/IO/Generic/IDataIOManager.{hpp,cpp},
src/simplnx/Core/Application.cpp,
src/simplnx/Utilities/DataStoreUtilities.{hpp,cpp}
DataIOCollection carries three plugin-registered callbacks that let
the SimplnxOoc plugin drive OOC policy without core knowing the
details:
FormatResolverFnc(DataStructure, DataPath, DataType, dataSizeBytes)
Decides storage format per array. The expanded signature lets the
resolver walk parents to determine geometry type, so it can force
in-core for unstructured / poly geometry arrays without caller-
side checks.
DataStoreImportHandlerFnc(DataStructure, hdf5File, importStructure,
EagerLoadFnc)
Post-import callback that finalizes placeholder stores after all
HDF5 objects are read. Handles in-core eager loading, OOC
reference stores, and recovery reattachment in one place. The
EagerLoadFnc callback lets the handler eager-load individual
arrays without knowing Dream3dIO internals.
WriteArrayOverrideFnc
Intercepts HDF5 writes during recovery-file creation so the
plugin can emit lightweight placeholder datasets instead of full
data. Activated via the RAII WriteArrayOverrideGuard, wired into
DataStructureWriter.
IDataIOManager gains factory registration for ListStoreRefCreateFnc,
StringStoreCreateFnc, and FinalizeStoresFnc, with delegating creation
methods on DataIOCollection. "Simplnx-Default-In-Memory" is a
reserved format name: addIOManager() rejects plugin registrations
under it (the core CoreDataIOManager is installed directly into the
map bypassing the guard). addIOManager() returns Result<> rather than
throwing.
DataIOCollection::generateManagerListString() produces a multi-line
capability matrix of every registered IO manager and the store types
it supports; the helper is surfaced in CreateArray's "unknown format"
error message so users see which formats are available when a request
fails. A format display-name registry (registerFormatDisplayName /
getFormatDisplayNames) supplies human-readable names for the
DataStoreFormatParameter dropdown — "Automatic", "In Memory", plus
plugin-registered formats with display names.
DataStoreUtilities::GetIOCollection() and
Application::getIOCollection() return DataIOCollection& rather than
std::shared_ptr — the collection is owned by the Application
singleton that outlives every caller, so a reference expresses non-
ownership clearly. WriteArrayOverrideGuard holds a DataIOCollection&
member; the "may be null no-op" path is dropped (no caller used it).
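A rough sketch of the RAII activation described for the write-override hook, under stated assumptions: `HookCollection` and `OverrideGuard` are hypothetical simplifications, and the real callback signature is not shown in this description.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>

// Hypothetical, simplified collection holding one write-override callback.
class HookCollection
{
public:
  // Returns true when the callback handled the write itself.
  using WriteOverrideFnc = std::function<bool(const std::string&)>;

  void setWriteOverride(WriteOverrideFnc fnc)
  {
    assert(fnc != nullptr); // registering an empty callback is a caller bug
    m_Override = std::move(fnc);
  }

  void setOverrideActive(bool active)
  {
    m_Active = active;
  }

  bool overrideActive() const
  {
    return m_Active;
  }

  // The writer asks the hook first; false means "use the normal write path".
  bool tryOverride(const std::string& arrayName) const
  {
    return m_Active && m_Override != nullptr && m_Override(arrayName);
  }

private:
  WriteOverrideFnc m_Override;
  bool m_Active = false;
};

// Holds a reference (never null) and restores the previous state on exit.
class OverrideGuard
{
public:
  explicit OverrideGuard(HookCollection& collection)
  : m_Collection(collection)
  , m_Previous(collection.overrideActive())
  {
    m_Collection.setOverrideActive(true);
  }

  ~OverrideGuard()
  {
    m_Collection.setOverrideActive(m_Previous);
  }

  OverrideGuard(const OverrideGuard&) = delete;
  OverrideGuard& operator=(const OverrideGuard&) = delete;

private:
  HookCollection& m_Collection;
  bool m_Previous;
};
```

Because the guard stores a reference, there is no null path to handle, which mirrors the reasoning given above for dropping the `shared_ptr`.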
================================================================================
3. Extent type and strided N-D access
================================================================================
src/simplnx/Common/Extent.hpp,
test/ExtentTest.cpp
nx::core::Extent is a strided N-dimensional range struct (lower /
upper bounds, dimension count, per-dimension stride). It is the
parameter type for the new readExtent / writeExtent virtuals on
AbstractDataStore, giving the visualization layer one virtual call
per slab instead of hand-rolled per-row copyIntoBuffer loops.
The header #undef's min and max at the top so MSVC's transitively
included Windows.h macros don't expand against the struct's member
names in the constructor initializer list — without the guard, the
parser failure cascades into fmt/color.h namespace errors on v143
Windows builds.
ExtentTest covers construction, contains / overlaps / intersect,
equality, and strided semantics (82 assertions).
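For readers unfamiliar with strided ranges, a small sketch of the idea follows. `ExtentSketch`, its field names, the fixed 3-D limit, and the half-open `[lower, upper)` convention are all assumptions made for illustration; the actual nx::core::Extent layout may differ.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical strided range; not the nx::core::Extent definition.
struct ExtentSketch
{
  static constexpr std::size_t k_MaxDims = 3;

  std::array<std::size_t, k_MaxDims> lower{};
  std::array<std::size_t, k_MaxDims> upper{}; // exclusive (assumption)
  std::array<std::size_t, k_MaxDims> stride{1, 1, 1};
  std::size_t numDims = 0;

  // Sampled points along one dimension: ceil((upper - lower) / stride).
  std::size_t count(std::size_t dim) const
  {
    assert(dim < numDims && stride[dim] > 0 && upper[dim] >= lower[dim]);
    return (upper[dim] - lower[dim] + stride[dim] - 1) / stride[dim];
  }

  // Number of elements a readExtent-style call would have to deliver.
  std::size_t numElements() const
  {
    std::size_t total = 1;
    for(std::size_t dim = 0; dim < numDims; dim++)
    {
      total *= count(dim);
    }
    return total;
  }

  // True when index is inside the bounds and on the stride lattice.
  bool contains(const std::array<std::size_t, k_MaxDims>& index) const
  {
    for(std::size_t dim = 0; dim < numDims; dim++)
    {
      if(index[dim] < lower[dim] || index[dim] >= upper[dim])
      {
        return false;
      }
      if((index[dim] - lower[dim]) % stride[dim] != 0)
      {
        return false;
      }
    }
    return true;
  }
};
```

A caller would size its output span with `numElements()` before issuing one strided read per slab.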
================================================================================
4. EmptyStringStore placeholder
================================================================================
src/simplnx/DataStructure/EmptyStringStore.hpp,
test/EmptyStringStoreTest.cpp,
src/simplnx/DataStructure/IO/HDF5/StringArrayIO.cpp
EmptyStringStore is the string-array analog of EmptyDataStore: a
placeholder that stores only tuple-shape metadata. All data-access
methods throw std::runtime_error; isPlaceholder() returns true (vs
false for StringStore). StringArrayIO creates an EmptyStringStore
in OOC mode instead of allocating numValues empty strings up-front,
deferring full-string materialization to the plugin's handler.
Tests cover metadata, zero tuples, throwing access, deep-copy
placeholder preservation, resize, and isPlaceholder.
================================================================================
5. Dream3dIO public API
================================================================================
src/simplnx/Utilities/Parsing/DREAM3D/Dream3dIO.{hpp,cpp},
test/Dream3dLoadingApiTest.cpp,
test/UnitTestCommon/include/simplnx/UnitTest/UnitTestCommon.{hpp,cpp}
The Dream3dIO public API is rebuilt around four purpose-specific
loaders:
LoadDataStructure(path) — full load with OOC handler support
LoadDataStructureArrays(path, dataPaths) — selective load with pruning
LoadDataStructureMetadata(path) — metadata-only skeleton (preflight)
LoadDataStructureArraysMetadata(path, dataPaths) — pruned metadata skeleton
This decouples pipeline loading from DataStructure loading, replaces
a bool-flagged preflight parameter with distinct functions, and
centralizes OOC handler integration in a single internal
LoadDataStructureWithHandler. The internal helpers
(LoadDataObjectFromHDF5, EagerLoadDataFromHDF5, PruneDataStructure,
LoadDataStructureWithHandler) move into an anonymous namespace. The
legacy ReadFile / ImportDataStructureFromFile / FinishImportingObject
entry points are removed.
ReadDREAM3DFilter calls LoadDataStructureMetadata at preflight and no
longer manages HDF5 file handles. ImportH5ObjectPathsAction calls
LoadDataStructureMetadata at preflight and LoadDataStructure at
execute; the action no longer manages file handles or deferred
loading itself and delegates to the registered
DataStoreImportHandlerFnc when present, or falls back to
FinishImportingObject for non-OOC builds.
WriteRecoveryFile() wraps WriteFile with WriteArrayOverrideGuard so
the plugin's placeholder-write hook fires. It also accepts an
optional userDataFilePath parameter: when set, the writer emits a
minimal HDF5 file containing only the file-version tag plus a root-
level "UserDataFilePath" string attribute. In that mode the
DataStructure and Pipeline parameters are ignored — the user's
authoritative .dream3d output (written by a trailing
WriteDREAM3DFilter in the pipeline) is the real data carrier. The
recovery scanner reads the attribute back at relaunch time to
redirect the load. Companion reader:
DREAM3D::ReadUserDataFilePathAttribute. Existing call sites are
source-compatible; the parameter defaults to std::nullopt.
UnitTest::LoadDataStructure and LoadDataStructureMetadata helpers
delegate directly to the new functions. Dream3dLoadingApiTest covers
all four loaders. Test callers (ComputeIPFColorsTest,
RotateSampleRefFrameTest, DREAM3DFileTest, H5Test) are migrated.
================================================================================
6. Format resolution at the array creation layer
================================================================================
src/simplnx/Utilities/DataStoreUtilities.{hpp,cpp},
src/simplnx/Utilities/ArrayCreationUtilities.hpp,
src/simplnx/Filter/Actions/CreateArrayAction.{hpp,cpp},
src/simplnx/Filter/Actions/ImportH5ObjectPathsAction.{hpp,cpp},
src/simplnx/Filter/Actions/Create{Vertex,1D,2D,3D}GeometryAction.{hpp,cpp},
src/simplnx/Filter/Actions/CreateRectGridGeometryAction.{hpp,cpp},
+ 12 filter call-sites across SimplnxCore and OrientationAnalysis
resolveFormat() is called from ArrayCreationUtilities::CreateArray
and ImportH5ObjectPathsAction; DataStoreUtilities::CreateDataStore
and CreateListStore are simple factories that take an already-
resolved format string.
CreateArrayAction carries a dataFormat member for per-filter format
override; when non-empty it bypasses the resolver. The dropdown shows
"Automatic", "In Memory", and any plugin-registered formats with
display names. 12 filter callers are fixed where fillValue was being
passed as dataFormat after a parameter reordering.
Geometry creation actions drop their createdDataFormat parameter and
materialize OOC topology arrays into in-core stores when source
arrays carry StoreType::OutOfCore — unstructured / poly geometry
topology must be in-core for the visualization layer.
ArrayCreationUtilities::CheckMemoryRequirement is a pure RAM check
and recognizes both the empty string and k_InMemoryFormat as in-core
sentinels (avoiding the prior overload where "" meant both "unset"
and "in-memory").
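The precedence described in this section (explicit per-filter format first, then the resolver, then in-core) might look roughly like the sketch below. `ResolveFormatSketch`, `IsInCoreFormat`, the simplified resolver signature, and the string value assigned to `k_InMemoryFormat` are assumptions for illustration only.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>

// Assumed sentinel value; the real constant's value is not given above.
inline const std::string k_InMemoryFormat = "Simplnx-Default-In-Memory";

// Simplified resolver: the real callback also receives the DataStructure,
// DataPath, and DataType.
using ResolverFnc = std::function<std::string(std::uint64_t dataSizeBytes)>;

std::string ResolveFormatSketch(const std::string& explicitFormat, const ResolverFnc& resolver, std::uint64_t dataSizeBytes)
{
  if(!explicitFormat.empty())
  {
    return explicitFormat; // per-filter override bypasses the resolver
  }
  if(resolver)
  {
    std::string resolved = resolver(dataSizeBytes);
    if(!resolved.empty())
    {
      return resolved; // plugin picked a format for this array
    }
  }
  return k_InMemoryFormat; // no plugin opinion: stay in-core
}

// Both "" and the sentinel count as in-core for the RAM check.
bool IsInCoreFormat(const std::string& format)
{
  return format.empty() || format == k_InMemoryFormat;
}
```

Factories downstream of this step only ever see an already-resolved string, so they need no policy logic of their own.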
================================================================================
7. HDF5 I/O integration
================================================================================
src/simplnx/DataStructure/IO/HDF5/DataStoreIO.hpp,
src/simplnx/DataStructure/IO/HDF5/DataArrayIO.hpp,
src/simplnx/DataStructure/IO/HDF5/NeighborListIO.hpp,
src/simplnx/DataStructure/IO/HDF5/DataStructureWriter.cpp,
src/simplnx/Utilities/Parsing/HDF5/IO/DatasetIO.{hpp,cpp}
DataStoreIO::ReadDataStoreIntoMemory (the in-memory load path) gains
a placeholder-detection guard that compares physical HDF5 element
count against shape attributes and returns Result<> with a warning
when they mismatch — catching attempts to load placeholder datasets
without the OOC plugin. Return type becomes
Result<shared_ptr<AbstractDataStore<T>>> so callers can accumulate
warnings across arrays.
DataArrayIO::writeData always calls WriteDataStore. OOC stores
materialize their data through the plugin's writeHdf5(); recovery
writes flow through WriteArrayOverrideFnc.
NeighborListIO has an OOC interception path: it computes total
neighbor count, calls resolveFormat(), and creates a read-only
reference list-store when an OOC format is available. Legacy
NeighborList reading threads a preflight flag through the chain
(readLegacyNeighborList → createLegacyNeighborList → ReadHdf5Data) so
legacy .dream3d imports create EmptyListStore placeholders rather
than eagerly loading per-element via setList().
DataStructureWriter consults WriteArrayOverrideFnc first, giving the
registered plugin callback first chance to handle each data object.
DatasetIO adds explicit template instantiations of createEmptyDataset
and writeSpanHyperslab for all numeric types plus bool. The plugin's
AbstractOocStore::writeHdf5() cannot use writeSpan() (the full array
isn't in memory); it creates an empty dataset, then fills it region-
by-region via hyperslab writes as it streams from the backing file.
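The placeholder-detection guard amounts to comparing two element counts. A minimal sketch, assuming hypothetical helper names (`ElementCount`, `CheckForPlaceholder`) and using `std::optional<std::string>` in place of a `Result<>` carrying a warning:

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <numeric>
#include <optional>
#include <string>
#include <vector>

using Shape = std::vector<std::uint64_t>;

// Element count promised by the shape attributes: tuples * components.
std::uint64_t ElementCount(const Shape& tupleShape, const Shape& componentShape)
{
  const auto product = [](const Shape& dims) {
    return std::accumulate(dims.begin(), dims.end(), std::uint64_t{1}, std::multiplies<std::uint64_t>());
  };
  return product(tupleShape) * product(componentShape);
}

// Returns a warning when the dataset on disk is not the size its
// attributes describe, which is what a placeholder dataset looks like.
std::optional<std::string> CheckForPlaceholder(std::uint64_t physicalCount, const Shape& tupleShape, const Shape& componentShape)
{
  const std::uint64_t expected = ElementCount(tupleShape, componentShape);
  if(physicalCount == expected)
  {
    return std::nullopt;
  }
  return "Dataset holds " + std::to_string(physicalCount) + " elements but its shape attributes describe " + std::to_string(expected) + "; it may be an OOC placeholder";
}
```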
================================================================================
8. MemoryBudgetManager and Preferences
================================================================================
src/simplnx/Utilities/MemoryBudgetManager.{hpp,cpp},
test/MemoryBudgetManagerTest.cpp,
src/simplnx/Core/Preferences.{hpp,cpp}
MemoryBudgetManager is a unified, thread-safe singleton in simplnx
core. Cache subsystems (the plugin's ChunkCache, visualization stride
cache, etc.) register allocations against it, and it evicts globally
oldest entries via callbacks when memory pressure exceeds the
configured budget. Living in core lets non-OOC builds and
visualization code use it without a plugin dependency, and the 50%-
of-system-RAM default is computed in-place via
MemoryBudgetManager::defaultBudgetBytes() — no plugin-startup
override race.
Preferences gains the persisted "memory_budget_bytes" key and
memoryBudgetBytes() / setMemoryBudgetBytes(value) accessors. The
k_InMemoryFormat sentinel constant is added for explicit in-core
format choice; migration logic erases legacy empty-string and
"In-Memory" preference values. checkUseOoc() tests against
k_InMemoryFormat. setLargeDataFormat("") removes the key so plugin
defaults take effect.
forceOocData is no longer gated on m_UseOoc in either the getter or
the setter. The previous gate (returning true only when the selected
format was something other than the in-memory sentinel) silently
dropped force_ooc_data writes whenever the user's large-data format
was the default in-memory sentinel — defeating the toggle's intent.
The OOC format resolver already handles the (forceOoc=true,
userChoseInMemory=true) case correctly; removing the gate makes the
toggle behave as designed.
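The register-then-evict-oldest behavior described for MemoryBudgetManager can be illustrated with a small sketch. `BudgetSketch` is hypothetical; in particular, keeping the newest entry alive even when it alone exceeds the budget and running callbacks outside the lock are design assumptions, not documented behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>
#include <mutex>
#include <utility>

class BudgetSketch
{
public:
  using EvictFnc = std::function<void()>;

  explicit BudgetSketch(std::size_t budgetBytes)
  : m_Budget(budgetBytes)
  {
  }

  // Records an allocation, then evicts oldest-first until usage fits.
  void registerAllocation(std::size_t bytes, EvictFnc onEvict)
  {
    assert(onEvict != nullptr);
    std::list<Entry> evicted;
    {
      std::lock_guard<std::mutex> lock(m_Mutex);
      m_Entries.push_back(Entry{bytes, std::move(onEvict)});
      m_Used += bytes;
      // Never evict the entry that was just added (assumption).
      while(m_Used > m_Budget && m_Entries.size() > 1)
      {
        m_Used -= m_Entries.front().bytes;
        evicted.splice(evicted.end(), m_Entries, m_Entries.begin());
      }
    }
    // Callbacks run outside the lock so they may re-enter the manager.
    for(const Entry& entry : evicted)
    {
      entry.onEvict();
    }
  }

  std::size_t usedBytes() const
  {
    std::lock_guard<std::mutex> lock(m_Mutex);
    return m_Used;
  }

private:
  struct Entry
  {
    std::size_t bytes;
    EvictFnc onEvict;
  };

  std::size_t m_Budget;
  std::size_t m_Used = 0;
  std::list<Entry> m_Entries;
  mutable std::mutex m_Mutex;
};
```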
================================================================================
9. Algorithm infrastructure (AlgorithmDispatch / IParallelAlgorithm)
================================================================================
src/simplnx/Utilities/AlgorithmDispatch.hpp,
src/simplnx/Utilities/IParallelAlgorithm.{hpp,cpp},
test/IParallelAlgorithmTest.cpp,
CMakeLists.txt
AlgorithmDispatch adds ForceInCoreAlgorithm and ForceOocAlgorithm
global flags with RAII guards, plus a DispatchAlgorithm template that
selects between Direct (in-core) and Scanline (OOC) algorithm
variants based on involved stores' types and the force flags.
SIMPLNX_TEST_ALGORITHM_PATH is a new CMake option (0 = both,
1 = OOC-only, 2 = InCore-only) for dual-dispatch test control.
IParallelAlgorithm drops the blanket TBB-disable for OOC data — OOC
stores are thread-safe via the plugin's ChunkCache plus the HDF5
global mutex. CheckStoresInMemory and CheckArraysInMemory use
StoreType directly instead of getDataFormat().
IParallelAlgorithmTest covers TBB enablement across in-memory, OOC,
and mixed array / store combinations using a MockOocDataStore that
overrides readExtent / writeExtent and provides a no-op
getRecoveryMetadata.
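As an illustration of the dispatch rule (force flags first, then store types), here is a compact sketch. All names (`DispatchSketch`, `ForceOocFlag`, `ForceOocGuard`, `Path`, the two variant structs) are hypothetical, and treating the two force flags as mutually exclusive is an assumption rather than documented behavior.

```cpp
#include <cassert>
#include <initializer_list>
#include <utility>

enum class StoreType { InMemory, OutOfCore, Empty };
enum class Path { Direct, Scanline };

// Global force flags, modeled as function-local statics.
inline bool& ForceOocFlag()
{
  static bool s_Flag = false;
  return s_Flag;
}

inline bool& ForceInCoreFlag()
{
  static bool s_Flag = false;
  return s_Flag;
}

// RAII guard: sets the OOC force flag and restores the old value on exit.
class ForceOocGuard
{
public:
  explicit ForceOocGuard(bool value)
  : m_Previous(ForceOocFlag())
  {
    ForceOocFlag() = value;
  }
  ~ForceOocGuard()
  {
    ForceOocFlag() = m_Previous;
  }
  ForceOocGuard(const ForceOocGuard&) = delete;
  ForceOocGuard& operator=(const ForceOocGuard&) = delete;

private:
  bool m_Previous;
};

// Builds and runs ScanlineT when forced or when any store is OOC,
// otherwise DirectT. Both functors must return the same type.
template <typename DirectT, typename ScanlineT, typename... ArgsT>
auto DispatchSketch(std::initializer_list<StoreType> stores, ArgsT&&... args)
{
  assert(!(ForceInCoreFlag() && ForceOocFlag())); // assumed mutually exclusive
  bool anyOoc = false;
  for(const StoreType type : stores)
  {
    anyOoc = anyOoc || type == StoreType::OutOfCore;
  }
  const bool useScanline = ForceOocFlag() || (!ForceInCoreFlag() && anyOoc);
  if(useScanline)
  {
    return ScanlineT(std::forward<ArgsT>(args)...)();
  }
  return DirectT(std::forward<ArgsT>(args)...)();
}

// Toy variants that only report which path ran.
struct DirectVariant
{
  Path operator()() const { return Path::Direct; }
};

struct ScanlineVariant
{
  Path operator()() const { return Path::Scanline; }
};
```

In a test, wrapping a call in `ForceOocGuard guard(true);` exercises the Scanline variant even when every store is in memory.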
================================================================================
10. Filter / utility callers updated for the bulk I/O API
================================================================================
src/Plugins/SimplnxCore/src/SimplnxCore/Filters/Algorithms/FillBadData.cpp,
src/Plugins/SimplnxCore/src/SimplnxCore/Filters/Algorithms/QuickSurfaceMesh.cpp,
src/Plugins/SimplnxCore/src/SimplnxCore/utils/VtkUtilities.cpp,
src/Plugins/SimplnxCore/src/SimplnxCore/Filters/WriteVtkRectilinearGridFilter.cpp,
+ ITK and OrientationAnalysis filters / tests
FillBadData::phaseOneCCL and phaseThreeRelabeling use Z-slab buffered
I/O via copyIntoBuffer / copyFromBuffer rather than the removed chunk
methods. operator()() scans feature counts in 64K-element chunks via
copyIntoBuffer.
QuickSurfaceMesh::generateTripleLines() no longer queries
getChunkShape() to set ParallelData3DAlgorithm chunk size.
VtkUtilities binary write reads into 4096-element buffers via
copyIntoBuffer, byte-swaps inside the buffer, and fwrites — replacing
direct DataStore data()-pointer access. WriteVtkRectilinearGrid
propagates Result<> errors from copyIntoBuffer / copyFromBuffer.
ITKArrayHelper / ITKTestBase OOC checks use getStoreType() instead of
getDataFormat().empty(). IsArrayInMemory collapses from a 40-line
DataType switch to a single StoreType check. ArraySelectionParameter
drops EmptyOutOfCore handling; only StoreType::Empty remains.
The bulk I/O APIs (DataIOCollection::addIOManager and the
AbstractDataStore copyIntoBuffer / copyFromBuffer family) return
Result<> rather than throwing std::runtime_error / std::out_of_range;
all call sites — Application::loadPlugin, the
Create{Vertex,1D,2D,3D}Geometry actions, WriteVtkRectilinearGrid,
FillBadData phases — early-return on error.
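The buffered byte-swap write described for VtkUtilities follows a read, swap-in-buffer, write loop. The sketch below is a hedged approximation: `ByteSwap` and `SwapInChunks` are hypothetical names, a `std::vector` stands in for the data store, and appending to an output byte vector stands in for `fwrite`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Reverses the byte order of one trivially copyable value.
template <typename T>
T ByteSwap(T value)
{
  static_assert(std::is_trivially_copyable_v<T>, "ByteSwap needs a trivially copyable type");
  unsigned char bytes[sizeof(T)];
  std::memcpy(bytes, &value, sizeof(T));
  std::reverse(bytes, bytes + sizeof(T));
  std::memcpy(&value, bytes, sizeof(T));
  return value;
}

// Streams source through a fixed-size buffer, swapping each chunk in
// place before emitting its raw bytes.
template <typename T>
std::vector<unsigned char> SwapInChunks(const std::vector<T>& source, std::size_t chunkSize = 4096)
{
  assert(chunkSize > 0);
  std::vector<unsigned char> out;
  out.reserve(source.size() * sizeof(T));
  std::vector<T> buffer(chunkSize);
  for(std::size_t offset = 0; offset < source.size(); offset += chunkSize)
  {
    const std::size_t n = std::min(chunkSize, source.size() - offset);
    // Stand-in for copyIntoBuffer(offset, span).
    std::copy_n(source.begin() + static_cast<std::ptrdiff_t>(offset), n, buffer.begin());
    std::transform(buffer.begin(), buffer.begin() + static_cast<std::ptrdiff_t>(n), buffer.begin(), [](T v) { return ByteSwap(v); });
    // Stand-in for fwrite(buffer.data(), sizeof(T), n, file).
    const auto* raw = reinterpret_cast<const unsigned char*>(buffer.data());
    out.insert(out.end(), raw, raw + n * sizeof(T));
  }
  return out;
}
```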
================================================================================
11. Tests
================================================================================
test/Dream3dLoadingApiTest.cpp (new)
test/ExtentTest.cpp (new)
test/IParallelAlgorithmTest.cpp (new)
test/MemoryBudgetManagerTest.cpp (new)
test/DataIOCollectionHooksTest.cpp (new)
test/EmptyStringStoreTest.cpp (new)
test/IOFormat.cpp (extended)
DataIOCollectionHooksTest exercises the format-resolver and data-
store-import-handler hooks. IOFormat tests cover the InMemory
sentinel, empty-format handling, and resolveFormat with / without a
registered plugin; the "Target DataStructure Size" tautology test is
removed.
RodriguesConvertorTest exemplar gains the missing expected values for
the 4th tuple (indices 12-15) — the old CompareDataArrays broke on
the first floating-point mismatch and masked the gap; the new
chunked comparison correctly continues past epsilon-close differences
and exposed it.
================================================================================
12. Style and incidental cleanup
================================================================================
docs/superpowers/{plans,specs}/2026-04-07-arraycalculator-memory-optimization*.md
(deletions),
Dream3dIO.cpp, ImportH5ObjectPathsAction.cpp, DataIOCollection.cpp,
H5Test.cpp, UnitTestCommon.cpp, DREAM3DFileTest.cpp,
ComputeIPFColorsTest.cpp
Two markdown files under docs/superpowers/ that were merged into
develop by mistake are removed. Seven .cpp files that spelled out
std::filesystem repeatedly gain the namespace fs = std::filesystem
alias for consistency with the existing codebase convention.
================================================================================
Verification
================================================================================
The squashed commit was not built or test-run in this session. The
squash introduces no code changes — `git diff <squashed-commit>
<original-branch-tip>` is empty (confirmed in step 8 of the squash
workflow).
Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
Rename 13 algorithm files to their in-core variant names in preparation
for adding OOC (out-of-core) dispatch alternatives. This enables git
rename tracking so that subsequent optimization commits show proper diffs
against the original algorithm code.
Renames (SimplnxCore):
  FillBadData -> FillBadDataBFS
  IdentifySample -> IdentifySampleBFS
  ComputeBoundaryCells -> ComputeBoundaryCellsDirect
  ComputeFeatureNeighbors -> ComputeFeatureNeighborsDirect
  ComputeSurfaceAreaToVolume -> ComputeSurfaceAreaToVolumeDirect
  ComputeSurfaceFeatures -> ComputeSurfaceFeaturesDirect
  SurfaceNets -> SurfaceNetsDirect
  QuickSurfaceMesh -> QuickSurfaceMeshDirect
  DBSCAN -> DBSCANDirect
  ComputeKMedoids -> ComputeKMedoidsDirect
  MultiThresholdObjects -> MultiThresholdObjectsDirect
Renames (OrientationAnalysis):
  BadDataNeighborOrientationCheck -> BadDataNeighborOrientationCheckWorklist
No logic changes. InputValues structs and filter classes unchanged.
…ntationAnalysis
Replace per-element DataStore access with chunked bulk I/O
(copyIntoBuffer/copyFromBuffer) across 60+ algorithm files to eliminate
virtual dispatch overhead and HDF5 chunk thrashing when arrays are backed
by out-of-core storage.
--- Architecture ---
DispatchAlgorithm pattern (Direct/Scanline):
11 algorithms gain a base dispatcher class that selects between an
in-core Direct implementation and an OOC Scanline variant at runtime
based on IsOutOfCore()/ForceOocAlgorithm():
SimplnxCore: ComputeBoundaryCells, ComputeFeatureNeighbors,
ComputeKMedoids, ComputeSurfaceAreaToVolume, ComputeSurfaceFeatures,
DBSCAN, MultiThresholdObjects, QuickSurfaceMesh, SurfaceNets
OrientationAnalysis: BadDataNeighborOrientationCheck, ComputeIPFColors
ComputeGBCDPoleFigure dispatches directly from its filter executeImpl().
Connected Component Labeling (CCL) pattern:
4 algorithms gain a two-pass CCL variant as an OOC alternative to
random-access BFS/DFS flood-fill:
SimplnxCore: FillBadData (BFS/CCL), IdentifySample (BFS/CCL)
OrientationAnalysis: EBSDSegmentFeatures, CAxisSegmentFeatures
The CCL engine in SegmentFeatures::executeCCL() scans voxels in Z-Y-X
order with a 2-slice rolling buffer and UnionFind equivalence tracking,
giving sequential I/O access patterns. Supports Face and FaceEdgeVertex
connectivity with optional periodic boundaries.
--- New utility infrastructure ---
- UnionFind (src/simplnx/Utilities/UnionFind.hpp):
Vector-based disjoint set with union-by-rank and path-halving.
- SliceBufferedTransfer (src/simplnx/Utilities/SliceBufferedTransfer.hpp):
Z-slice buffered tuple transfer for propagating neighbor voxel data
used by ErodeDilate, FillBadData, MinNeighbors, and ReplaceElements.
- TupleTransfer batch API (Filters/Algorithms/TupleTransfer.hpp):
Batch bulk I/O methods for QuickSurfaceMesh and SurfaceNets mesh
generation attribute transfer.
- SegmentFeaturesTestUtils.hpp:
Shared test builder functions for segmentation filter test suites.
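For reference, a vector-based disjoint set with union-by-rank and path halving can be written in a few lines. This is a generic sketch of the named technique (`UnionFindSketch` is a hypothetical class), not the contents of UnionFind.hpp.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

class UnionFindSketch
{
public:
  // Adds a new singleton set and returns its label.
  std::size_t makeSet()
  {
    m_Parent.push_back(m_Parent.size());
    m_Rank.push_back(0);
    return m_Parent.size() - 1;
  }

  // Returns the root of x. Path halving: every visited node is
  // re-pointed at its grandparent on the way up.
  std::size_t find(std::size_t x)
  {
    assert(x < m_Parent.size());
    while(m_Parent[x] != x)
    {
      m_Parent[x] = m_Parent[m_Parent[x]];
      x = m_Parent[x];
    }
    return x;
  }

  // Merges the sets of a and b, attaching the shallower tree under
  // the deeper one (union by rank).
  void unite(std::size_t a, std::size_t b)
  {
    a = find(a);
    b = find(b);
    if(a == b)
    {
      return;
    }
    if(m_Rank[a] < m_Rank[b])
    {
      std::swap(a, b);
    }
    m_Parent[b] = a;
    if(m_Rank[a] == m_Rank[b])
    {
      m_Rank[a]++;
    }
  }

private:
  std::vector<std::size_t> m_Parent;
  std::vector<unsigned char> m_Rank;
};
```

In a two-pass CCL scan, the first pass calls `unite()` whenever two provisional labels touch, and the second pass replaces each label with `find(label)`.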
--- Bulk I/O conversions (existing algorithms) ---
Core utilities:
DataArrayUtilities (ImportFromBinaryFile, AppendData, CopyData,
mirror ops), DataGroupUtilities (RemoveInactiveObjects),
ClusteringUtilities (RandomizeFeatureIds), GeometryHelpers
(FindElementsContainingVert, FindElementNeighbors),
AlignSections (Z-slice OOC transfer path),
ImageRotationUtilities (source slab caching for nearest-neighbor),
TriangleUtilities (bulk-load triangles/labels for winding repair),
H5DataStore (streaming row-batch FillOocDataStore replacing full-
dataset allocation)
SimplnxCore algorithms:
AlignSectionsFeatureCentroid, ComputeEuclideanDistMap,
ComputeFeatureCentroids, ComputeFeatureClustering, ComputeFeatureSizes,
CropImageGeometry, ErodeDilateBadData, ErodeDilateCoordinationNumber,
ErodeDilateMask, RegularGridSampleSurfaceMesh, RequireMinimumSizeFeatures,
ReplaceElementAttributesWithNeighborValues, ScalarSegmentFeatures,
WriteAvizoRectilinearCoordinate, WriteAvizoUniformCoordinate
OrientationAnalysis algorithms:
AlignSectionsMisorientation, AlignSectionsMutualInformation,
ComputeAvgCAxes, ComputeAvgOrientations, ComputeCAxisLocations,
ComputeFeatureNeighborCAxisMisalignments,
ComputeFeatureReferenceCAxisMisorientations,
ComputeFeatureReferenceMisorientations, ComputeGBCD,
ComputeGBCDMetricBased, ComputeKernelAvgMisorientations,
ComputeTwinBoundaries, ConvertOrientations, MergeTwins,
NeighborOrientationCorrelation, RotateEulerRefFrame, WriteGBCDGMTFile,
WriteGBCDTriangleData, WritePoleFigure
EBSD readers:
ReadAngData, ReadCtfData, ReadH5Ebsd, ReadH5EspritData
--- Test infrastructure ---
- UnitTestCommon: ExpectedStoreType()/RequireExpectedStoreType() helpers,
TestFileSentinel reference-counted decompression, CompareDataArrays
rewritten with chunked bulk I/O for OOC-safe comparison.
- 29 test files updated with OOC dual-path testing:
ForceOocAlgorithmGuard + GENERATE(from_range(k_ForceOocTestValues))
runs every test case in both in-core and forced-OOC modes.
… bugs
Add CreateResolvedDataStore utility that runs the IOCollection format
resolver before creating a DataStore, matching the path that filter
actions use. Update test builder functions to call it so that
test-constructed arrays become OOC stores when the OOC plugin is active.
Fix three bugs in the OOC ComputeAvgOrientations Rodrigues average:
- Allow featureId 0 in accumulation (matching architecture branch)
- Start normalization loop from featureId 0
- Add missing continue for zero-count features to avoid divide-by-zero
Fix stale GetIOCollection API call in UnitTestCommon (shared_ptr to ref).
Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
…mized algorithms
Adds extensive documentation across all out-of-core optimized filter
algorithms explaining what each algorithm does and why the OOC variant
works the way it does. Targets readers with no prior OOC knowledge.
- Headers: Doxygen @class, @brief, @param on all classes, methods,
InputValues structs, and member variables
- Source files: file-level overviews, Doxygen on operator()(), and inline
comments explaining rolling windows, buffer strategies, dispatch logic,
and OOC rationale
- Filter docs: Algorithm sections with In-Core/Out-of-Core/Performance
subsections added to ~45 filter markdown files
- Key utilities: SliceBufferedTransfer.hpp and TupleTransfer.hpp
documented as core OOC infrastructure
…writes
WritePoleFigure was missed in the d4f2cce OOC-optimization sweep and was
timing out (>300 s) on OOC-backed inputs under the OocOnly test path. Four
Catch2 tests exercising it (Discrete, Discrete-Masked, Color, Color-Masked)
were failing with Timeout on OOC-Release ctest.
Hot paths replaced:
1. Cell-phase + Euler-angle gather loops (algorithm body). Two passes per
phase over `numPoints` cells used per-element reads of `phases[i]` and
`eulerAngles[i * 3 + {0,1,2}]`: one HDF5 hit per element on OOC stores, so
4 hits per cell and 4N hits per pass. Replaced with a chunk-sequential
stream: 64K-tuple `std::vector` buffers filled via `copyIntoBuffer()` once
per chunk; the inner loops iterate over the in-memory buffers. Peak
auxiliary memory is bounded to ~1 MB regardless of input size; it is NOT
an O(N) bulk allocation.
2. Intensity-plot write-back (`std::copy(..., array.begin())` on three
`Float64Array` outputs per phase). Per-element operator[] writes on
OOC-backed `Float64Array` targets are one HDF5 write per pixel; a 512x512
image emits 262K hits per plot, ~786K per phase. Replaced with
`copyFromBuffer()` per output array: one write per image.
3. Composite RGB image pack (for `figures.size() == 3` phases). The loop
that interleaved RGBA source bytes into RGB-packed `UInt8Array` output
wrote three `imageData[...] = ...` elements per pixel. It now builds the
packed RGB buffer in a local `std::vector<uint8>` first, then emits one
`copyFromBuffer()` call for the whole image.
No algorithm changes. `metaDataArrayRef[phase] = ...` is kept as-is: it is
a single write keyed by phase count, not a hot loop.
Test results on OOC-Release (InCore continues to pass; no in-core
regression):
WritePoleFigureFilter-Discrete          2.88 s (was: Timeout)
WritePoleFigureFilter-Discrete-Masked   4.39 s (was: Timeout)
WritePoleFigureFilter-Color             3.29 s (was: Timeout)
WritePoleFigureFilter-Color-Masked      4.74 s (was: Timeout)
All four now pass within the 300 s ctest timeout.
Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
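The third hot-path fix (pack locally, then write once) can be sketched as below. `PackRgbaToRgb` is a hypothetical helper; the single bulk write that would follow is indicated only in a comment because the real `copyFromBuffer()` call needs a simplnx store.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Drops the alpha channel: RGBA RGBA ... becomes RGB RGB ... in a local
// buffer, so the destination store is touched once instead of three
// times per pixel.
std::vector<std::uint8_t> PackRgbaToRgb(const std::vector<std::uint8_t>& rgba)
{
  assert(rgba.size() % 4 == 0);
  const std::size_t numPixels = rgba.size() / 4;
  std::vector<std::uint8_t> rgb(numPixels * 3);
  for(std::size_t i = 0; i < numPixels; i++)
  {
    rgb[i * 3 + 0] = rgba[i * 4 + 0];
    rgb[i * 3 + 1] = rgba[i * 4 + 1];
    rgb[i * 3 + 2] = rgba[i * 4 + 2];
  }
  // A caller would now issue one bulk write for the whole image,
  // e.g. a copyFromBuffer(0, rgb)-style call on the output store.
  return rgb;
}
```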
* Add download_test_data entries for fill_bad_data_exemplars.tar.gz and identify_sample_exemplars.tar.gz to SimplnxCore test CMakeLists
* Add download_test_data entries for segment_features_exemplars.tar.gz to both SimplnxCore and OrientationAnalysis test CMakeLists
* Fixes CI failures where FillBadData, IdentifySampleFilter, ScalarSegmentFeatures, CAxisSegmentFeatures, and EBSDSegmentFeatures tests could not locate their exemplar archives
Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
* Replace per-row copyIntoBuffer/copyFromBuffer pair (one pair per (z,y) scanline) with a K=32 Z-slice batched slab I/O pattern
* Read K full source Z-slices per bulk call, extract the crop region via std::memcpy per row, and write K destination Z-slices in one bulk call
* Working-set bound is O(n^(2/3)): (k_ZSliceBatch + 2) * X * Y * numComps * sizeof(T) bytes
Tests: 16/16 pass on OOC build.
Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
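The Z-slice batched crop can be sketched on plain vectors as follows. `Dims`, `Box`, and `CropBatched` are hypothetical names, single-component data is assumed, and the `std::copy_n` / `insert` calls stand in for the bulk read and bulk write on a store.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <vector>

struct Dims
{
  std::size_t x, y, z;
};

// Half-open crop box: [x0, x1) x [y0, y1) x [z0, z1).
struct Box
{
  std::size_t x0, x1, y0, y1, z0, z1;
};

// Crops a ZYX-ordered volume, handling k Z-slices per pass.
template <typename T>
std::vector<T> CropBatched(const std::vector<T>& src, const Dims& dims, const Box& box, std::size_t k)
{
  static_assert(std::is_trivially_copyable_v<T>, "memcpy path needs a trivially copyable type");
  assert(k > 0 && src.size() == dims.x * dims.y * dims.z);
  assert(box.x0 < box.x1 && box.x1 <= dims.x);
  assert(box.y0 < box.y1 && box.y1 <= dims.y);
  assert(box.z0 < box.z1 && box.z1 <= dims.z);

  const std::size_t outX = box.x1 - box.x0;
  const std::size_t outY = box.y1 - box.y0;
  const std::size_t sliceSize = dims.x * dims.y;

  std::vector<T> out;
  out.reserve(outX * outY * (box.z1 - box.z0));
  std::vector<T> srcSlab(k * sliceSize);
  std::vector<T> dstSlab(k * outX * outY);

  for(std::size_t z = box.z0; z < box.z1; z += k)
  {
    const std::size_t nz = std::min(k, box.z1 - z);
    // One bulk read of nz full source slices (stand-in for copyIntoBuffer).
    std::copy_n(src.begin() + static_cast<std::ptrdiff_t>(z * sliceSize), nz * sliceSize, srcSlab.begin());
    // In-memory extraction: one memcpy per cropped row.
    for(std::size_t dz = 0; dz < nz; dz++)
    {
      for(std::size_t y = 0; y < outY; y++)
      {
        const T* from = srcSlab.data() + dz * sliceSize + (box.y0 + y) * dims.x + box.x0;
        T* to = dstSlab.data() + (dz * outY + y) * outX;
        std::memcpy(to, from, outX * sizeof(T));
      }
    }
    // One bulk write of nz destination slices (stand-in for copyFromBuffer).
    out.insert(out.end(), dstSlab.begin(), dstSlab.begin() + static_cast<std::ptrdiff_t>(nz * outX * outY));
  }
  return out;
}
```

The two slab buffers are the only large allocations, which is where the O(slab) working-set bound comes from.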
* Remove the O(n) dense neighborsVoxelIndex array in favor of sparse parallel vectors (changedVoxels + neighborVoxelIdxs), saving ~16 GB on CT_align-scale volumes
* Hoist slabBuf above the convergence while-loop so it is no longer re-allocated every iteration
* Delete dead class RequireMinimumSizeFeaturesTransferDataImpl
* Add ChunkedTransferWorker<T> doing Z-batched bulk I/O for the transfer phase, dispatched via ExecuteParallelFunction with type-based dispatch and parallelized per cell-level array via ParallelTaskAlgorithm; 64 MB/task/array slab budget
Tests: 1/1 pass. Unit-test time unchanged at 0.23 s.
Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
* Replace the dense triNewIndex (8 B per triangle) with a triMask bitset (1 bit per triangle) plus a sparse triPrefixSum popcount table for ~6x memory reduction on the triangle side
* Keep vertNewIndex as a dense 8 B per vertex map to preserve the invariant that triangle 0's three fresh vertices are assigned compact indices 0, 1, 2 (other filters depend on this ordering)
* Stream all passes with chunked bulk I/O; vertex copies use bulk source reads + per-vertex dest writes (required by the ordering invariant), triangle copies use bulk reads and bulk writes
Tests: 4/4 pass.
Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
* Mirror the nearest-neighbor slab-cache pattern in the trilinear interpolation path with a +/-2 Z-slice margin so all 8 corner neighbors remain resident, avoiding per-voxel random reads
* Slide the slab cache window when consecutive output slices shift the needed source Z range: memmove the surviving slices in the buffer and read only the delta slices instead of re-reading the full slab
* Parallelize the inner output-row loop for both the nearest-neighbor and trilinear paths via ParallelDataAlgorithm; threads share the cached slab (read-only) and write disjoint Y-row ranges of the local output slice buffer
* Replace per-element at()/setValue() calls in the node-geometry convert path with 16 K-vertex chunked copyIntoBuffer/copyFromBuffer bulk I/O
CT_align (1.97 B-voxel trilinear rotation): 133 s -> 20 s (~6.6x).
Tests: 14/14 pass on both in-core and OOC builds.
Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
* Replace per-triangle getFaceCoordinates() random reads with a chunked pipeline: bulk-read 65K triangle connectivity indices per pass, determine the referenced vertex-index span, and bulk-load that vertex range into a local buffer
* Parallelize the area compute on the local buffer (reads/writes touch plain C++ arrays only, so threads are safe: no DataStore access inside the parallel region)
* Bulk-write the chunk's area output in one call
* Guard against pathological meshes whose vertex indexing spans more than 16M entries per chunk with a serial per-triangle fallback; filter-generated meshes stay well under this cap
CT_align (mesh-scale triangle areas): 26 s -> <1 s (~26x).
Tests: 1/1 pass on both in-core and OOC builds.
Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
… inputs
Bump the FeatureId (and RectGrid element-size) bulk-read chunk size from 64K to 256K tuples. The voxel-counting pass is I/O-bound on OOC-backed stores; larger chunks reduce copyIntoBuffer() round-trip overhead on datasets with tens of thousands of chunks while keeping per-chunk working-set memory bounded (1 MB for the int32 buffer, and an additional 1 MB for the float32 element-size buffer on the RectGrid path).
CT_align (1.97 B voxels, Image path): 14 s -> 13 s.
Tests: 9/9 pass on the OOC build.
Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
* Rewrite the markdown Algorithm section to explain the crop as a 3D subarray copy from first principles, teach the Z-slice-batched bulk I/O strategy step-by-step, and quantify why batching by K Z-slices collapses HDF5 chunk-op overhead
* Add a Doxygen block on CropImageGeomDataArray describing the per-pass pipeline (bulk read slab -> in-memory extract -> bulk write) and the O(slab), non-O(volume) memory bound
Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
Rewrite the Algorithm section so a reader unfamiliar with the filter can follow the two-phase pipeline end-to-end:
* Phase 1 (feature removal): motivate why small features get pruned, describe the 64K-tuple chunked scan, and explain the "skip write when chunk unchanged" optimization
* Phase 2 (gap fill by majority-vote): teach the rolling 3-slice buffer scan, the sparse parallel vectors that replace the old O(n) dense index array, the per-array ChunkedTransferWorker with its +/-1 Z-margin slab read + interior-only write-back, and the outer ParallelTaskAlgorithm across arrays
* Add a memory-footprint summary clarifying that every data structure is O(slice) or O(iteration bad count), never O(volume)
Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
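The rolling 3-slice buffer scan mentioned for Phase 2 can be illustrated with a minimal sketch. This is not the filter's code: `SliceReader` and `CountFillableVoxels` are hypothetical names, the slice reader stands in for one bulk read per Z-slice, and the per-voxel work is reduced from a majority vote to a simple "has at least one good face neighbor" check.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for one bulk Z-slice read (for example a single copyIntoBuffer call).
using SliceReader = std::function<void(std::size_t z, std::vector<std::int32_t>& out)>;

// Counts voxels with id <= 0 that touch at least one face neighbor with id > 0.
// Only three Z-slices (prev/cur/next) are resident at any time.
std::size_t CountFillableVoxels(std::size_t nx, std::size_t ny, std::size_t nz, const SliceReader& readSlice)
{
  const std::size_t sliceSize = nx * ny;
  if(sliceSize == 0 || nz == 0)
  {
    return 0;
  }
  std::vector<std::int32_t> prev(sliceSize), cur(sliceSize), next(sliceSize);
  readSlice(0, cur);
  std::size_t count = 0;
  for(std::size_t z = 0; z < nz; z++)
  {
    const bool hasPrev = z > 0;
    const bool hasNext = z + 1 < nz;
    if(hasNext)
    {
      readSlice(z + 1, next); // one bulk read per slice, never per voxel
    }
    for(std::size_t y = 0; y < ny; y++)
    {
      for(std::size_t x = 0; x < nx; x++)
      {
        const std::size_t i = y * nx + x;
        if(cur[i] > 0)
        {
          continue;
        }
        const bool touchesGood = (hasPrev && prev[i] > 0) || (hasNext && next[i] > 0) || (y > 0 && cur[i - nx] > 0) ||
                                 (y + 1 < ny && cur[i + nx] > 0) || (x > 0 && cur[i - 1] > 0) || (x + 1 < nx && cur[i + 1] > 0);
        count += touchesGood ? 1 : 0;
      }
    }
    // Rotate the window: cur becomes prev, next becomes cur.
    std::swap(prev, cur);
    std::swap(cur, next);
  }
  return count;
}
```

Memory stays at three slices regardless of the Z extent, which is the O(slice) bound the commit message refers to.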
Add a new Algorithm section that teaches the filter from scratch:
* Explain conceptually which triangles are kept (all three vertices inside the user-specified node-type range) and what the output geometry looks like (compact vertex list, compact triangle list, remapped connectivity)
* Document the downstream-invariant that forces vertNewIndex to stay a dense per-vertex map (triangle 0's three fresh vertices land at new indices 0..2 in traversal order)
* Explain the triMask bitset + triPrefixSum sparse popcount table that replaces the legacy dense triangle map for ~6.4x memory savings, and how remapIndex() turns an O(1) table lookup plus a small popcount into each triangle's compact new index
* Walk the six streaming passes (vertex-ok mask, triangle scan + vertex-index assignment, prefix-sum build, vertex copy, triangle remap copy, per-vertex/per-triangle attached-array copy)
* Summarize the memory footprint so the vertNewIndex dominance is clear on very large meshes
Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
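A bitset plus a sparse prefix-sum table is a standard "rank" structure, and a small sketch may help show how a table lookup plus a short popcount yields a compact index. Everything below is illustrative: `CompactIndexMap`, `k_WordsPerBlock`, and the block size of 8 words are assumptions, not the layout of the filter's triMask / triPrefixSum.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed granularity: one prefix-table entry per 8 x 64 = 512 bits.
constexpr std::size_t k_WordsPerBlock = 8;

// Portable popcount (clears the lowest set bit until the word is zero).
inline std::uint64_t PopCount(std::uint64_t v)
{
  std::uint64_t c = 0;
  for(; v != 0; v &= v - 1)
  {
    ++c;
  }
  return c;
}

// Hypothetical mask-plus-rank structure: 1 bit per element, sparse prefix sums.
struct CompactIndexMap
{
  std::vector<std::uint64_t> words;  // keep-mask
  std::vector<std::uint64_t> prefix; // number of kept elements before each block

  explicit CompactIndexMap(std::size_t n)
  : words((n + 63) / 64, 0)
  {
  }

  void set(std::size_t i)
  {
    words[i / 64] |= (std::uint64_t{1} << (i % 64));
  }

  // Build the sparse table once after all bits are set.
  void build()
  {
    prefix.assign((words.size() + k_WordsPerBlock - 1) / k_WordsPerBlock, 0);
    std::uint64_t running = 0;
    for(std::size_t w = 0; w < words.size(); w++)
    {
      if(w % k_WordsPerBlock == 0)
      {
        prefix[w / k_WordsPerBlock] = running;
      }
      running += PopCount(words[w]);
    }
  }

  // Compact index of kept element i: table lookup plus a bounded popcount.
  std::uint64_t remapIndex(std::size_t i) const
  {
    const std::size_t word = i / 64;
    const std::size_t block = word / k_WordsPerBlock;
    std::uint64_t rank = prefix[block];
    for(std::size_t w = block * k_WordsPerBlock; w < word; w++)
    {
      rank += PopCount(words[w]);
    }
    const std::uint64_t below = words[word] & ((std::uint64_t{1} << (i % 64)) - 1);
    return rank + PopCount(below);
  }
};
```

The lookup cost is bounded by the block size, so the table can be made sparser or denser to trade memory against popcount work.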
Add a comprehensive Algorithm section covering both the node-geometry and image-geometry paths from first principles:
* Describe how every supported transform (rotation, scale, manual matrix, etc.) collapses to a single 4x4 homogeneous matrix M and how M composes with prior transforms
* Node geometries: walk the 16K-vertex chunked read -> multiply -> write pipeline and explain why in-place topology+attribute data is correct
* Image geometries: teach the re-gridding problem (why output voxels need to look up source values via M^-1), and contrast nearest-neighbor vs. trilinear interpolation
* Z-slice slab cache: derive analytically the per-output-slice source-Z range and the +/-2 trilinear margin
* Sliding-window slab updates via memmove + delta copyIntoBuffer reads when consecutive output slices overlap heavily
* Intra-slice parallelism via ParallelDataAlgorithm with thread safety argued from shared-read + disjoint-write access patterns and per-thread pValues scratch
Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
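The sliding-window slab update can be sketched as follows. This is a simplified illustration under stated assumptions: `SlabCache`, `slideTo`, and the `ReadSlices` callback are hypothetical names, the callback stands in for a bulk read of whole Z-slices, and only forward slides reuse cached data (anything else triggers a full reload).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <functional>
#include <vector>

// Hypothetical cache holding source Z-slices [zBegin, zEnd) in one flat buffer.
struct SlabCache
{
  // Stand-in for a bulk read of `count` whole slices starting at `firstSlice`.
  using ReadSlices = std::function<void(std::size_t firstSlice, std::size_t count, float* dest)>;

  std::size_t sliceSize;
  std::size_t zBegin = 0;
  std::size_t zEnd = 0;
  std::size_t slicesRead = 0; // bookkeeping only, to show how little is re-read
  std::vector<float> buffer;

  explicit SlabCache(std::size_t elementsPerSlice)
  : sliceSize(elementsPerSlice)
  {
  }

  void slideTo(std::size_t newBegin, std::size_t newEnd, const ReadSlices& read)
  {
    assert(newBegin < newEnd);
    const std::size_t newCount = newEnd - newBegin;
    std::size_t kept = 0;
    if(newBegin >= zBegin && newBegin < zEnd)
    {
      // Forward slide with overlap: move the surviving slices to the front.
      kept = std::min(zEnd, newEnd) - newBegin;
      std::memmove(buffer.data(), buffer.data() + (newBegin - zBegin) * sliceSize, kept * sliceSize * sizeof(float));
    }
    buffer.resize(newCount * sliceSize);
    if(kept < newCount)
    {
      // Read only the delta slices that are not already resident.
      read(newBegin + kept, newCount - kept, buffer.data() + kept * sliceSize);
      slicesRead += newCount - kept;
    }
    zBegin = newBegin;
    zEnd = newEnd;
  }
};
```

When consecutive output slices need nearly the same source range, each slide reads one or two new slices instead of the whole slab.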
Add an Algorithm section that walks the chunked pipeline step-by-step for a reader unfamiliar with the optimization:
* Establish the closed-form per-triangle math (0.5 * |(A-B) x (A-C)|) so there is no confusion about the compute
* Quantify the naive access pattern (six OOC chunk-cache hits per triangle, hundreds of millions of virtual dispatches on CT-scale meshes) to motivate the chunking
* Walk the five-step per-chunk pipeline: bulk triangle connectivity read -> analyze vertex-index span -> span-bounded bulk vertex coords read -> parallel compute on plain buffers -> bulk area write
* Explain the 16M-vertex span cap and the serial per-triangle fallback for pathological meshes
* Summarize memory footprint (bounded O(chunk), not O(mesh))
Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
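A serial sketch of one chunk of that pipeline, assuming an in-memory vector in place of the OOC vertex store: `ComputeChunkAreas` is a hypothetical helper, the parallel compute step is shown as a plain loop, and the span cap with its fallback is omitted.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// tris: three vertex indices per triangle for one chunk.
// allVerts: xyz per vertex; an in-memory stand-in for the vertex store (assumption).
std::vector<double> ComputeChunkAreas(const std::vector<std::uint64_t>& tris, const std::vector<float>& allVerts)
{
  std::vector<double> areas(tris.size() / 3, 0.0);
  if(tris.empty())
  {
    return areas;
  }
  // Step 2: find the vertex-index span referenced by this chunk.
  const auto range = std::minmax_element(tris.begin(), tris.end());
  const std::uint64_t vMin = *range.first;
  const std::uint64_t vMax = *range.second;
  assert((vMax + 1) * 3 <= allVerts.size());
  // Step 3: one bulk copy of just that span into a local buffer.
  const std::vector<float> local(allVerts.begin() + static_cast<std::ptrdiff_t>(vMin * 3),
                                 allVerts.begin() + static_cast<std::ptrdiff_t>((vMax + 1) * 3));
  // Step 4: compute on plain buffers only; area = 0.5 * |(A-B) x (A-C)|.
  for(std::size_t t = 0; t < areas.size(); t++)
  {
    const float* a = &local[(tris[3 * t + 0] - vMin) * 3];
    const float* b = &local[(tris[3 * t + 1] - vMin) * 3];
    const float* c = &local[(tris[3 * t + 2] - vMin) * 3];
    const double ab[3] = {double(a[0]) - b[0], double(a[1]) - b[1], double(a[2]) - b[2]};
    const double ac[3] = {double(a[0]) - c[0], double(a[1]) - c[1], double(a[2]) - c[2]};
    const double cx = ab[1] * ac[2] - ab[2] * ac[1];
    const double cy = ab[2] * ac[0] - ab[0] * ac[2];
    const double cz = ab[0] * ac[1] - ab[1] * ac[0];
    areas[t] = 0.5 * std::sqrt(cx * cx + cy * cy + cz * cz);
  }
  return areas; // Step 5 would bulk-write this vector in one call.
}
```

Because the inner loop touches only `local` and `areas`, it could be split across threads without any store access inside the parallel region.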
Rewrite the Algorithm section to fully teach the filter:
* State what the three output arrays (NumElements, Volume, EquivalentDiameter) represent and show the spherical/circular diameter formulas
* Image Geometry path: explain the uniform-voxel-volume shortcut that lets the filter skip per-voxel volume computations, then walk the 256K-tuple chunked count pass and the per-feature output pass; cover the 2D fallback rules and the two-empty-dimensions preflight error
* RectGrid path: contrast with the Image case, describe the lockstep FeatureIds + elementSizes chunked read, and explain why Kahan summation is needed to avoid float32 rounding error on billion-voxel volumes
* Justify the 256K chunk size choice based on HDF5 chunk-lookup overhead vs. L2 cache residency
* Summarize memory footprint
Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
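For readers who want the two numerical ideas in isolation, here is a small sketch of compensated (Kahan) summation in float32 together with the standard equivalent-diameter formulas, d = 2 * cbrt(3V / (4 * pi)) for a sphere of volume V and d = 2 * sqrt(A / pi) for a circle of area A. The function names are illustrative and are not taken from the filter.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

constexpr double k_Pi = 3.14159265358979323846;

// Compensated summation: carries the low-order bits lost by each float add.
// Note: aggressive floating-point flags (e.g. fast-math) can optimize the compensation away.
float KahanSum(const std::vector<float>& values)
{
  float sum = 0.0f;
  float compensation = 0.0f;
  for(const float v : values)
  {
    const float y = v - compensation;
    const float t = sum + y;
    compensation = (t - sum) - y; // what was lost when adding y to sum
    sum = t;
  }
  return sum;
}

// Diameter of the sphere with the given volume.
double EquivalentSphereDiameter(double volume)
{
  return 2.0 * std::cbrt(3.0 * volume / (4.0 * k_Pi));
}

// Diameter of the circle with the given area (2D case).
double EquivalentCircleDiameter(double area)
{
  return 2.0 * std::sqrt(area / k_Pi);
}
```

A plain float accumulator drifts noticeably once the running sum is many orders of magnitude larger than each addend, which is the situation on billion-voxel volumes.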
Three sites in the algorithm multiplied double-precision resolution and angle values by k_PiOver180F (a float constant). Float promotion to double preserves the quantized float bits rather than recovering the true M_PI/180.0, introducing a ~1e-10 deviation from the legacy SIMPL algorithm (which uses double k_PiOver180). Over ~756k triangles times ~2300 symmetry operations the deviation flipped two near-boundary triangles in/out of the selected set, shifting a handful of distribution bin values by ~3e-2.
Switch the three multiplications to the existing k_PiOver180D double constant so the resolution thresholds and fixed-misorientation angle are computed at full double precision.
The stored 6_6_find_gbcd_metric_based.tar.gz exemplar was generated by the original float-precision DREAM3D FindGBCDMetricBased filter and no longer matches the simplnx algorithm after this fix. Publish a fresh exemplar from the double-precision legacy pipeline and repoint the tests at it.
* Rename archive and top-level folder from 6_6_find_gbcd_metric_based to compute_gbcd_metric_based (drops the legacy 6_6_ prefix in accordance with current archive-naming conventions).
* Drop the 6_6_ prefix from the stored .dat exemplar filenames; input .dream3d filename follows the folder name.
* ComputeGBPDMetricBasedTest's InValid section reuses the GBCD archive (for crystal-structures and mesh input); update its paths too.
* CMakeLists.txt download_test_data entry bumped to the new archive name and SHA512.
Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
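The size of that deviation is easy to reproduce in isolation. The sketch below reuses the constant names from the commit message, but their definitions here are assumptions made for illustration, as are the two helper functions.

```cpp
#include <cassert>
#include <cmath>

// Assumed definitions; the real constants live in the simplnx headers.
constexpr double k_PiOver180D = 3.14159265358979323846 / 180.0;
constexpr float k_PiOver180F = static_cast<float>(k_PiOver180D);

// The float constant is promoted to double, but its value is still the
// nearest float to pi/180, not pi/180 at double precision.
double ToRadiansWithFloatConstant(double degrees)
{
  return degrees * k_PiOver180F;
}

double ToRadiansWithDoubleConstant(double degrees)
{
  return degrees * k_PiOver180D;
}
```

For one degree the two results differ by roughly 1e-10 radians, which is harmless for most uses but enough to move a value across a hard threshold when it sits very close to the boundary.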
Adds a simple erase-by-key helper on Preferences to support callers that need to restore a preference to its "absent" state (rather than overwrite it with a specific value).
Introduces the ArgumentType and help-text surface for a new --ooc-memory-budget / -b flag. This commit only wires parsing and help; the override behavior is applied in a follow-up commit.
Applies the parsed --ooc-memory-budget value to the in-memory Preferences object before plugins load so SimplnxOocPlugin picks it up at construction. A stack-local RAII guard restores the pre-override preference state before the Application singleton saves preferences on shutdown, so the user's saved preference is untouched.
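The override-then-restore behavior can be sketched with a small RAII guard. This is an illustration only: a `std::map` stands in for the Preferences object, and `ScopedPreferenceOverride` and the key string are hypothetical. Restoring to the "absent" state is where an erase-by-key helper is needed.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Stand-in for the real Preferences object (assumption of this sketch).
using PreferenceMap = std::map<std::string, std::uint64_t>;

// Applies an override on construction and restores the prior state on destruction.
class ScopedPreferenceOverride
{
public:
  ScopedPreferenceOverride(PreferenceMap& prefs, std::string key, std::uint64_t value)
  : m_Prefs(prefs)
  , m_Key(std::move(key))
  {
    const auto iter = m_Prefs.find(m_Key);
    m_HadValue = iter != m_Prefs.end();
    m_Previous = m_HadValue ? iter->second : 0;
    m_Prefs[m_Key] = value;
  }

  ~ScopedPreferenceOverride()
  {
    if(m_HadValue)
    {
      m_Prefs[m_Key] = m_Previous; // put the saved value back
    }
    else
    {
      m_Prefs.erase(m_Key); // the key was absent before, so make it absent again
    }
  }

  ScopedPreferenceOverride(const ScopedPreferenceOverride&) = delete;
  ScopedPreferenceOverride& operator=(const ScopedPreferenceOverride&) = delete;

private:
  PreferenceMap& m_Prefs;
  std::string m_Key;
  bool m_HadValue = false;
  std::uint64_t m_Previous = 0;
};
```

Declaring the guard as a stack local in the entry point means the restore runs before anything later in shutdown persists the preferences.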
…g garbage
std::stod("nan") returns NaN, which passed the previous gb <= 0.0 check
(NaN compares false to everything) and then became undefined behavior when
cast to uint64. std::stod("8abc") also succeeded with a partial parse,
silently accepting junk input. Use the pos out-parameter to require the
entire argument be consumed, and std::isfinite to reject NaN and infinity.
std::isfinite requires <cmath>, which was being picked up only transitively via other STL headers. Include it directly to avoid future breakage if the transitive chain changes. Drop <stdexcept>; the only catch handler uses const std::exception& which is available via <exception>.
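A minimal sketch of the stricter parsing described in the two commits above. `ParseMemoryBudgetGb` is a hypothetical name, and treating the argument as gigabytes of 1024^3 bytes is an assumption; only the `pos` check, the `std::isfinite` check, and the direct `<cmath>` include come from the commit text.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <exception>
#include <string>

// Returns true and fills bytesOut only when the whole string is a finite, positive number.
bool ParseMemoryBudgetGb(const std::string& text, std::uint64_t& bytesOut)
{
  try
  {
    std::size_t pos = 0;
    const double gb = std::stod(text, &pos);
    if(pos != text.size())
    {
      return false; // partial parse such as "8abc"
    }
    if(!std::isfinite(gb) || gb <= 0.0)
    {
      return false; // NaN compares false to everything, so test it explicitly
    }
    constexpr double k_BytesPerGb = 1024.0 * 1024.0 * 1024.0; // assumed unit
    const double bytes = gb * k_BytesPerGb;
    if(bytes >= 18446744073709551616.0) // 2^64: the cast below would be undefined
    {
      return false;
    }
    bytesOut = static_cast<std::uint64_t>(bytes);
    return true;
  } catch(const std::exception&)
  {
    return false; // std::invalid_argument or std::out_of_range from std::stod
  }
}
```

The range check before the cast covers the same class of undefined behavior as the NaN case: converting a double that does not fit in the target integer type.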
…x bool-mask bulk I/O
Three logically related changes that finish reconciling the rebased branch with Nathan Young's PR BlueQuartzSoftware#1590 (ENH: Standardize 2D Image Handling) and fix one resulting OOC perf cliff:
1. Wholesale port of PR BlueQuartzSoftware#1590's two algorithm rewrites into the renamed in-core dispatch variants:
   - ComputeFeatureNeighborsDirect.cpp gets Nathan's templated ComputeFeatureNeighborsFunctor<ImageDimensionStateT> and ProcessVoxels dispatcher in place of the OOC-commit-era custom in-core logic.
   - IdentifySampleBFS.cpp gets Nathan's templated IdentifySampleFunctor plus the corresponding ProcessVoxels dispatch.
   The Scanline OOC variant of ComputeFeatureNeighbors is updated to reference the namespaced VoxelNeighbors<Image3D>:: constants while preserving its Z-slice rolling-window bulk-I/O structure.
2. Reapply PR BlueQuartzSoftware#1590's constexpr/const cleanups across the algorithm files where the rebase took --theirs (the OOC commit version) at the 2aa00ee conflict and dropped Nathan's small adjustments:
   SimplnxCore: ComputeBoundaryCellsDirect, ErodeDilateBadData, ErodeDilateCoordinationNumber, ErodeDilateMask, ReplaceElementAttributesWithNeighborValues, RequireMinimumSizeFeatures
   OrientationAnalysis: BadDataNeighborOrientationCheckWorklist, NeighborOrientationCorrelation
   The pattern is uniform: promote the inlined `6` neighbor-array sizes to use VoxelNeighbors<Image3D>::k_FaceNeighborCount via a local k_NumFaceNeighbors alias, make neighborVoxelIndexOffsets const, make faceNeighborInternalIdx constexpr, make isValidFaceNeighbor const where it is not mutated, drop the now-unused DataGroup.hpp include, and const-ify NeighborOrientationCorrelation's orientationOps.
   ComputeFeatureNeighborsFilter.md picks up Nathan's all-dimension note about user-set spacing for shared surface area calculation.
3. Fix a per-element OOC fallback in BadDataNeighborOrientationCheckScanline that was triggered whenever the input mask was a BoolArray rather than a UInt8Array. The previous code routed bool masks through maskCompare->isTrue / maskCompare->setValue per voxel per Z-slice, causing chunk thrashing under chunked OOC storage. The Small_IN100 pipeline test (a 189x201x117 volume with a bool mask produced by MultiThresholdObjects) ran in 4.7 s on simplnx-Rel but 3+ minutes on simplnx-ooc-Rel.
   AbstractDataStore<bool> already exposes copyIntoBuffer/copyFromBuffer just like AbstractDataStore<uint8>; the comment claiming otherwise was stale. Resolve a typed AbstractDataStore<bool>* alongside the existing uint8 store pointer and route both load and write-back through bulk I/O, with a small per-slice std::unique_ptr<bool[]> scratch buffer bridging between the algorithm's uint8 slice buffers and the bool data store's typed bulk API. With this change Small_IN100 OOC drops to 4.6 s (~1.6x in-core, in line with normal OOC overhead).
Tests updated:
- IdentifySampleTest.cpp adopts Nathan's PR BlueQuartzSoftware#1590 hand-built 2D Empty Z/Y/X Non-Square regression tests plus the parameterized identify_sample_v2 exemplar test and the SIMPL Backwards Compatibility test, all wrapped with the OOC dual-path pattern (ForceOocAlgorithmGuard + GENERATE(from_range(k_ForceOocTestValues))). The pre-existing 200x200x200 large-scale OOC validation test is retained.
Verified: simplnx-Rel and simplnx-ooc-Rel preset builds both clean. All 43 affected-filter tests pass on simplnx-Rel; all 86 affected-filter tests pass on simplnx-ooc-Rel (regex covering ComputeFeatureNeighbors, IdentifySample, BadDataNeighborOrientation, ComputeBoundaryCells, ErodeDilate*, NeighborOrientationCorrelation, ReplaceElementAttributesWithNeighborValues, RequireMinimumSizeFeatures).
* Replace CreateDataStore + CreateResolvedDataStore with a single resolver-aware CreateDataStore(DataStructure, DataPath, ...) that always consults the registered format resolver. Old explicit-format overload deleted.
* Replace CreateListStore similarly so NeighborList backing storage is OOC-eligible when the OOC plugin is loaded and thresholds permit.
* Inline action-layer caller in ArrayCreationUtilities::CreateArray using GetIOCollection().createDataStoreWithType directly.
* Migrate 23 CreateResolvedDataStore call sites (mechanical rename).
* Migrate 13 cell-level test fixtures that were silently in-memory in OOC builds to the resolver-aware path so OOC builds actually exercise OOC stores.
* Migrate 6 in-memory non-test callers (ComputeFeatureCentroids scratch buffers, HDF5 readers in DataStoreIO and DatasetIO) to direct std::make_shared<DataStore<T>> since they have no DataStructure context.
* Migrate 2 NeighborListIO HDF5 readers to std::make_shared<ListStore<T>> for the same reason (in-core branch of the import pipeline).
* Wire CreateNeighbors action helper through the resolver-aware CreateListStore.
* Rewrite IOFormat.cpp tests to exercise the resolver path.
ImageGeom and RectGridGeom findElementSizes now route through the new CreateDataStore so the voxel-sizes array can go OOC for very large structured grids. RectGridGeom's inner loop was also refactored from per-voxel setValue calls to per-axis precompute + Z-slice copyFromBuffer to avoid catastrophic OOC perf when the array is OOC-backed.
Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
* Rename the CLI flag, ArgumentType enum entry, k_ constants, help function, local variable, and error-message strings to drop the "Ooc" prefix
* Update calls into Preferences to match the renamed API (setMemoryBudgetBytes, memoryBudgetBytes, k_MemoryBudgetBytes_Key)
The MemoryBudgetManager now lives in simplnx core (not the SimplnxOoc plugin) and applies to any cache subsystem that registers against it, so the "Ooc" qualifier is no longer accurate.
Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
Summary
Adds out-of-core (OOC) optimized algorithm variants for 30+ filters, using DispatchAlgorithm to select between in-core (Direct/BFS) and OOC (Scanline/CCL) code paths at runtime based on data store type. A preparatory rename commit gives git rename tracking so that GitHub shows meaningful diffs against the original algorithm code.
This PR contains only the filter optimization layer. The core OOC infrastructure (copyIntoBuffer/copyFromBuffer API, HDF5ChunkedStore, OocDataIOManager, etc.) is in a separate ooc-architecture-rewrite branch that this PR stacks on top of.
Branch Structure
Commit 0 — Rename for Git Tracking
Renames 13 algorithm files to their in-core variant names before any logic changes, so that when dispatch variants are introduced, GitHub shows proper diffs against the original code instead of "new file" with no context.
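To make the Direct/Scanline split concrete, the following is a deliberately simplified sketch of runtime dispatch on store type. It is not the real DispatchAlgorithm template: `StoreSketch`, `DispatchSketch`, and the two functor types are hypothetical, and the real variants take filter inputs rather than an int.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical minimal store descriptor; the real data store classes are richer.
struct StoreSketch
{
  bool outOfCore;
};

// Hypothetical dispatcher: picks the variant once, up front, based on the store.
template <class DirectT, class ScanlineT, class... ArgsT>
std::string DispatchSketch(const StoreSketch& store, ArgsT&&... args)
{
  if(store.outOfCore)
  {
    return ScanlineT{}(std::forward<ArgsT>(args)...);
  }
  return DirectT{}(std::forward<ArgsT>(args)...);
}

// Stand-ins for an in-core variant and an OOC variant of the same algorithm.
struct BoundaryCellsDirect
{
  std::string operator()(int numCells) const
  {
    return "direct:" + std::to_string(numCells);
  }
};

struct BoundaryCellsScanline
{
  std::string operator()(int numCells) const
  {
    return "scanline:" + std::to_string(numCells);
  }
};
```

Keeping the original code as the Direct variant is what makes the rename commit useful: the in-core path shows up as a modified file rather than as a new one.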
Bug Fixes
OOC import of legacy SIMPL files with multi-dimensional component arrays
Legacy SIMPL .dream3d files store multi-dimensional component arrays (e.g., GBCD with componentShape [10,10,10,20,20,2]) with HDF5 physical dimensions in reversed order relative to the ComponentDimensions attribute.
Two fixes address this at different layers:
* AbstractOocStore::readHdf5 (SimplnxOoc): Detects shape mismatch between logical and physical dimensions before the streaming import path. Falls back to flat bulk read (H5S_ALL) when shapes differ, preserving correct byte order.
* ImportH5ObjectPathsAction::backfillReadOnlyOocStores (simplnx): The read-only reference store optimization creates stores pointing directly at the source file. For mismatched arrays, the N-D hyperslabs would be out-of-bounds. Detects the mismatch and creates a writable OOC store populated via readHdf5 (which triggers the flat-read fallback) instead of a read-only reference.
Filter Optimizations
Group B — Face-Neighbor Filters (5 filters)
Split into Direct (in-core) and Scanline (OOC) algorithm classes using DispatchAlgorithm. Scanline variants use Z-slice rolling windows (prev/cur/next) for cross-slice neighbor access with zero per-element OOC overhead.
Filters: ComputeBoundaryCells, ComputeSurfaceFeatures, ComputeFeatureNeighbors, ComputeSurfaceAreaToVolume, BadDataNeighborOrientationCheck
Group C — Morphological / Neighbor Replacement (5 filters)
Z-slice rolling buffers serve all 6 face-neighbor reads from RAM. SliceBufferedTransfer provides type-dispatched bulk tuple copy.
Filters: ErodeDilateBadData, ErodeDilateCoordinationNumber, ErodeDilateMask, ReplaceElementAttributesWithNeighborValues, NeighborOrientationCorrelation
Group D — CCL Segmentation (5 filters)
Chunk-sequential Connected Component Labeling using UnionFind equivalence tracking, replacing BFS/DFS flood fill for OOC data.
Filters: ScalarSegmentFeatures, EBSDSegmentFeatures, CAxisSegmentFeatures, FillBadData, IdentifySample
Group E — AlignSections Family (4 filters)
Bulk slice read/write via AlignSectionsTransferDataOocImpl. Per-filter OOC findShifts with 2-slice buffers and bulk mask reads.
Filters: AlignSectionsMisorientation, AlignSectionsMutualInformation, AlignSectionsFeatureCentroid, AlignSectionsListFilter
QuickSurfaceMesh
DispatchAlgorithm<QuickSurfaceMeshDirect, QuickSurfaceMeshScanline>. Scanline eliminates the O(volume) nodeIds array (7.5 GB for 1000³) with rolling 2-plane node buffers (16 MB). Two-pass architecture: counting pass + mesh creation pass. All output arrays (triangle connectivity, faceLabels, vertex coordinates, nodeTypes) buffered per z-slice and flushed with copyFromBuffer. Batch quickSurfaceTransferBatch API added to TupleTransfer for bulk source-read/dest-write of cell and feature data.
SurfaceNets
DispatchAlgorithm<SurfaceNetsDirect, SurfaceNetsScanline>. Scanline is a complete reimplementation (881 lines) eliminating the O(n) Cell[] array; it uses an O(surface) hash map + vertex vectors with slice-by-slice FeatureIds reading. All output arrays (vertices, nodeTypes, triangle connectivity, faceLabels) buffered and flushed with copyFromBuffer. Batch surfaceNetsTransferBatch API added to TupleTransfer for bulk I/O.
Mesh Infrastructure (RepairTriangleWinding + GeometryHelpers)
* RepairTriangleWinding: Bulk-reads triangle face list and faceLabels into local buffers; all BFS work operates on local memory; modified triangles written back via copyFromBuffer.
* FindElementsContainingVert / FindElementNeighbors (GeometryHelpers.hpp): Chunked bulk I/O with 65K-element chunks for sequential passes. Random neighbor lookups check if candidate is in the current chunk (cache hit) before falling back to per-element copyIntoBuffer. Together with RepairTriangleWinding buffering, this reduced SurfaceNets Winding from 515s to 2.9s.
Clustering Filters (3 filters)
* DBSCAN: DispatchAlgorithm<DBSCANDirect, DBSCANScanline>; chunked grid construction, on-demand per-grid-cell coordinate reads in canMerge. 653s → 12s (54x)
* ComputeKMedoids: DispatchAlgorithm<Direct, Scanline>; chunked findClusters, per-cluster optimizeClusters with O(max_cluster_size) peak memory. 74s → 13s (5.7x)
* ComputeFeatureClustering: Single implementation with feature-level array caching. 203s → 77s (2.6x)
Pipeline Prerequisite Filters (2 filters)
* MultiThresholdObjects: DispatchAlgorithm<Direct, Scanline>; eliminates O(n) tempResultVector in OOC path
* ConvertOrientations: Single implementation with chunked bulk I/O in macro-generated Convertor classes (4096-tuple chunks)
Together these reduced the AlignSectionsMisorientation pipeline test from 635s to 5.9s (107x).
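The chunked bulk I/O pattern that recurs throughout this PR has roughly the following shape. The sketch uses a mock store with pointer-and-count bulk calls as a simplified stand-in; the real copyIntoBuffer/copyFromBuffer take a span and return a Result<>, and `MockStore` and `TransformChunked` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Mock store that counts bulk calls; a real OOC store would hit disk-backed chunks here.
class MockStore
{
public:
  explicit MockStore(std::vector<float> data)
  : m_Data(std::move(data))
  {
  }

  std::size_t size() const
  {
    return m_Data.size();
  }

  float at(std::size_t i) const
  {
    return m_Data[i];
  }

  std::size_t bulkCalls() const
  {
    return m_BulkCalls;
  }

  void copyIntoBuffer(std::size_t start, float* dest, std::size_t count) const
  {
    std::copy_n(m_Data.begin() + static_cast<std::ptrdiff_t>(start), count, dest);
    ++m_BulkCalls;
  }

  void copyFromBuffer(std::size_t start, const float* src, std::size_t count)
  {
    std::copy_n(src, count, m_Data.begin() + static_cast<std::ptrdiff_t>(start));
    ++m_BulkCalls;
  }

private:
  std::vector<float> m_Data;
  mutable std::size_t m_BulkCalls = 0;
};

// Read a chunk, transform it in a plain local buffer, write the chunk back.
template <class FnT>
void TransformChunked(MockStore& store, std::size_t chunkSize, FnT fn)
{
  assert(chunkSize > 0);
  std::vector<float> buffer(std::min(chunkSize, store.size()));
  for(std::size_t start = 0; start < store.size(); start += chunkSize)
  {
    const std::size_t count = std::min(chunkSize, store.size() - start);
    store.copyIntoBuffer(start, buffer.data(), count);
    for(std::size_t i = 0; i < count; i++)
    {
      buffer[i] = fn(buffer[i]);
    }
    store.copyFromBuffer(start, buffer.data(), count);
  }
}
```

With 4096-tuple chunks, a 10,000-element array costs six bulk calls instead of 20,000 per-element accesses, and peak scratch memory is one chunk.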
OrientationAnalysis Misc (10 filters)
* ComputeTwinBoundaries: Bulk-read all face/feature/ensemble arrays into local vectors. 179s → 44s (4x)
* ComputeKernelAvgMisorientations: Slab-based bulk I/O with cached CrystalStructures
* ComputeAvgCAxes: Already OOC-optimized (chunked reads, cached feature output). Compute-bound.
* ReadH5Ebsd: copyFromBuffer in CopyData template, phase copy, Euler interleaving. 463s → 241s (1.9x)
* ComputeGBCDPoleFigure: DispatchAlgorithm<Direct, Scanline>; Direct caches full GBCD, Scanline caches only the phase-of-interest slice (bounded by bin resolution, not cell count). 853s → 0.9s (948x)
* ComputeFeatureReferenceCAxisMisorientations: Z-slice buffered I/O for all cell-level arrays (featureIds, cellPhases, quats, output). Cached ensemble/feature-level arrays (crystalStructures, avgCAxes). 196s → 5.4s (36x)
* ComputeFeatureNeighborCAxisMisalignments: Bulk-read all feature-level arrays (featurePhases, featureAvgQuat, crystalStructures) and buffered avgCAxisMisalignment output.
* MergeTwins: Chunked bulk I/O for voxel-level parent ID fill and assignment loop. Feature-level featureParentIds cached locally for lookup. 67s → 1.8s (37x)
* ReadCtfData: Bulk copyFromBuffer for all cell arrays (phases, euler angles, bands, error, MAD, BC, BS, X, Y). Euler angle interleave uses chunked 64K buffer. Crystal structures cached locally for hex correction. 231s → 0.25s
* ReadAngData: Same bulk copyFromBuffer pattern. Phase validation done in-place on EbsdLib buffer before single bulk write. Euler interleave chunked.
Pipeline-Critical Filters (6 filters)
Optimizations targeting the filters responsible for OOC pipeline timeouts (4 of 5 timed-out pipelines blocked by ComputeIPFColors):
* ComputeIPFColors: DispatchAlgorithm<ComputeIPFColorsDirect, ComputeIPFColorsScanline>. Direct keeps parallel ParallelDataAlgorithm for in-core; Scanline uses chunked sequential bulk I/O (65K-tuple chunks) with locally cached crystal structures. ForceOocAlgorithmGuard added to test. 1,937ms → 90ms (21.5x)
* ComputeFeatureSizes: Chunked copyIntoBuffer for featureIds (ImageGeom path) and featureIds + elemSizes (RectGridGeom path with Kahan summation preserved). 813ms → 28ms (29x)
* ComputeAvgOrientations: Chunked featureIds/phases/quats reads, locally cached crystal structures and avgQuats (feature-level). Bulk copyFromBuffer for output arrays.
* ComputeFeatureReferenceMisorientations: Chunked all cell-level arrays (featureIds, phases, quats, GB distances, output misorientations). Locally cached crystal structures, avgQuats, and center quaternions (all feature/ensemble-level). 106ms → 1ms (106x)
* ComputeFeatureCentroids: Replaced AbstractDataStore intermediate arrays (sum, center, count, rangeX/Y/Z) with plain std::vector, which eliminates ~119M virtual dispatch calls per run. Chunked featureIds reads. Inline coordinate computation from spacing/origin. 39,724ms → 25ms (1,589x)
* RequireMinimumSizeFeatures: Three-part optimization:
  * removeSmallFeatures: Chunked featureIds read/write (65K-tuple batches)
  * assignBadVoxels: 3-slice rolling slab buffer for neighbor voting scan (O(slice) memory), sparse changed-voxel tracking to skip full-volume transfer when few/no voxels changed. 14,592ms → 142ms (103x)
  * RemoveInactiveObjects (shared utility in DataGroupUtilities.cpp): Chunked featureIds renumbering with copyIntoBuffer/copyFromBuffer. 5,573ms → 50ms (111x)
Additional Filters
* ComputeEuclideanDistMap: Bulk-read featureIds and distance stores into local vectors; flood-fill operates on local memory; bulk-write output. 116s → 1.1s (105x)
* AppendImageGeometry: Bulk I/O for mirror operations (scanline-based reversal instead of per-tuple swaps). 469s → 113s (4.2x)
GBCD Filter Group (5 filters)
All five GBCD filters optimized for OOC with zero cell-level O(n) allocations, cancel checking, and progress messaging:
* ComputeGBCDPoleFigure: DispatchAlgorithm<Direct, Scanline> with ForceOocAlgorithmGuard in test. Scanline caches only the phase-of-interest GBCD slice via copyIntoBuffer.
* WriteGBCDGMTFile: Phase-of-interest GBCD slice cached via copyIntoBuffer; crystal structures cached locally.
* WriteGBCDTriangleData: Chunked triangle I/O (8K chunks), feature-level euler cache, buffered file output via fmt::format_to + fmt::memory_buffer.
* ComputeGBCD: Feature-level caching (eulers, phases, crystalStructures), chunked triangle array reads per 50K-triangle iteration, GBCD output accumulated in local buffer (bounded by phases × bins) then written back via copyFromBuffer.
* ComputeGBCDMetricBased: Eliminated O(n) triIncluded allocation (replaced with per-chunk sequential area accumulation). Feature-level caching (phases, eulers, crystalStructures, featureFaceLabels). Chunked triangle I/O in totalFaceArea scan. Raw pointer access in parallel TrianglesSelector worker.
HDF5 Import + Pole Figure Filters (3 filters)
* FillOocDataStore (shared infrastructure): Streaming chunked HDF5 hyperslab reads + copyFromBuffer, with zero O(n) temp allocations; batched reads even for partial hyperslabs. Benefits all HDF5 import paths.
* ReadH5EspritData: copyFromBuffer bulk writes from raw HDF5 reader buffers, replacing 9+ per-element operator[] writes per point.
* WritePoleFigure: Chunked iteration over eulerAngles/phases/mask per-phase using bounded buffers (no O(n) pre-caching); bulk-write intensity and image outputs via copyFromBuffer.
* ReadHDF5Dataset: Cancel checking + per-dataset progress messages.
WritePoleFigureTest and ReadHDF5DatasetTest optimized with copyIntoBuffer.
Core Utilities + Geometry Filters
* ImportFromBinaryFile: copyFromBuffer instead of per-element writes. ReadRawBinary Case1: 1076s → 29s (37x)
* CropImageGeometry: Row-based bulk I/O. 27s → 2.6s (10x)
* RandomizeFeatureIds (ClusteringUtilities): Chunked bulk I/O for both overloads; benefits all callers (segmentation filters, SharedFeatureFace, MergeTwins).
* AppendData / CopyData / mirror swaps: Runtime OOC check; chunked bulk I/O for OOC, original code for in-core (verified zero in-core regression)
* TupleTransfer: Added quickSurfaceTransferBatch and surfaceNetsTransferBatch batch APIs with bulk copyIntoBuffer/copyFromBuffer for source reads and destination writes. Used by QuickSurfaceMeshScanline and SurfaceNetsScanline.
Cancel + Progress Messaging
All in-core and OOC algorithm variants now have:
* m_ShouldCancel checks at the top of major outer loops
* ThrottledMessenger-based progress reporting with descriptive phase messages and percentage completion
OOC Performance Results
All benchmarks on arm64 Release build with forceOocData = true.
Mesh Generation Filters (full ctest wall-clock, OOC build)
Groups B–E (200³ dataset, filter.execute() only)
Pipeline-Critical Filters (filter.execute() only, OOC build)
OrientationAnalysis Filters (full ctest wall-clock, OOC build)
GBCD Filter Group (full ctest wall-clock)
HDF5 Import + Pole Figure Filters (full ctest wall-clock)
Additional Optimizations (full ctest wall-clock)
Test Infrastructure
Rotation Filter Bulk I/O
* RotateSampleRefFrame: Slab-based bulk I/O in RotateImageGeometryWithNearestNeighbor: reads source Z-slabs via copyIntoBuffer, processes output slices into local buffers, writes via copyFromBuffer. No O(n) allocation.
* RotateEulerRefFrame: Chunked copyIntoBuffer/copyFromBuffer (65K tuples per chunk). 19.5s → 4.8s (4x)
Comparison Function Bulk I/O
CompareFloatArraysWithNans, CompareArrays, and CompareDataArraysByComponent in UnitTestCommon.hpp were doing per-element operator[] access, causing extreme slowdowns when comparing OOC-backed arrays. Replaced with chunked copyIntoBuffer reads (40K elements per chunk), matching the existing CompareDataArrays pattern. This alone reduced the ComputeGBCD test from 1500s (timeout) to ~10s; the filter itself runs in ~3s.
* ForceOocAlgorithmGuard coverage in all optimized filter tests for both algorithm paths
* SIMPLNX_TEST_ALGORITHM_PATH CMake option (0=Both, 1=OOC-only, 2=InCore-only) for build-specific test path control
Test Plan