feat: rewrite the CosMx reader (stitching, multimodal, co-registration) by timtreis · Pull Request #406 · scverse/spatialdata-io

timtreis · 2026-06-28T15:43:32Z

Summary

This replaces the CosMx reader with a substantially expanded implementation developed in Nanostring-Biostats/spatialdata-io-ns. The current reader (~280 lines, single file) is replaced by a 5-module reader (~3.8k lines) that adds:

Multi-FOV stitching of morphology/protein images and label masks onto a global canvas (zarr-backed, memory-bounded).
Multimodal RNA + Protein runs.
Polygon ↔ label ↔ table co-registration (consistent global_cell_id across elements).
skip_empty_fovs — drops phantom FOVs listed in the positions file but absent on disk (avoids an inflated/desynced canvas).
Opt-in per-channel percentile image normalization (image_normalization_percentile, default None = legacy dtype-max) — recovers low-signal uint16 channels that otherwise render near-black; scale-only and reversible.
Robust handling of the known export-format variations (AtomX flat files, nested CellStatsDir, px-only / mm-only / px+mm FOV positions).
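The percentile normalization item can be illustrated with a small sketch. This is not the reader's code: the helper names, the nearest-rank percentile, and the choice to divide by the percentile value without clipping are all assumptions, made only to show how a scale-only transform stays reversible and why `None` falls back to the dtype maximum.

```python
import math

UINT16_MAX = 65535  # legacy divisor when no percentile is requested


def percentile_value(values, percentile):
    """Nearest-rank percentile of a non-empty sequence (hypothetical helper)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(percentile / 100 * len(ordered)))
    return ordered[rank - 1]


def channel_scale(values, percentile=None, dtype_max=UINT16_MAX):
    """Return the divisor for one channel.

    percentile=None mimics the legacy dtype-max behaviour; otherwise the
    channel's own upper percentile is used (assumed, not confirmed).
    """
    if percentile is None:
        return float(dtype_max)
    upper = percentile_value(values, percentile)
    # Guard against an all-zero channel, which would give a zero divisor.
    return float(upper) if upper > 0 else float(dtype_max)


def normalize(values, scale):
    """Scale only, no clipping, so the original values can be recovered."""
    return [v / scale for v in values]


def denormalize(values, scale):
    """Invert normalize() by multiplying the scale back in."""
    return [round(v * scale) for v in values]


# A low-signal uint16 channel: its brightest pixel is far below 65535.
dim_channel = [0, 50, 100, 150, 200]
legacy = normalize(dim_channel, channel_scale(dim_channel))
stretched = normalize(dim_channel, channel_scale(dim_channel, 99.0))
print(max(legacy), max(stretched))
```

With the legacy divisor the brightest pixel lands near 0.003, which is why such channels render near-black. With the per-channel divisor it reaches 1.0, and `denormalize` returns the original integers.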

Reuses the existing CosmxKeys constant (values unchanged). Adds runtime deps polars, dask, zarr>=3. Other readers are untouched.

⚠️ Breaking changes (no backwards-compat shim)

The public cosmx() signature changes:

| Old | New |
| --- | --- |
| `transcripts=True` | `read_transcripts=True` (+ `read_images/labels/proteins/polygons/gexp`) |
| `imread_kwargs: Mapping = MappingProxyType({})` | `imread_kwargs: dict \| None = None` |
| `image_models_kwargs: Mapping = MappingProxyType({})` | `image_models_kwargs: dict \| None = None` |
| (none) | new: `fovs`, `channels`, `n_workers`, `flip_image`, `image_normalization_percentile`, `polygons_as_labels`, `keep_polygons_after_rasterize`, `align_rasters_to_polygons`, `add_fovs_as_shapes`, `skip_empty_fovs`, `preview_fovs` |
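Since there is no compatibility shim, callers have to update their keyword arguments themselves. The sketch below shows one way to translate an old-style kwargs mapping using only the renames listed above. `migrate_cosmx_kwargs` is a hypothetical helper, not part of this PR, and the `chunks` key is just a placeholder value.

```python
from collections.abc import Mapping
from types import MappingProxyType

# Renames taken from the breaking-changes table.
RENAMED = {"transcripts": "read_transcripts"}
# Arguments whose default changed from an empty MappingProxyType to None.
MAPPING_KWARGS = ("imread_kwargs", "image_models_kwargs")


def migrate_cosmx_kwargs(old: Mapping) -> dict:
    """Translate old-style cosmx() kwargs to the new names (hypothetical helper)."""
    new = {}
    for key, value in old.items():
        key = RENAMED.get(key, key)
        if key in MAPPING_KWARGS:
            # New signature expects a plain dict, or None when nothing is set.
            value = dict(value) or None
        new[key] = value
    return new


old_call = {
    "transcripts": True,
    "imread_kwargs": MappingProxyType({}),
    "image_models_kwargs": MappingProxyType({"chunks": 512}),
}
print(migrate_cosmx_kwargs(old_call))
```

The result can then be splatted into the new call, for example `cosmx(path, **migrate_cosmx_kwargs(old_call))`.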

New public objects: CosMxDataset (frozen dataclass describing the on-disk layout) and CosMxDatasetReader.

Replace the CosMx reader with a substantially expanded implementation ported from Nanostring-Biostats/spatialdata-io-ns: multi-FOV stitching, multimodal RNA+Protein, polygon<->label<->table co-registration, skip_empty_fovs, and opt-in per-channel percentile image normalization.

- 5 modules: cosmx.py + _cosmx_io/_cosmx_stitching/_cosmx_discovery/_cosmx_utils
- reuses the existing CosmxKeys constant
- adds runtime deps: polars, dask, zarr>=3
- synthetic regression test suite (tests/test_cosmx_regression.py)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Match the upstream reader convention: decorate cosmx() with @inject_docs and document the recognized CosMx file suffixes from CosmxKeys in the docstring.

readers/cosmx.py + the 4 _cosmx_* helper modules -> a readers/cosmx/ package (_reader/_io/_stitching/_discovery/_utils), with relative intra-package imports. No behavior change; all 86 tests pass.

The dev-era suite (82 tests) was ~2x upstream's heaviest reader test. Drop the brittle snapshot tests and the pure internal-helper unit tests (header matching, tile clipping) that are already covered end-to-end, and thin the remaining helper-unit classes to their highest-value cases. Keeps all end-to-end, feature (skip_empty_fovs, normalization, flip), and co-registration coverage. 56 tests (60 incl. parametrize), all passing.

Extract repeated boilerplate in _reader.py into small helpers, no behavior change:

- _global_id_series: the AnnData unwrap + categorical->Int64 coercion (was in _collect_global_cell_ids_from_tables and _filter_tables_to_ids)
- _parse_cell_table: the identical TableModel.parse(... region_key/instance_key/overwrite_metadata) call at 3 sites
- _parse_labels: the identical Labels2DModel.parse(... dims/translation) at 3 sites
- _finalize_transcripts: the duplicated PointsModel.parse tail of the parquet and csv transcript readers
- move the self-contained obs column-sanitization out to _utils._sanitize_obs_columns

All 60 tests pass.

…hored)

Make the single-modality path label-anchored, matching the multimodal path, so both follow one rule: the segmentation is the ground truth. Rasterise every cell, then filter the table(s) to the cells that rasterised.

Decided from real data: in multimodal runs the protein-derived segmentation is ground truth and modalities quantify DIFFERENT cell subsets (breast: RNA covers 99.1%, protein 100% of segmentation), so labels cannot be anchored on any single table.

- single path: drop the table->polygon->label 'allowed_global_ids' cascade and the final-pass table filter; read_labels() rasterises all polygons, then the table is filtered to the label IDs (was: only rasterise polygons with a table row, dropping segmented-but-unquantified cells)
- remove the now-dead allowed_global_ids param (read_labels/_labels_from_polygons) and _collect_global_cell_ids_from_tables
- fix _prescan_max_cell_id: it read column 'cell_ID' but polygon CSVs use 'cellID', so polygon-only high cell IDs (exactly the orphan cells) were missed and mis-locked max_cell_id; it now reads either column

Validated on real data (breast multiomics, FOV 128): labels=649 (all polygons), RNA table=541, Protein table=649. The 108 cells RNA lacks are kept in the labels.

All 60 synthetic tests pass.
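The label-anchored rule can be sketched with plain Python collections. This is an illustration only: the real reader works on rasterised label images and AnnData tables, and `filter_table_to_labels` is a hypothetical stand-in for the table filter. Only the `global_cell_id` key comes from the PR description.

```python
def filter_table_to_labels(table_rows, label_ids):
    """Keep only table rows whose cell made it into the label mask."""
    label_ids = set(label_ids)
    return [row for row in table_rows if row["global_cell_id"] in label_ids]


# Every segmented cell gets a polygon, and every polygon is rasterised.
polygon_ids = [1, 2, 3, 4]
label_ids = set(polygon_ids)  # stand-in for the rasterisation step

# The RNA table quantifies only a subset; ID 99 has no segmentation at all.
rna_rows = [
    {"global_cell_id": 1, "n_counts": 120},
    {"global_cell_id": 2, "n_counts": 80},
    {"global_cell_id": 4, "n_counts": 15},
    {"global_cell_id": 99, "n_counts": 3},
]

rna_kept = filter_table_to_labels(rna_rows, label_ids)
unquantified = label_ids - {row["global_cell_id"] for row in rna_kept}

print([row["global_cell_id"] for row in rna_kept])  # table is a subset of labels
print(sorted(unquantified))  # cells kept in labels but absent from the table
```

The direction of the filter is the point: labels are never reduced to match a table, so a segmented-but-unquantified cell (ID 3 here) stays in the labels while the table drops rows with no matching label.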

…scan

A polygon cell with no expression row (highest cell_ID) must stay in the labels while the table holds only quantified cells; the high cell_ID also exercises the max_cell_id prescan reading the polygon 'cellID' column.

The pre-scan that locks max_cell_id matched the id column against a hardcoded ("cell_ID", "cellID") pair and swallowed every failure with a bare `except: continue`. Both could silently drop a file and mis-lock the max, the exact failure class behind the polygon-orphan mis-lock fixed in 190b121.

- Extract `_match_canonical(hdr, canon)` from `_match_header` so a single alias table backs both; route the pre-scan's id- and fov-column lookups through it (now also matches `cell_id`/`object_id`/`roi`/`fov_id`, case-insensitive) instead of a narrower ad-hoc set.
- Replace the blanket silent `except` with per-step warnings: a file that exists but can't be scanned, has no recognizable cell_ID column, or whose max can't be read is logged and skipped, never silently dropped.

Add regression tests for aliased id/fov columns and the no-silent-drop warning.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
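A rough sketch of the pattern described in this commit: one alias table behind a case-insensitive lookup, and a pre-scan that warns instead of silently skipping. The function bodies, the assignment of aliases to canonical names, and the in-memory CSV inputs are assumptions for illustration, not the PR's implementation.

```python
import csv
import io
import warnings

# Assumed grouping of the aliases named in the commit message.
ALIASES = {
    "cell_ID": ("cell_ID", "cellID", "cell_id", "object_id"),
    "fov": ("fov", "fov_id", "roi"),
}


def match_canonical(header, canon):
    """Return the header entry that aliases `canon`, ignoring case, else None."""
    wanted = {alias.lower() for alias in ALIASES[canon]}
    for name in header:
        if name.strip().lower() in wanted:
            return name
    return None


def prescan_max_cell_id(sources):
    """Find the largest cell ID across CSV texts, warning on every skipped file.

    `sources` maps a file name to its CSV text (a stand-in for files on disk).
    """
    best = None
    for name, text in sources.items():
        reader = csv.DictReader(io.StringIO(text))
        column = match_canonical(reader.fieldnames or [], "cell_ID")
        if column is None:
            warnings.warn(f"{name}: no recognizable cell_ID column; skipped")
            continue
        try:
            file_max = max(int(row[column]) for row in reader)
        except (ValueError, TypeError) as err:
            # Empty file or non-integer IDs: report it rather than hide it.
            warnings.warn(f"{name}: could not read max cell ID ({err}); skipped")
            continue
        best = file_max if best is None else max(best, file_max)
    return best


sources = {
    "polygons.csv": "fov,cellID,x\n1,7,0\n1,649,1\n",
    "exprMat.csv": "FOV,Cell_ID\n1,541\n",
    "unrelated.csv": "a,b\n1,2\n",
}
print(prescan_max_cell_id(sources))
```

Catching only `ValueError` and `TypeError` keeps unexpected failures visible, in contrast to the bare `except: continue` the commit removes.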

Drop three re-implemented helpers in favour of the package/ecosystem equivalents, and attach the standard reader provenance every other reader sets. Net -50 LOC in the cosmx package; behaviour for valid inputs is unchanged (full regression suite + zarr round-trip green).

- `_deduplicate_names` -> `anndata.utils.make_index_unique` for channel names (suffix format changes from "x (1)" to "x-1", the anndata standard).
- `_sanitize_obs_columns` -> `spatialdata.sanitize_table`, the blessed zarr-safe key sanitiser (case-insensitive uniqueness across obs/var keys).
- Add `_set_reader_metadata(sdata, "cosmx")` in `_assemble_sdata` so the output carries `spatialdata_io_reader`/`_software_version` like the other readers (cosmx previously set neither).

Kept bespoke (no core equivalent): per-channel image normalization, categorical->string coercion, and the CSV-header canonicaliser.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

timtreis and others added 9 commits June 28, 2026 17:18

feat(cosmx): add @inject_docs(cx=CosmxKeys) and document file suffixes

fd049a7


refactor(cosmx): move the reader into a cosmx/ subpackage

a0b1ab7


feat: rewrite the CosMx reader (stitching, multimodal, co-registration) #406
timtreis wants to merge 9 commits into scverse:main from timtreis:feat/cosmx-reader-rewrite
