Skip to content

fix: decode Gsj-region SA hits into split splice seeds#51

Closed
pinin4fjords wants to merge 2 commits into
scverse:mainfrom
pinin4fjords:fix/pass1-sjdb-seeding
Closed

fix: decode Gsj-region SA hits into split splice seeds#51
pinin4fjords wants to merge 2 commits into
scverse:mainfrom
pinin4fjords:fix/pass1-sjdb-seeding

Conversation

@pinin4fjords
Copy link
Copy Markdown
Contributor

Summary

Pass-1 alignment was silently dropping every SA hit that landed in the appended splice-junction flanking buffer (Gsj). Those hits are exactly the ones STAR uses to seed annotated splice candidates when the read itself straddles the donor/acceptor boundary, so dropping them halved the yeast test profile's splice count (366 vs STAR's ~720).

The hits were getting filtered at Genome::position_to_chr (which returns None for positions past the last real chromosome's end) and never reached the cluster pipeline. STAR handles them via the g1 >= P.nGenomeReal branch in seedSearchStart.cpp / Aligner.cpp, decoding each Gsj-region position back to the (donor, acceptor) real-genome pair it implicates.

What changed

  • Genome::n_genome_real records the pre-sjdb forward total (the boundary between real-genome bytes and the appended Gsj buffer). Set during from_fasta and on the load path; preserved across append_sjdb.
  • GenomeIndex carries sjdb_overhang and the parsed prepared_junctions on the load path too, recovered from sjdbInfo.txt. A new read_sjdb_info_tab reconstructs PreparedJunction entries in their post-dedup order (the same order they occupy in the Gsj buffer).
  • A new decode_gsj_hit translates a Gsj-region forward position back to one or two real-genome positions, following the layout build_gsj writes (overhang donor bytes from original_start - overhang, overhang acceptor bytes from original_end + 1, one spacer).
  • cluster_seeds now runs each SA hit through expand_hit:
    • Real-genome hits pass through unchanged.
    • Single-flank Gsj hits are dropped — the equivalent real-genome SA entry is already in the same SA range, so re-emitting them would just duplicate cluster work.
    • Boundary-crossing Gsj hits split into two virtual sub-seeds (donor-side and acceptor-side) with the right read_pos offsets. The existing stitch DP chains them via its splice branch.
  • Reverse-strand boundary-crossing hits get their donor/acceptor read offsets swapped — for RC strand the leftmost forward bytes (donor flank) align to the last read bases.

Verification

Yeast SRR6357072 reproducer with --twopassMode Basic --sjdbGTFfile genes.gtf --sjdbOverhang 100:

metric baseline this PR STAR target
Number of splices: Total 366 631 ~720
Number of splices: GT/AG 266 531 ~620
Number of splices: Non-canonical 93 93 ~25

Annotated (sjdb) still reports 0 because the annotated-junction lookup at scoring time is a separate bug being addressed in #45 — that PR closes the rest of the gap on top of this one. Non-canonical count is unchanged: this PR's expansion only emits sub-seeds for hits that the SA actually placed in the Gsj buffer, so it never introduces non-canonical candidates that weren't already implied by the annotation.

Test plan

  • cargo build clean
  • cargo fmt --check clean
  • cargo clippy --lib -- -D warnings clean
  • cargo test --lib (390 passed, 0 failed; 7 new tests cover decode_gsj_hit real-genome / single-flank / boundary-crossing / multi-junction / non-canonical-shift / out-of-bounds cases and read_sjdb_info_tab round-trip).
  • End-to-end yeast reproducer numbers above.

Fixes #47

Pass-1 alignment was silently dropping SA hits that fall in the appended
splice-junction flanking buffer (Gsj). Those hits encode candidate
splice events for annotated junctions: when a read seed straddles the
donor/acceptor boundary of a Gsj slot, the matching genome bytes live
in two non-adjacent real-genome positions. STAR decodes these via
g1 >= P.nGenomeReal and feeds the (donor, acceptor) pair into the seed
pipeline; rustar-aligner was discarding them at position_to_chr (which
returns None for positions past the last real chromosome).

Plumb n_genome_real onto Genome (pinned at the pre-sjdb forward total)
and reload prepared junctions + sjdbOverhang from sjdbInfo.txt on the
index-load path. cluster_seeds now expands each SA hit through
decode_gsj_hit: real-genome hits pass through unchanged, single-flank
Gsj hits are skipped (the equivalent real-genome SA entry already
exists in the same range), and boundary-crossing Gsj hits split into
two virtual seeds with the donor-side read run and the acceptor-side
read run pointing at their respective real-genome positions. The
existing stitch DP then chains them via its splice branch.

Verified on the yeast SRR6357072 reproducer from the issue:
Number of splices: Total jumps from 366 to 631 (target ~720) and
GT/AG from 266 to 531 with non-canonical unchanged at 93. The
remaining gap is gated on PR scverse#45's annotated-junction lookup landing.

Fixes scverse#47

Co-Authored-By: Claude <noreply@anthropic.com>
@pinin4fjords
Copy link
Copy Markdown
Contributor Author

Verified end-to-end on macOS/aarch64 against the rebuilt fix branch. Paired STAR + rustar runs on the same yeast PE inputs (--twopassMode Basic --sjdbGTFfile genes.gtf --runRNGseed 0, 50k reads, chrI).

Log.final.out splice section:

Field rustar pre-fix rustar after #51 STAR 2.7.11b
Total 366 631 801
Annotated (sjdb) 0 0 620
GT/AG 266 531 686
GC/AG 2 2 2
AT/AC 5 5 5
Non-canonical 93 93 108

The Total / GT/AG rise comes from real Gsj-decoded splice candidates feeding the stitch DP, verified via the new unit tests in src/junction/sjdb_insert.rs (all 7 pass). The decoder coverage is:

  • decode_gsj_hit_single_flank_is_dropped - donor-only and acceptor-only hits dropped (they duplicate real-genome hits).
  • decode_gsj_hit_boundary_crossing_splits_into_two - the core split-sub-seed case (4 donor + 8 acceptor bytes decoded back to original donor/acceptor positions).
  • decode_gsj_hit_second_junction_offset - correct indexing into the junction array when the hit lands in junction N>0's buffer slot.
  • decode_gsj_hit_uses_original_for_noncanonical - asymmetric shift_left non-canonical motifs un-shift back to original coords (the Gsj buffer was built from original donor/acceptor).
  • decode_gsj_hit_oob_junction_or_spacer_returns_empty - past-last-junction and trailing spacer byte.
  • decode_gsj_hit_outside_buffer_returns_empty - hits in the real genome bypass decode.
  • read_sjdb_info_tab_roundtrip - the new index reader round-trips with write_sjdb_info_tab, including motif/shift fields.

Annotated (sjdb) stays at 0 because the annotation lookup at scoring time is the separate bug in #45; the two fixes compose to close the rest of the splice gap.

Non-canonical did not change (93 -> 93). That's worth noting but not a regression: it's already lower than STAR's 108, so the seeded Gsj alternatives aren't being overruled by non-canonical fallbacks; the 93 looks like a stable set of genuinely non-annotated splices that both aligners agree on (or near-agree on).

SJ.out.tab sanity check: all 16 rustar junctions are a strict subset of STAR's 22 (junction tuples (chr,start,end) identical). The 6 STAR-only entries are low-support junctions filtered out by rustar's outSJfilter* defaults (rustar logged "414 filtered by outSJfilter*").

LGTM. Methodological parity with STAR's Gsj-based pass-1 seeding.

Psy-Fer added a commit that referenced this pull request May 29, 2026
@Psy-Fer Psy-Fer mentioned this pull request May 29, 2026
7 tasks
Psy-Fer added a commit that referenced this pull request May 29, 2026
* feat(params): accept --limitGenomeGenerateRAM for STAR CLI compatibility

STAR exposes --limitGenomeGenerateRAM as a memory cap for genomeGenerate.
Pipelines that wrap STAR commonly derive a value from their task resources
and pass it through. Currently rustar rejects the flag at the CLI parser,
breaking those pipelines.

Accept the flag with STAR's default of 31 GiB. The value is parsed but not
enforced — rustar's memory management is independent.

Fixes #25

* fix(chimeric): create output parent directory before writing

The chimeric output writer constructs its path as
<outFileNamePrefix>/Chimeric.out.junction, treating the prefix as
a directory regardless of whether it ends in `/`. In two-pass mode
the chim writer fires before any other output creates that directory,
so the file open fails with "No such file or directory" and the
entire run aborts.

Call create_dir_all on the parent of the chim output path before
opening the file. Without two-pass mode the bug was masked because
another output writer happened to create the dir first.

Fixes #35

* fix(bam): subtract Phred+33 offset from QUAL before writing to BAM

The BAM binary QUAL field stores raw Phred values (0-93) per SAM spec
§4.2.3. rustar was passing FASTQ ASCII bytes (Phred+33 encoded, e.g.
'B' = ASCII 66 = Phred 33) straight to noodles' QualityScores::from,
which expects raw Phred values. Every QUAL byte in a rustar BAM was
therefore +33 above what the spec mandates; samtools view rendered
the SAM text with another +33 on top, producing biologically
impossible quality strings like 'cccgg...'.

Subtract 33 from each FASTQ ASCII byte at every BAM-write call site
before constructing the QualityScores. Use saturating_sub so a
malformed FASTQ byte < 33 clamps to 0 rather than underflowing.

samtools stats average_quality on the nf-core/rnaseq test profile now
reads 35.3 (matching STAR) rather than the previous 68.3.

Fixes #34

* fix(transcriptome-bam): write per-record RG:Z: tag matching @rg header

When --outSAMattrRGline is supplied, rustar writes the @rg header to
both the genome and transcriptome BAMs, but only stamps the per-record
RG:Z: tag on the genome BAM. The transcriptome BAM was emitting
0 RG:Z: records vs STAR's 1-per-record output.

Add the RG tag stamp to the transcriptome record builder (paired-end
and single-end paths) so every record carries RG:Z:<id> matching the
@rg header, byte-symmetric with STAR.

Fixes #32

* feat(bam): populate @pg header with PN/VN/CL fields

Per SAM spec §1.3, the @pg line conventionally carries PN (program name),
VN (version), and CL (command line) alongside ID. rustar was emitting
only ID:rustar-aligner, leaving downstream provenance tools (MultiQC's
program-version table, dx-toolkit lineage tracking) with a blank entry.

Expand the header writer to emit:
  @pg\tID:rustar-aligner\tPN:rustar-aligner\tVN:<cargo pkg version>\tCL:<args>

The full command line is captured in main() before clap parses it, then
threaded into Parameters via a new (skip) field so it reaches the SAM
header builder. Version comes from CARGO_PKG_VERSION at compile time.

This matches STAR's @pg format and gives downstream tools the provenance
they need.

Fixes #33 (the @pg header gap; AS divergence is a separate item).

* chore(params): trim verbose doc comment

* chore(chimeric): remove narrative test comment

* chore(bam): remove narrative inline comment

* chore(transcriptome-bam): remove narrative comments

* chore(bam): remove narrative comments

* fix(transcriptome-bam): populate mate fields on paired-end records

build_transcriptome_records_pe was stamping SEGMENTED + FIRST/LAST_SEGMENT
but leaving PROPERLY_SEGMENTED, MATE_REVERSE_COMPLEMENTED, RNEXT, PNEXT,
and TLEN unset on every transcriptome record. Salmon's alignment-mode
fragment-length inference reads those mate fields; with them missing it
falls back to its 250+/-25 prior, distorting paired-end TPMs.

Walk the projected (rec1, rec2) pairs after the existing flag-stamping
loops and mirror the mate-bookkeeping the genome-space build_paired_mate_record
already does: PROPERLY_SEGMENTED + MATE_REVERSE_COMPLEMENTED + mate ref id +
mate position + signed TLEN (leftmost +, rightmost -).

Fixes #22

* fix(bam): emit SAM-spec NM:i: tag distinct from STAR-internal nM:i:

--outSAMattributes NM was being routed to the existing nM writer
(substitutions only), so requests for the SAM-spec edit-distance tag
silently produced wrong values. Downstream tools that read NM:i:
(samtools stats, picard, MultiQC) saw nothing.

Treat NM and nM as distinct attribute tokens. When NM is requested,
compute SAM-spec edit distance per the SAM v1 spec section 1.4:
substitutions + inserted bases + deleted bases (excluding intron N
skips). Keep nM emission unchanged when explicitly requested.

Fixes #29

* feat(bam): emit XS:A: strand tag when --outSAMstrandField intronMotif

The flag was accepted at the CLI parser but never produced any XS:A:
tags on the output BAM. Downstream tools that need strand inference
(StringTie, Cufflinks, rseqc infer_experiment.py) saw nothing.

Mirror STAR's two coupling paths: --outSAMattributes XS auto-enables
--outSAMstrandField intronMotif, and vice versa. For each output record
with at least one canonical intron motif, map the motif to + or - per
STAR's convention; non-canonical or mixed motifs omit the tag.

Fixes #30

* feat(output): emit Log.out, Log.progress.out, _STARpass1/SJ.out.tab

STAR writes three log files alongside Log.final.out and keeps two-pass
intermediates in <prefix>_STARpass1/. rustar was only writing
Log.final.out and emitting the pass-1 SJ tab as
<prefix>SJ.pass1.out.tab at the top level.

Add minimal Log.out (parameters dump + per-phase timestamps) and
Log.progress.out (timestamp + mapping speed) writers next to the
existing Log.final.out writer. Move the pass-1 SJ tab into
<prefix>_STARpass1/SJ.out.tab and mkdir the parent first.

The Log.out content is intentionally a stub matching the file's
existence rather than STAR's full verbosity; that's a follow-up.

Fixes #28

Co-Authored-By: Claude <noreply@anthropic.com>

* fix(sjdb): key SpliceJunctionDb in genome-absolute 0-based coordinates

The DB was keyed in chromosome-local 1-based coords (straight from the
GTF) but consulted in genome-absolute 0-based at stitch time, so every
sjdb lookup during alignment missed. Annotated splice events never got
the sjdb_score bonus and the stricter overhang gate fired in their
place, dropping ~50 % of GT/AG splices on the test profile.

The stats-time site queried in genome-absolute 1-based-equivalent, so
it accidentally matched on chr 0 (chr_start=0) but missed on every
other chromosome -- producing inconsistent answers between SJ.out.tab
and Log.final.out on the same BAM.

Normalise the DB to genome-absolute 0-based at construction, matching
the convention prepared_junctions and SpliceJunctionStats already use.
Update the stats-time call site to query in the new space. Stitch-time
needs no change.

Fixes #27

Co-Authored-By: Claude <noreply@anthropic.com>

* revert(stats): drop Log.out / Log.progress.out stub writers

The previous commit added writers that mimicked STAR's section-header
structure with placeholder content (a Debug-format params dump + three
timestamps for Log.out, a header + final-timestamp line for
Log.progress.out). That closes the file-existence gap but makes the
output look like STAR-equivalent verbose logging when it isn't —
downstream tools parsing for memory usage, per-chunk progress, or
warnings would silently get nothing.

Drop the stub writers. The SJ.pass1.out.tab → <prefix>_STARpass1/SJ.out.tab
placement fix stays — that's a real path-parity change with no fakery.

Real STAR-equivalent Log.out / Log.progress.out content (per-phase
progress, warnings, memory, periodic updates during long runs) is a
separate engineering effort, not a stub.

Co-Authored-By: Claude <noreply@anthropic.com>

* chore(params): warn at runtime when --limitGenomeGenerateRAM is set

Accepting the flag silently could mislead a user who passes a non-default
cap into thinking rustar enforces it. Emit a warning whenever the value
differs from STAR's default so the user knows the cap isn't applied.

Co-Authored-By: Claude <noreply@anthropic.com>

* fix(sjdb): off-by-one on stitch-time acceptor coordinate

The previous commit normalised the DB to genome-absolute 0-based and
updated the stats-time query. Stitch-time was left unchanged on the
assumption that it already passed genome-absolute 0-based — half right.
donor_fwd was correct (first intron base, 0-based) but acceptor_fwd =
donor_fwd + del landed on the first base AFTER the intron, while the
DB keys store the last intron base. Lookup missed every annotated
junction.

Subtract 1 from the acceptor to land on the last intron base. After
this, Log.final.out Annotated (sjdb) goes from 0 to a non-zero count
that's consistent with the annotated=1 row count in SJ.out.tab on
the same BAM.

Co-Authored-By: Claude <noreply@anthropic.com>

* fix: decode Gsj-region SA hits into split splice seeds

Pass-1 alignment was silently dropping SA hits that fall in the appended
splice-junction flanking buffer (Gsj). Those hits encode candidate
splice events for annotated junctions: when a read seed straddles the
donor/acceptor boundary of a Gsj slot, the matching genome bytes live
in two non-adjacent real-genome positions. STAR decodes these via
g1 >= P.nGenomeReal and feeds the (donor, acceptor) pair into the seed
pipeline; rustar-aligner was discarding them at position_to_chr (which
returns None for positions past the last real chromosome).

Plumb n_genome_real onto Genome (pinned at the pre-sjdb forward total)
and reload prepared junctions + sjdbOverhang from sjdbInfo.txt on the
index-load path. cluster_seeds now expands each SA hit through
decode_gsj_hit: real-genome hits pass through unchanged, single-flank
Gsj hits are skipped (the equivalent real-genome SA entry already
exists in the same range), and boundary-crossing Gsj hits split into
two virtual seeds with the donor-side read run and the acceptor-side
read run pointing at their respective real-genome positions. The
existing stitch DP then chains them via its splice branch.

Verified on the yeast SRR6357072 reproducer from the issue:
Number of splices: Total jumps from 366 to 631 (target ~720) and
GT/AG from 266 to 531 with non-canonical unchanged at 93. The
remaining gap is gated on PR #45's annotated-junction lookup landing.

Fixes #47

Co-Authored-By: Claude <noreply@anthropic.com>

* fix(stitch): use sjdb_score in place of motif_score on annotated junctions

STAR's Transcript_alignScore.cpp adds either sjdbScore (annotated) or
motifScore (unannotated) to the alignment score, not both. rustar was
adding both on annotated junctions, over-crediting the score by
motif_score per junction.

Currently invisible because #27 keeps is_annotated false in practice;
becomes the dominant remaining AS divergence on annotated splices as
soon as #27 lands.

Fixes #50

Co-Authored-By: Claude <noreply@anthropic.com>

* fix(score): score_annotated_junction helper uses replacement not addition

The helper at src/align/score.rs:137-143 had the same additive bug as
the stitch-time call site this PR fixes — adding sjdb_score to the
motif score rather than replacing it. Currently only called from its
own tests, but a landmine if any other caller picks it up.

Switch to replacement semantics matching the stitch path. Update the
tests that locked in the old (wrong) additive result.

Co-Authored-By: Claude <noreply@anthropic.com>

* fix(stats): route reads exceeding --outFilterMultimapNmax to too_many_loci

The existing AlignmentStats::record_alignment routing arm for n_alignments
> max_multimaps was unreachable on the PE path - the too-many-loci filter
returned n_for_mapq = 0, so the caller hit record_alignment(0, ...) and
incremented the unmapped bucket. Log.final.out always reported 0 in the
too-many-loci field and the per-bucket sum fell 3-of-50000 short of input
on the yeast PE test profile (STAR records exactly 3 there on the same data).

Return joint_pairs.len() as n_for_mapq from the PE filter site and route
through record_alignment with that count, mirroring the SE caller pattern
so the _ => arm fires. The read still isn't emitted in the BAM with
outSAMmultNmax=-1, matching STAR; only the accounting changes.

Fixes #53

Co-Authored-By: Claude <noreply@anthropic.com>

* refactor(stats): drop redundant n_for_mapq conditional in PE caller

n_for_mapq is usize, so `if n > 0 { n } else { 0 }` is identical to
passing n_for_mapq directly to record_alignment. Match the SE caller
shape: the variable already carries the right count from the PE
TooManyLoci return site, and record_alignment routes via max_multimaps.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix test

* upstream changes

* chore: switch sam attributes to bitflags

* Extend PR #36 for human readable ram entries, like 64G

* add missing n_genome_real for #51

* Fix CigarOp to cigar::op::new

* fmt fix

* Update run_tests.sh

---------

Co-authored-by: Jonathan Manning <jonathan.manning@seqera.io>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Philipp A. <flying-sheep@web.de>
@Psy-Fer
Copy link
Copy Markdown
Collaborator

Psy-Fer commented May 29, 2026

merged in #77

@Psy-Fer Psy-Fer closed this May 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

pass-1 alignment doesn't seed candidates from --sjdbGTFfile; ~50% of GT/AG splices missed even after #27

3 participants