[BWARE] Handle hash columns in transform decoders and tighten decode metadata#2479

Open
Baunsgaard wants to merge 4 commits into apache:main from Baunsgaard:split/decoderHash

Conversation

@Baunsgaard

Contributor

Reworks transform decoders so that hash-encoded columns survive the inverse-transform path, and tightens how decoder metadata (column indices, value mappings) is propagated and initialized.

  • Decoder: pass column-id arrays through decode/decodeFromMap so each decoder knows its own output column range
  • DecoderRecode: skip recoding for hash columns and pass their encoded ints through unchanged; initialize metadata from the frame consistently
  • DecoderDummycode: handle hash columns when expanding categorical bits; add a parallel decode path; sparse-friendly initialization
  • DecoderPassThrough / DecoderBin / DecoderComposite / DecoderFactory: consume the new column-id arrays from the dispatch layer
  • ColumnEncoderFeatureHash: align hash bookkeeping with the decode-side changes
  • Frame columns (HashMapToInt, StringArray): small support changes consumed by the decoder path above
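
To illustrate the DecoderRecode bullet above, here is a minimal sketch of a row decoder that maps recode codes back to their original values but leaves hash columns as encoded integers, since a hash bucket id cannot be mapped back to a unique input string. The class `RecodeSketch`, the method `decodeRow`, and the convention of marking a hash column with a `null` dictionary are assumptions made for this illustration; they are not the SystemDS `DecoderRecode` API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a recode decoder; not the SystemDS DecoderRecode API. */
final class RecodeSketch {

    /**
     * Decodes one encoded row.
     *
     * @param codes encoded row: recode codes or hash bucket ids, one per column
     * @param dicts per-column code-to-value dictionary; null marks a hash column (assumed convention)
     * @return decoded values; hash columns keep their encoded integer
     */
    static Object[] decodeRow(int[] codes, List<Map<Integer, String>> dicts) {
        Object[] out = new Object[codes.length];
        for (int j = 0; j < codes.length; j++) {
            Map<Integer, String> dict = dicts.get(j);
            if (dict == null) {
                // Hash column: hashing is lossy, so pass the bucket id through unchanged.
                out[j] = Integer.valueOf(codes[j]);
            } else {
                // Recode column: look the code up in the inverse dictionary.
                out[j] = dict.get(codes[j]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map<Integer, String>> dicts = Arrays.asList(Map.of(1, "a", 2, "b"), null);
        // Column 0 is recoded, column 1 is hashed.
        System.out.println(Arrays.toString(decodeRow(new int[] {2, 7}, dicts))); // prints [b, 7]
    }
}
```

Keeping the hash check inside the per-column loop means a single decoder can serve frames that mix recoded and hashed columns.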

@codecov

codecov Bot commented Jun 8, 2026

Codecov Report

❌ Patch coverage is 52.89855% with 65 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.35%. Comparing base (88c26e2) to head (25b1a67).
⚠️ Report is 8 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...che/sysds/runtime/transform/decode/DecoderBin.java | 33.33% | 20 Missing and 2 partials ⚠️ |
| ...apache/sysds/runtime/transform/decode/Decoder.java | 0.00% | 16 Missing ⚠️ |
| ...sysds/runtime/frame/data/columns/HashMapToInt.java | 0.00% | 11 Missing ⚠️ |
| ...sds/runtime/transform/decode/DecoderDummycode.java | 75.00% | 6 Missing and 5 partials ⚠️ |
| ...s/runtime/transform/decode/DecoderPassThrough.java | 70.00% | 1 Missing and 2 partials ⚠️ |
| .../sysds/runtime/transform/decode/DecoderRecode.java | 81.81% | 2 Missing ⚠️ |

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2479      +/-   ##
============================================
- Coverage     71.37%   71.35%   -0.03%     
+ Complexity    48749    48737      -12     
============================================
  Files          1571     1571              
  Lines        188912   188978      +66     
  Branches      37067    37078      +11     
============================================
+ Hits         134845   134850       +5     
- Misses        43601    43648      +47     
- Partials      10466    10480      +14     

… flow

Reworks transform decoders so that hash-encoded columns survive the
inverse-transform path, and tightens how decoder metadata (column
indices, value mappings) is propagated and initialized.

- Decoder: pass column-id arrays through decode/decodeFromMap so each
  decoder knows its own output column range
- DecoderRecode: skip recode for hash columns, keep encoded ints
  passthrough; init metadata from frame consistently
- DecoderDummycode: handle hash columns when expanding categorical
  bits; parallel decode path; sparse-friendly init
- DecoderPassThrough / DecoderBin / DecoderComposite / DecoderFactory:
  consume the new column-id arrays from the dispatch layer
- ColumnEncoderFeatureHash: align hash bookkeeping with the
  decode-side changes
- Frame columns (HashMapToInt, StringArray): small support changes
  consumed by the decoder path above

Fix two regressions in the transform decode rewrite that broke
encode/decode roundtrips on dummycoded/recoded frames:

- DecoderDummycode.decodeSparse compared 0-based sparse column indexes
  against the 1-based _clPos/_cuPos bounds used by the dense path
  (in.get(i, k-1)). This shifted every lookup by one column, so the
  first category was never matched (decoded as null) and all others
  decoded one code too low. Shift the sparse bounds and index to be
  0-based, matching the dense path.

- Restore the public no-arg constructors on DecoderComposite, DecoderBin,
  DecoderPassThrough, and DecoderRecode. Decoder is Externalizable, and
  Spark broadcasts the top-level decoder via Java serialization, which
  requires a public no-arg constructor; without it deserialization fails
  with InvalidClassException on executors.
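
The constructor requirement in the second bullet can be reproduced with plain Java serialization, independent of Spark. The sketch below uses two hypothetical classes, `GoodDecoder` and `BadDecoder`, which are stand-ins and not the SystemDS decoders: both serialize fine, but only the one with a public no-arg constructor can be read back.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.InvalidClassException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

/** Hypothetical decoder stand-in WITH a public no-arg constructor. */
class GoodDecoder implements Externalizable {
    int[] colList = new int[0];

    public GoodDecoder() {} // required by Externalizable deserialization

    GoodDecoder(int[] colList) { this.colList = colList; }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeInt(colList.length);
        for (int c : colList) out.writeInt(c);
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException {
        colList = new int[in.readInt()];
        for (int i = 0; i < colList.length; i++) colList[i] = in.readInt();
    }
}

/** Same state and I/O, but the only declared constructor takes an argument. */
class BadDecoder extends GoodDecoder {
    BadDecoder(int[] colList) { super(colList); }
}

final class ExternalizableSketch {
    /** Serializes and deserializes o; returns null if reading fails with InvalidClassException. */
    static Object roundTripOrNull(Object o) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bytes)) {
                oos.writeObject(o); // writing succeeds for both classes
            }
            try (ObjectInputStream ois =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return ois.readObject();
            }
        } catch (InvalidClassException e) {
            return null; // "no valid constructor"
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTripOrNull(new GoodDecoder(new int[] {1, 3})) != null);
        System.out.println(roundTripOrNull(new BadDecoder(new int[] {1, 3})) != null);
    }
}
```

Because the failure only appears on the reading side, a missing constructor of this kind can pass local single-process tests and surface only once the object is shipped to another JVM.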

Restores passing of TransformFrameEncodeColmapTest,
TransformFrameEncodeDecodeTest, TransformCSVFrameEncodeDecodeTest, and
TransformFrameEncodeDecodeTokenTest in single-node and Spark modes.

Cover the decoder paths touched by the hash-column and decode-metadata
changes: parallel block decode equals serial decode, the sparse and
dense dummycode decode paths agree, feature-hash columns decode through
dummycode via the magic domain-size metadata, and bin columns whose
source position is shifted by dummycoding of another column. Add exact
inverse round-trip checks for recode and dummycode to validate the
sparse binary-search decode against ground truth.
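
As a rough illustration of what "the sparse and dense dummycode decode paths agree" means, the sketch below decodes one dummycoded source column from a single row in both representations. It assumes the source column occupies the 1-based encoded positions `clPos` (inclusive) to `cuPos` (exclusive); that bound convention, the class `DummycodeSketch`, and both method names are assumptions for this example, not the SystemDS `DecoderDummycode` code.

```java
import java.util.Arrays;

/** Hypothetical dense and sparse dummycode decode for one source column of one row. */
final class DummycodeSketch {

    /** Dense path: scan the 1-based positions [clPos, cuPos) and read cell k-1. */
    static int decodeDense(double[] row, int clPos, int cuPos) {
        for (int k = clPos; k < cuPos; k++) {
            if (row[k - 1] != 0) {
                return k - clPos + 1; // 1-based category code
            }
        }
        return 0; // no category bit set
    }

    /** Sparse path: binary-search the sorted 0-based column indexes of the nonzeros. */
    static int decodeSparse(int[] nnzIdx, int clPos, int cuPos) {
        int lo = clPos - 1; // shift the 1-based bounds to 0-based
        int hi = cuPos - 1;
        int pos = Arrays.binarySearch(nnzIdx, lo);
        if (pos < 0) {
            pos = -pos - 1; // insertion point: first index >= lo
        }
        if (pos < nnzIdx.length && nnzIdx[pos] < hi) {
            return nnzIdx[pos] - lo + 1; // same 1-based code as the dense path
        }
        return 0;
    }

    public static void main(String[] args) {
        // Column A: 3 categories at positions [1, 4); column B: 2 categories at [4, 6).
        double[] dense = {1, 0, 0, 0, 1}; // A = category 1, B = category 2
        int[] sparse = {0, 4};            // 0-based indexes of the nonzeros
        System.out.println(decodeDense(dense, 1, 4) + " " + decodeSparse(sparse, 1, 4)); // prints 1 1
        System.out.println(decodeDense(dense, 4, 6) + " " + decodeSparse(sparse, 4, 6)); // prints 2 2
    }
}
```

If the `- 1` shift on `lo` and `hi` is dropped in `decodeSparse`, column A's first category is no longer found and every other category decodes one code too low, which is the kind of mismatch an agreement test between the two paths is meant to catch.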

Replace the toLowerCase plus equals chain in the parse fallback with a
length-based dispatch: a single char compare for the 1-char "t"/"f"
tokens and compareToIgnoreCase for "true"/"false", matching the idiom
already used in DoubleArray.parseDouble. This avoids allocating a
lower-cased copy and rejects non-boolean strings immediately.
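
A minimal sketch of the length-based dispatch described above follows. The class `BooleanParseSketch`, the method `parseBoolean`, and the exception `ParseFailure` (a stand-in for the project's own runtime exception) are hypothetical names, and accepting upper-case `T`/`F` is an assumption; the actual fallback in the frame column code may differ.

```java
/** Hypothetical length-dispatched boolean parser; names are illustrative only. */
final class BooleanParseSketch {

    /** Stand-in for a project-specific unchecked exception type. */
    static final class ParseFailure extends RuntimeException {
        ParseFailure(String msg) { super(msg); }
    }

    static boolean parseBoolean(String s) {
        switch (s.length()) {
            case 1: {
                // Single char compare, no lower-cased copy is allocated.
                char c = s.charAt(0);
                if (c == 't' || c == 'T') return true;
                if (c == 'f' || c == 'F') return false;
                break;
            }
            case 4:
                if (s.compareToIgnoreCase("true") == 0) return true;
                break;
            case 5:
                if (s.compareToIgnoreCase("false") == 0) return false;
                break;
            default:
                break; // any other length cannot be a boolean token
        }
        throw new ParseFailure("Failed to parse boolean: " + s);
    }

    public static void main(String[] args) {
        System.out.println(parseBoolean("t"));     // prints true
        System.out.println(parseBoolean("False")); // prints false
    }
}
```

Strings whose length is not 1, 4, or 5 fall straight through to the throw without any character comparison.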

Restore throwing DMLRuntimeException on unparseable input. The previous
re-throw of the raw NumberFormatException changed the exception type and
broke callers such as Array.extractDouble that expect DMLRuntimeException;
the throw path is the genuinely-exceptional case, so the wrapping cost is
irrelevant there.

Baunsgaard force-pushed the split/decoderHash branch from 25b1a67 to d0083b1 on June 9, 2026 22:16