[BWARE] Handle hash columns in transform decoders and tighten decode metadata #2479
Open
Baunsgaard wants to merge 4 commits into
Codecov Report

❌ Patch coverage is …

Coverage Diff

| | main | #2479 | +/- |
|---|---|---|---|
| Coverage | 71.37% | 71.35% | -0.03% |
| Complexity | 48749 | 48737 | -12 |
| Files | 1571 | 1571 | |
| Lines | 188912 | 188978 | +66 |
| Branches | 37067 | 37078 | +11 |
| Hits | 134845 | 134850 | +5 |
| Misses | 43601 | 43648 | +47 |
| Partials | 10466 | 10480 | +14 |
… flow

Reworks transform decoders so that hash-encoded columns survive the inverse-transform path, and tightens how decoder metadata (column indices, value mappings) is propagated and initialized.

- Decoder: pass column-id arrays through decode/decodeFromMap so each decoder knows its own output column range
- DecoderRecode: skip recode for hash columns, keep encoded ints passthrough; init metadata from frame consistently
- DecoderDummycode: handle hash columns when expanding categorical bits; parallel decode path; sparse-friendly init
- DecoderPassThrough / DecoderBin / DecoderComposite / DecoderFactory: consume the new column-id arrays from the dispatch layer
- ColumnEncoderFeatureHash: align hash bookkeeping with the decode-side changes
- Frame columns (HashMapToInt, StringArray): small support changes consumed by the decoder path above
Fix two regressions in the transform decode rewrite that broke encode/decode roundtrips on dummycoded/recoded frames:

- DecoderDummycode.decodeSparse compared 0-based sparse column indexes against the 1-based _clPos/_cuPos bounds used by the dense path (in.get(i, k-1)). This shifted every lookup by one column, so the first category was never matched (decoded as null) and all others decoded one code too low. Shift the sparse bounds and index to be 0-based, matching the dense path.
- Restore the public no-arg constructors on DecoderComposite, DecoderBin, DecoderPassThrough, and DecoderRecode. Decoder is Externalizable, and Spark broadcasts the top-level decoder via Java serialization, which requires a public no-arg constructor; without it deserialization fails with InvalidClassException on executors.

Restores passing of TransformFrameEncodeColmapTest, TransformFrameEncodeDecodeTest, TransformCSVFrameEncodeDecodeTest, and TransformFrameEncodeDecodeTokenTest in single-node and Spark modes.
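The constructor requirement in the second fix can be reproduced outside SystemDS. The following is a minimal sketch using only `java.io`; `ToyDecoder`, its `colList` field, and `roundTrip` are hypothetical stand-ins, not the project's classes. It shows an Externalizable object surviving a serialization round trip because it declares a public no-arg constructor. Narrowing that constructor's visibility would be expected to make `readObject` fail with `InvalidClassException`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;

final class ExternalizableSketch {

    /** Hypothetical toy decoder; it only carries a column-id array. */
    static class ToyDecoder implements Externalizable {
        private int[] colList = new int[0];

        /** Public no-arg constructor: deserialization calls this before readExternal. */
        public ToyDecoder() {}

        ToyDecoder(int[] colList) {
            this.colList = colList.clone();
        }

        int[] getColList() {
            return colList.clone();
        }

        @Override
        public void writeExternal(ObjectOutput out) throws IOException {
            out.writeInt(colList.length);
            for (int c : colList)
                out.writeInt(c);
        }

        @Override
        public void readExternal(ObjectInput in) throws IOException {
            colList = new int[in.readInt()];
            for (int i = 0; i < colList.length; i++)
                colList[i] = in.readInt();
        }
    }

    /** Serializes and deserializes in memory, loosely mimicking a broadcast. */
    static ToyDecoder roundTrip(ToyDecoder d) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bytes)) {
                oos.writeObject(d);
            }
            try (ObjectInputStream ois = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (ToyDecoder) ois.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Keeping the no-arg constructor public while leaving the state-carrying constructor for normal use lets the serialization layer create an empty instance and fill it via `readExternal`.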
Cover the decoder paths touched by the hash-column and decode-metadata changes: parallel block decode equals serial decode, the sparse and dense dummycode decode paths agree, feature-hash columns decode through dummycode via the magic domain-size metadata, and bin columns decode correctly when their source position is shifted by dummycoding of another column.

Add exact inverse round-trip checks for recode and dummycode to validate the sparse binary-search decode against ground truth.
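The sparse/dense agreement being tested can be illustrated with a toy model. This is a sketch under stated assumptions, not the DecoderDummycode implementation: the method names are hypothetical, the bounds `clPos`/`cuPos` are taken to be 1-based and inclusive, a sparse row is represented only by its sorted 0-based non-zero column indexes, and a return value of 0 stands for "no category set".

```java
import java.util.Arrays;

final class DummycodeDecodeSketch {

    /** Dense path: scan the 1-based range [clPos, cuPos], reading cell k-1. */
    static int decodeDense(double[] row, int clPos, int cuPos) {
        for (int k = clPos; k <= cuPos; k++)
            if (row[k - 1] != 0)
                return k - clPos + 1; // 1-based category code
        return 0; // no bit set in this range
    }

    /** Sparse path: binary search over sorted 0-based non-zero column indexes. */
    static int decodeSparse(int[] nonZeroCols, int clPos, int cuPos) {
        // Shift the 1-based bounds to 0-based so they match the sparse indexes.
        int lo = clPos - 1;
        int hi = cuPos - 1;
        int pos = Arrays.binarySearch(nonZeroCols, lo);
        if (pos < 0)
            pos = -pos - 1; // insertion point: first index greater than lo
        if (pos < nonZeroCols.length && nonZeroCols[pos] <= hi)
            return nonZeroCols[pos] - lo + 1; // same 1-based code as the dense path
        return 0;
    }
}
```

If the sparse path compared the 0-based indexes against the unshifted 1-based bounds, a bit in the first column of the range would fall outside the search window, which is the kind of disagreement an exact round-trip check would surface.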
Replace the toLowerCase plus equals chain in the parse fallback with a length-based dispatch: a single char compare for the 1-char "t"/"f" tokens and compareToIgnoreCase for "true"/"false", matching the idiom already used in DoubleArray.parseDouble. This avoids allocating a lower-cased copy and rejects non-boolean strings immediately.

Restore throwing DMLRuntimeException on unparseable input. The previous re-throw of the raw NumberFormatException changed the exception type and broke callers such as Array.extractDouble that expect DMLRuntimeException; the throw path is the genuinely-exceptional case, so the wrapping cost is irrelevant there.
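A length-based dispatch of this kind might look roughly like the sketch below. It is not the code from this PR: the class and method names are hypothetical, `ParseException` is a stand-in for DMLRuntimeException, accepting upper-case "T"/"F" is an assumption, and any numeric fallback the real method may have is omitted.

```java
final class BooleanTokenSketch {

    /** Stand-in for the project's DMLRuntimeException (hypothetical here). */
    static final class ParseException extends RuntimeException {
        ParseException(String msg) {
            super(msg);
        }
    }

    /** Parses "t"/"f"/"true"/"false" without allocating a lower-cased copy. */
    static boolean parseBooleanToken(String s) {
        switch (s.length()) {
            case 1: {
                // Single char compare; upper-case acceptance is an assumption.
                char c = s.charAt(0);
                if (c == 't' || c == 'T')
                    return true;
                if (c == 'f' || c == 'F')
                    return false;
                break;
            }
            case 4:
                if (s.compareToIgnoreCase("true") == 0)
                    return true;
                break;
            case 5:
                if (s.compareToIgnoreCase("false") == 0)
                    return false;
                break;
            default:
                break; // any other length cannot be a boolean token
        }
        // One exception type for all unparseable input, so callers catch a single class.
        throw new ParseException("Unable to parse boolean: " + s);
    }
}
```

Switching on the length first means strings of any other length are rejected without a single character comparison, and the string comparisons run only for lengths 4 and 5.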
25b1a67 to d0083b1