Fix Base45 dropping trailing bytes on non-ASCII input #45
Open
gaoflow wants to merge 1 commit into
The Base45 encoder and decoder iterated with range(0, len(text), step) while indexing into t = b(text). codext converts a bytes input to str (UTF-8) before the codec runs, so for any non-ASCII content b(text) is longer than text and len(text) stops the loop early, silently dropping the trailing byte(s). For example encode(b'\xcf\xb1\x1b') returned 'OBQ' instead of 'OBQR0' and the value no longer round-tripped. Iterate over len(t) (the actual byte sequence) instead. Output now matches RFC 9285 and the reference base45 implementation, and encoding round-trips for arbitrary byte input.
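The length mismatch behind this can be reproduced outside the codec. The snippet below is only an illustration: it assumes the bytes-to-str conversion is a plain UTF-8 decode and that b(text) behaves like a UTF-8 encode, and it uses a step of 2 (the encoder's group size). The variable names are hypothetical and are not taken from the codext source.

```python
# Illustration only: assumes the codec layer decodes bytes as UTF-8 and that
# b(text) is equivalent to text.encode("utf-8"). Names are hypothetical.
raw = b"\xcf\xb1\x1b"

text = raw.decode("utf-8")   # what the codec function receives (a str)
t = text.encode("utf-8")     # stand-in for t = b(text)

# \xcf\xb1 is one character (U+03F1), so the str is shorter than the bytes.
print(len(text), len(t))     # 2 3

# Group offsets visited by the loop, two bytes per group.
buggy_offsets = list(range(0, len(text), 2))
fixed_offsets = list(range(0, len(t), 2))
print(buggy_offsets, fixed_offsets)  # [0] [0, 2]
```

With len(text) as the bound, the group starting at offset 2 is never visited, which is the dropped trailing byte.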
Problem
The Base45 codec silently drops trailing bytes when the input contains non-ASCII content, producing output that is too short and no longer round-trips.
Cause
base45_encode / base45_decode iterate with range(0, len(text), step) but index into t = b(text). Because the codec layer converts a bytes input to str (UTF-8) before the codec runs, b(text) is longer than text whenever the content is non-ASCII, so len(text) ends the loop early and the final group is never emitted. For b'\xcf\xb1\x1b' (\xcf\xb1 decodes to the single character U+03F1), text has length 2 while the byte sequence has length 3, dropping the third byte.
Fix
Iterate over len(t) (the actual byte sequence) in both functions. Encoded output now matches RFC 9285 and the reference base45 implementation, and encoding round-trips for arbitrary byte input.
Tests
Added test_codec_base45 covering the RFC 9285 vectors (AB, Hello!!, base-45), the exact regression value (b'\xcf\xb1\x1b' -> b'OBQR0'), and round-trips for binary inputs. The full test suite passes.
I worked on this with AI assistance under my direction and reviewed the change myself.
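For reviewers who want to check the expected values independently, here is a minimal standalone sketch of Base45 as described in RFC 9285. It is not the codext implementation: b45_encode, b45_decode and ALPHABET are hypothetical names used only for this illustration, and the loop bounds are taken from the byte sequence itself, which is the behaviour this PR restores.

```python
# Standalone sketch of RFC 9285 Base45; not the codext code.
# b45_encode, b45_decode and ALPHABET are hypothetical names.
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ $%*+-./:"


def b45_encode(data: bytes) -> str:
    out = []
    # Bound the loop by the byte sequence, two bytes per group.
    for i in range(0, len(data), 2):
        chunk = data[i:i + 2]
        if len(chunk) == 2:
            # n = c + 45*d + 2025*e, emitted as the characters c, d, e.
            n = (chunk[0] << 8) | chunk[1]
            n, c = divmod(n, 45)
            e, d = divmod(n, 45)
            out.append(ALPHABET[c] + ALPHABET[d] + ALPHABET[e])
        else:
            # A trailing single byte becomes two characters.
            d, c = divmod(chunk[0], 45)
            out.append(ALPHABET[c] + ALPHABET[d])
    return "".join(out)


def b45_decode(text: str) -> bytes:
    out = bytearray()
    for i in range(0, len(text), 3):
        # str.index raises ValueError for characters outside the alphabet.
        group = [ALPHABET.index(ch) for ch in text[i:i + 3]]
        if len(group) == 3:
            n = group[0] + group[1] * 45 + group[2] * 2025
            if n > 0xFFFF:
                raise ValueError("Base45 group out of range")
            out += n.to_bytes(2, "big")
        elif len(group) == 2:
            n = group[0] + group[1] * 45
            if n > 0xFF:
                raise ValueError("Base45 group out of range")
            out.append(n)
        else:
            raise ValueError("dangling Base45 character")
    return bytes(out)


print(b45_encode(b"AB"))            # BB8
print(b45_encode(b"\xcf\xb1\x1b"))  # OBQR0
```

The last call reproduces the regression value: the pair \xcf\xb1 encodes to OBQ and the trailing \x1b to R0, so an encoder that stops after the first group returns only OBQ.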