Fix 64-bit FP numbers colliding with SAS subheader signatures #369
Merged
evanmiller merged 1 commit into WizardMac:dev on Apr 24, 2026
Thanks!
This PR fixes a bug in reading little-endian 64-bit SAS data files, such as those typically created on Linux.
Reproducing the problem
The PR contains an additional test case, which fails with the current dev branch and succeeds with the changes from this PR.
It can also be reproduced with any little-endian 64-bit SAS data file that
The bug manifests either as missed rows (when the data is mistaken for a subheader whose signature is ignored) or as parsing errors.
The write feature of ReadStat handles the distinction between 32-bit and 64-bit subheader signatures correctly. It's only the read feature that is affected by the bug.
Root cause
ReadStat currently detects special subheaders in SAS data files by comparing 4-byte constants only, regardless of whether the file uses 4-byte signatures (32-bit files) or 8-byte signatures (64-bit files). This causes problems for subheaders containing data, which are supposed to be identified by not starting with a recognized subheader signature.
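The ambiguity can be sketched as follows. This is a minimal illustration in Python rather than ReadStat's C, not the library's actual code: the function name is hypothetical, and the zero padding after the 64-bit row size signature is an assumption.

```python
import struct

ROW_SIZE_SIGNATURE = 0xF7F7F7F7  # 4-byte row size signature constant


def naive_is_row_size(subheader: bytes) -> bool:
    """Hypothetical check: compares the first 4 bytes only,
    for 32-bit and 64-bit files alike."""
    (first4,) = struct.unpack_from("<I", subheader, 0)
    return first4 == ROW_SIZE_SIGNATURE


# Assumed start of a 64-bit little-endian row size subheader:
# the 4-byte signature followed by zero padding.
real_subheader = bytes.fromhex("f7f7f7f7 00000000")

# Start of a data row whose first 8-byte column merely begins
# with the same 4 bytes.
data_row = bytes.fromhex("f7f7f7f7 f01e513f")

print(naive_is_row_size(real_subheader))
print(naive_is_row_size(data_row))
```

Both calls return True, so the 4-byte comparison cannot tell the data row apart from a genuine row size subheader.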
An example would be a row whose first floating-point column contains the value 0.0010449746331455659, which is written as F7 F7 F7 F7 F0 1E 51 3F in a little-endian 64-bit data file. When ReadStat currently sees such a row, it only checks the first (less significant) 4 bytes F7 F7 F7 F7, which it accepts as the signature of a row size subheader.
Proposed fix
The PR proposes to differentiate between the subheader signatures (which remain 4-byte constants) and subheader types (which are represented by a new enum).
The PR is inspired by the write feature of ReadStat, which also uses only the 4-byte constants for subheader signatures but pads them with the required zeros or ones in the case of 64-bit data files. Similarly, the PR proposes to check not only the significant 4-byte signature but also the expected padding in the case of 64-bit data files.
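The idea of separating signatures from types and verifying the padding can be sketched as follows. This is a rough Python illustration, not the PR's C implementation: the enum and function names are hypothetical, only the row size signature is modeled, and zero padding for that signature in little-endian 64-bit files is an assumption.

```python
import struct
from enum import Enum, auto


class SubheaderType(Enum):  # hypothetical names, not ReadStat's enum
    ROW_SIZE = auto()
    DATA = auto()


ROW_SIZE_SIGNATURE = 0xF7F7F7F7  # remains a 4-byte constant


def classify(subheader: bytes, u64: bool) -> SubheaderType:
    """Map the leading bytes of a little-endian subheader to a type."""
    if len(subheader) < (8 if u64 else 4):
        return SubheaderType.DATA
    (signature,) = struct.unpack_from("<I", subheader, 0)
    if signature != ROW_SIZE_SIGNATURE:
        return SubheaderType.DATA
    if u64:
        # Assumption: in 64-bit files the row size signature is
        # followed by 4 zero bytes of padding.
        (padding,) = struct.unpack_from("<I", subheader, 4)
        if padding != 0:
            return SubheaderType.DATA
    return SubheaderType.ROW_SIZE


# Bytes of the example row from this PR, decoded as a little-endian double.
data_row = bytes.fromhex("f7f7f7f7 f01e513f")
(value,) = struct.unpack("<d", data_row)
row_size_subheader = bytes.fromhex("f7f7f7f7 00000000")

print(value)
print(classify(data_row, u64=True))
print(classify(row_size_subheader, u64=True))
```

With the padding check in place, the example row is classified as data, while a signature followed by zero padding is still recognized as a row size subheader.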