Fix 64-bit FP numbers colliding with SAS subheader signatures #369

Merged
evanmiller merged 1 commit into WizardMac:dev from hpoettker:fix-signature-collisions
Apr 24, 2026

@hpoettker
Contributor

This PR fixes a bug in reading little-endian 64-bit SAS data files, such as those typically created on Linux.

Reproducing the problem

The PR contains an additional test case, which fails with the current dev branch and succeeds with the changes from this PR.

It can also be reproduced with any little-endian 64-bit SAS data file that

  • stores uncompressed rows in subheaders,
  • has a column of (64-bit) floating point numbers,
  • contains a value in the first floating point number column whose 4 less significant bytes (the first 4 bytes as stored in a little-endian file) coincide with one of the recognized 32-bit SAS subheader signatures.

The bug manifests either as missing rows (when the colliding signature belongs to a subheader type that ReadStat ignores) or as parsing errors.
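A value that satisfies the last condition can be constructed directly from a signature. The following is a minimal Python sketch, not part of ReadStat: `colliding_double` is a hypothetical helper, and the only signature used is the row size signature F7 F7 F7 F7 quoted in this description. The more significant 4 bytes are a free choice; the ones used here are taken from the example under "Root cause".

```python
import struct

# Row size subheader signature, as quoted in this PR description.
ROW_SIZE_SIGNATURE = 0xF7F7F7F7


def colliding_double(signature: int, high_word: int) -> float:
    """Return the double whose little-endian encoding starts with `signature`.

    `high_word` supplies the 4 more significant bytes (sign, exponent and
    upper mantissa bits); the caller is free to choose it.
    """
    raw = struct.pack("<II", signature, high_word)
    return struct.unpack("<d", raw)[0]


# Hypothetical reproducer value: low word is the signature, high word is
# taken from the example in this description.
value = colliding_double(ROW_SIZE_SIGNATURE, 0x3F511EF0)
print(value)
```

Writing such a value into the first floating point column of a file that stores uncompressed rows in subheaders should be enough to trigger the problem described above.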

The write feature of ReadStat handles the distinction between 32-bit and 64-bit subheader signatures correctly. It's only the read feature that is affected by the bug.

Root cause

ReadStat currently detects special subheader signatures in SAS data files by comparing against 4-byte constants only, regardless of whether the file uses 4-byte signatures (32-bit files) or 8-byte signatures (64-bit files). This causes problems for subheaders that contain data, which are supposed to be identified by not starting with a recognized subheader signature.

For example, consider a row whose first floating point number column contains the value 0.0010449746331455659, which is written as F7 F7 F7 F7 F0 1E 51 3F in a little-endian 64-bit data file. When ReadStat currently sees such a row, it checks only the first 4 bytes (the less significant half), F7 F7 F7 F7, and accepts them as the signature of a row size subheader.
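The misclassification can be illustrated with a short Python sketch. This is not ReadStat's code: `naive_is_row_size` is a hypothetical stand-in for a check that looks at the first 4 bytes only, and the byte sequence is the one given in the example above.

```python
import struct

# Row size subheader signature, as quoted in this PR description.
ROW_SIZE_SIGNATURE = 0xF7F7F7F7


def naive_is_row_size(subheader: bytes) -> bool:
    """Hypothetical check that inspects only the first 4 bytes (little-endian)."""
    (first_word,) = struct.unpack_from("<I", subheader, 0)
    return first_word == ROW_SIZE_SIGNATURE


# Start of a data row: the 8 bytes of the double from the example above.
row = bytes.fromhex("F7F7F7F7F01E513F")
decoded = struct.unpack("<d", row)[0]

print(decoded)                 # the floating point value stored in the row
print(naive_is_row_size(row))  # the data row is taken for a row size subheader
```

The remaining 4 bytes, F0 1E 51 3F, never enter the comparison, which is why ordinary data can pass for a signature.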

Proposed fix

The PR proposes to differentiate between the subheader signatures (which remain 4-byte constants) and subheader types (which are represented by a new enum).

The PR is inspired by the write feature of ReadStat, which also uses only the 4-byte constants for subheader signatures but pads them with the required zeros or ones in the case of 64-bit data files. Similarly, the PR proposes to check not only the significant 4-byte signature but also the expected padding in the case of 64-bit data files.
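The idea of checking the padding can be sketched in Python as follows. This is an illustration under stated assumptions, not the code from the PR: `matches_signature` is a hypothetical function, it handles little-endian input only, and the padding byte is a parameter because this description does not say which signature is padded with zeros and which with ones.

```python
import struct


def matches_signature(subheader: bytes, signature: int,
                      pad_byte: int, is_64bit: bool) -> bool:
    """Compare the start of a little-endian subheader with a 4-byte signature.

    In 64-bit files the 4 bytes following the signature must all equal
    `pad_byte` (0x00 or 0xFF). Which padding belongs to which signature is
    an assumption left to the caller.
    """
    width = 8 if is_64bit else 4
    if len(subheader) < width:
        return False
    (first_word,) = struct.unpack_from("<I", subheader, 0)
    if first_word != signature:
        return False
    if not is_64bit:
        return True
    return subheader[4:8] == bytes([pad_byte]) * 4


data_row = bytes.fromhex("F7F7F7F7F01E513F")  # double from the example
padded = bytes.fromhex("F7F7F7F700000000")    # signature plus zero padding

# The data row is rejected for either padding; the padded prefix is accepted.
print(matches_signature(data_row, 0xF7F7F7F7, 0x00, True))
print(matches_signature(data_row, 0xF7F7F7F7, 0xFF, True))
print(matches_signature(padded, 0xF7F7F7F7, 0x00, True))
```

With this kind of check, the upper 4 bytes F0 1E 51 3F of the example value no longer match any padding, so the row is treated as data.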

@evanmiller
Contributor

Thanks!

@evanmiller evanmiller merged commit 365c743 into WizardMac:dev Apr 24, 2026
12 checks passed
@hpoettker hpoettker deleted the fix-signature-collisions branch April 24, 2026 15:09