gh-151497: Avoid huge pre-allocation for oversized tarfile extended headers by iamsharduld · Pull Request #151498 · python/cpython

iamsharduld · 2026-06-15T10:33:42Z

tarfile reads a member's extended header (a GNU long name/link, or a pax
header) with a single read sized directly by the header's size field:

buf = tarfile.fileobj.read(self._block(self.size))

self.size is taken from the archive and is not validated, so a crafted file of
roughly 512 bytes can claim several gigabytes (or, via base-256 encoding, far
more) and make read() pre-allocate that much memory. This happens on
open/iterate (tarfile.open(...).getmembers()), before any extraction filter
runs. A 512-byte archive claiming 1 GiB drives a ~950 MiB resident allocation;
a claim of 1 TiB raises MemoryError even on high-RAM machines.
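To make the "via base-256 encoding, far more" remark concrete, here is a minimal sketch of how a 12-byte tar size field can hold either an octal string or a GNU base-256 number. The helper names encode_size() and decode_size() are hypothetical and are not tarfile's own functions; the layout (11 octal digits plus NUL, or a 0x80 marker byte followed by 11 big-endian bytes) is assumed from the GNU tar format, and negative values are ignored.

```python
OCTAL_DIGITS = 11  # 12-byte field: 11 octal digits plus a NUL terminator


def encode_size(n: int) -> bytes:
    """Encode a non-negative size as a 12-byte tar size field (illustrative)."""
    if 0 <= n < 8 ** OCTAL_DIGITS:
        # Classic form: zero-padded octal digits, NUL-terminated.
        return f"{n:011o}".encode("ascii") + b"\0"
    # GNU base-256 form: marker byte 0x80, then the value big-endian.
    return b"\x80" + n.to_bytes(11, "big")


def decode_size(field: bytes) -> int:
    """Decode a 12-byte size field produced by encode_size()."""
    if field[0] == 0x80:
        return int.from_bytes(field[1:], "big")
    digits = field.split(b"\0", 1)[0].strip()
    return int(digits or b"0", 8)


max_octal = 8 ** OCTAL_DIGITS - 1   # largest size the octal form can claim
one_tib = 1 << 40                   # needs the base-256 form
print(max_octal, decode_size(encode_size(one_tib)))
```

Under these assumptions the octal form tops out just below 8 GiB, while the base-256 form can claim sizes far beyond any real machine's memory, which is why the claimed size cannot be trusted as an allocation hint.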

This reads the extended-header data in bounded chunks instead, so an oversized
or truncated header can no longer force a huge up-front allocation. The bytes
returned for valid archives are unchanged, and the change is safe for both
seekable and streaming (r|) tars.
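The chunked-read idea described above can be sketched roughly as follows. This is not the code from the PR: bounded_read() and CHUNK are illustrative names, and the 1 MiB value only mirrors the limit discussed in the review.

```python
import io

CHUNK = 1024 * 1024  # assumed limit, mirroring the 1 MiB discussed in this PR


def bounded_read(fileobj, size, chunk_size=CHUNK):
    """Read up to *size* bytes, never asking for more than *chunk_size* at once."""
    if size <= chunk_size:
        # Small requests go straight through; the allocation is already bounded.
        return fileobj.read(size)
    parts = []
    remaining = size
    while remaining > 0:
        data = fileobj.read(min(remaining, chunk_size))
        if not data:
            break  # end of file: return a short result, like read() would
        parts.append(data)
        remaining -= len(data)
    return b"".join(parts)


# A 3-byte "archive" that claims 1 TiB: only the real 3 bytes are returned.
print(bounded_read(io.BytesIO(b"abc"), 1 << 40))
```

Memory use then grows with the data actually present in the file rather than with the size claimed by the header, and the loop needs only read(), so it should work for non-seekable streams as well.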

Issue: tarfile: memory exhaustion via oversized extended-header (GNU long name / pax) size field #151497


vstinner

cc @encukou @cmaloney

vstinner · 2026-06-15T16:05:29Z

+# bounded chunks to avoid a huge up-front allocation when a crafted or
+# truncated archive claims far more data than the file actually contains
+# (gh-151497).
+_EXTHEADER_READ_CHUNK = 1024 * 1024  # 1 MiB


I checked the _safe_read() argument when running test_tarfile. If I ignore the 4 GiB outlier, the size is between 512 bytes and 4 KiB. So a limit of 1 MiB sounds reasonable to me.


encukou · 2026-06-16T11:34:45Z

I wouldn't expect the test suite to contain real-world data.
But, 1 MiB should do fine. It's well over io.DEFAULT_BUFFER_SIZE.

encukou · 2026-06-16T11:36:51Z

+    """Read up to *size* bytes from *fileobj* in bounded chunks.
+
+    Returns the same bytes as ``fileobj.read(size)`` would (including a short
+    result at end of file), but never pre-allocates *size* bytes, so an


Nitpick: it will preallocate size bytes if size is small.

Suggested change

-    result at end of file), but never pre-allocates *size* bytes, so an
+    result at end of file), but limits pre-allocation, so an


iamsharduld · 2026-06-21T18:33:41Z

Thanks @vstinner and @encukou for the review. All addressed in 560b630:

Renamed _ReadSizeRecorder → ReadSizeRecorder and dropped the _ prefixes on crafted_archive() / check().
Decorated ExtendedHeaderMemoryTest with @support.cpython_only; the test now asserts against the private tarfile._EXTHEADER_READ_CHUNK instead of the magic 10 MiB (using assertLessEqual, since a single read of exactly the chunk size is expected).
Reworded the _safe_read docstring: it does pre-allocate for a small size; it just bounds the pre-allocation.

Kept the 1 MiB chunk limit as discussed. PTAL when you get a chance.
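For readers who have not opened the test file, the read-size check described above could look something like this sketch. It is a guess at the approach, not the PR's test code: MaxReadRecorder, read_in_chunks() and the 64-byte limit are all made up for illustration.

```python
import io


class MaxReadRecorder(io.BytesIO):
    """BytesIO that remembers the largest size ever passed to read()."""

    def __init__(self, data=b""):
        super().__init__(data)
        self.max_read = 0

    def read(self, size=-1):
        if size is not None and size > self.max_read:
            self.max_read = size
        return super().read(size)


def read_in_chunks(fileobj, size, chunk):
    """Stand-in for a bounded reader: request at most *chunk* bytes per call."""
    parts = []
    while size > 0:
        data = fileobj.read(min(size, chunk))
        if not data:
            break  # end of file
        parts.append(data)
        size -= len(data)
    return b"".join(parts)


recorder = MaxReadRecorder(b"x" * 100)
payload = read_in_chunks(recorder, 10 ** 12, chunk=64)  # header "claims" ~1 TB
# A request of exactly the chunk size is expected, hence <= rather than <.
assert recorder.max_read <= 64
assert payload == b"x" * 100
```

Asserting on the requested read size rather than on measured memory keeps such a check deterministic, which fits the switch to assertLessEqual against the chunk constant.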

encukou · 2026-06-23T09:32:39Z

Thank you!

miss-islington-app · 2026-06-23T09:44:44Z

Thanks @iamsharduld for the PR, and @encukou for merging it 🌮🎉.. I'm working now to backport this PR to: 3.15.
🐍🍒⛏🤖 I'm not a witch! I'm not a witch!

miss-islington-app · 2026-06-23T09:44:44Z

Thanks @iamsharduld for the PR, and @encukou for merging it 🌮🎉.. I'm working now to backport this PR to: 3.13.
🐍🍒⛏🤖

miss-islington-app · 2026-06-23T09:44:44Z

Thanks @iamsharduld for the PR, and @encukou for merging it 🌮🎉.. I'm working now to backport this PR to: 3.14.
🐍🍒⛏🤖

bedevere-app · 2026-06-23T09:44:57Z

GH-151977 is a backport of this pull request to the 3.15 branch.

bedevere-app · 2026-06-23T09:45:03Z

GH-151978 is a backport of this pull request to the 3.13 branch.

bedevere-app · 2026-06-23T09:45:11Z

GH-151979 is a backport of this pull request to the 3.14 branch.

iamsharduld requested a review from ethanfurman as a code owner June 15, 2026 10:33

bedevere-app Bot added the awaiting review label Jun 15, 2026

bedevere-app Bot mentioned this pull request Jun 15, 2026

tarfile: memory exhaustion via oversized extended-header (GNU long name / pax) size field #151497

Open

vstinner reviewed Jun 15, 2026

encukou reviewed Jun 16, 2026

Address review: drop _ prefixes on test helpers, gate on cpython_only, assert against _EXTHEADER_READ_CHUNK, fix _safe_read docstring (560b630)

cmaloney reviewed Jun 22, 2026

Comment thread Lib/tarfile.py

encukou merged commit da99711 into python:main Jun 23, 2026
54 checks passed

bedevere-app Bot removed the awaiting review label Jun 23, 2026

encukou added the needs backport to 3.13, needs backport to 3.14, and needs backport to 3.15 labels Jun 23, 2026

bedevere-app Bot removed the needs backport to 3.15 label Jun 23, 2026

bedevere-app Bot removed the needs backport to 3.13 label Jun 23, 2026

bedevere-app Bot removed the needs backport to 3.14 label Jun 23, 2026

4 participants