gh-151497: Avoid huge pre-allocation for oversized tarfile extended headers#151498
Conversation
…nded headers
tarfile reads a member's extended header (a GNU long name/link or a pax
header) with a single read sized by the header's size field:
buf = tarfile.fileobj.read(self._block(self.size))
The size is taken from the archive and is not validated, so a ~512-byte
crafted file can claim several gigabytes (or, via base-256 encoding, far
more) and make read() pre-allocate that much memory -- on open/iterate,
before any extraction filter runs.
Read the extended-header data in bounded chunks instead, so an oversized
or truncated header can no longer force a huge allocation. The bytes
returned for valid archives are unchanged.
| # bounded chunks to avoid a huge up-front allocation when a crafted or | ||
| # truncated archive claims far more data than the file actually contains | ||
| # (gh-151497). | ||
| _EXTHEADER_READ_CHUNK = 1024 * 1024 # 1 MiB |
There was a problem hiding this comment.
I checked the _safe_read() argument when running test_tarfile. If I ignore the 4 GiB outlier, the size is between 512 bytes and 4 kiB. So a limit of 1 MiB sounds reasonable to me.
There was a problem hiding this comment.
I wouldn't expect the test suite to contain real-world data.
But, 1 MiB should do fine. It's well over io.DEFAULT_BUFFER_SIZE.
| # bounded chunks to avoid a huge up-front allocation when a crafted or | ||
| # truncated archive claims far more data than the file actually contains | ||
| # (gh-151497). | ||
| _EXTHEADER_READ_CHUNK = 1024 * 1024 # 1 MiB |
There was a problem hiding this comment.
I wouldn't expect the test suite to contain real-world data.
But, 1 MiB should do fine. It's well over io.DEFAULT_BUFFER_SIZE.
| """Read up to *size* bytes from *fileobj* in bounded chunks. | ||
|
|
||
| Returns the same bytes as ``fileobj.read(size)`` would (including a short | ||
| result at end of file), but never pre-allocates *size* bytes, so an |
There was a problem hiding this comment.
Nitpick: it will preallocate size bytes if size is small.
| result at end of file), but never pre-allocates *size* bytes, so an | |
| result at end of file), but limits pre-allocation, so an |
…, assert against _EXTHEADER_READ_CHUNK, fix _safe_read docstring
|
Thanks @vstinner and @encukou for the review — all addressed in 560b630:
Kept the 1 MiB chunk limit as discussed. PTAL when you get a chance. |
|
Thank you! |
|
Thanks @iamsharduld for the PR, and @encukou for merging it 🌮🎉.. I'm working now to backport this PR to: 3.15. |
|
Thanks @iamsharduld for the PR, and @encukou for merging it 🌮🎉.. I'm working now to backport this PR to: 3.13. |
|
Thanks @iamsharduld for the PR, and @encukou for merging it 🌮🎉.. I'm working now to backport this PR to: 3.14. |
|
GH-151977 is a backport of this pull request to the 3.15 branch. |
|
GH-151978 is a backport of this pull request to the 3.13 branch. |
|
GH-151979 is a backport of this pull request to the 3.14 branch. |
tarfilereads a member's extended header (a GNU long name/link, or a paxheader) with a single read sized directly by the header's
sizefield:self.sizeis taken from the archive and is not validated, so a ~512-bytecrafted file can claim several gigabytes (or, via base-256 encoding, far more)
and make
read()pre-allocate that much memory — on open/iterate(
tarfile.open(...).getmembers()), before any extraction filter runs. A512-byte archive claiming 1 GiB drives a ~950 MiB resident allocation; a claim
of 1 TiB raises
MemoryErroreven on high-RAM machines.This reads the extended-header data in bounded chunks instead, so an oversized
or truncated header can no longer force a huge up-front allocation. The bytes
returned for valid archives are unchanged, and the change is safe for both
seekable and streaming (
r|) tars.