
fix(file source): handle concatenated gzip streams #25614

Open
thomasqueirozb wants to merge 9 commits into master from fix/file-source-gzip-multi-stream

@thomasqueirozb thomasqueirozb commented Jun 12, 2026

Member

Summary

The async file source migration (#23612, v0.50.0) replaced flate2::bufread::MultiGzDecoder with async_compression::tokio::bufread::GzipDecoder. MultiGzDecoder handles concatenated gzip streams by design; GzipDecoder stops after the first member unless .multiple_members(true) is called, which was never done. This caused the file source and fingerprinter to silently drop all but the first gzip stream in multi-member files.
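To illustrate the difference between the two behaviours without involving Vector, here is a minimal shell sketch. It assumes the `gzip` CLI and a `head` that supports `-c`; the temporary directory and the `first.gz`/`second.gz` file names are hypothetical. Truncating the concatenated file to the length of its first member mimics a decoder that stops at the first member boundary, while `gzip -dc` on the whole file continues through every member.

```shell
set -eu

dir=$(mktemp -d)

# Two independent gzip members, then their byte-for-byte concatenation.
echo "multiple_1hello" | gzip -c > "$dir/first.gz"
echo "multiple_2world" | gzip -c > "$dir/second.gz"
cat "$dir/first.gz" "$dir/second.gz" > "$dir/multiple-stream.gz"

# Size in bytes of the first member (tr strips padding some wc builds add).
first_len=$(wc -c < "$dir/first.gz" | tr -d ' ')

# Decode only the first member's bytes: what a single-member decoder sees.
first_only=$(head -c "$first_len" "$dir/multiple-stream.gz" | gzip -dc)

# Decode the whole file: the gzip CLI keeps going across member boundaries.
all_members=$(gzip -dc "$dir/multiple-stream.gz")

echo "single-member view:"
echo "$first_only"
echo "multi-member view:"
echo "$all_members"
```

The second line appears only in the multi-member view, which matches the data loss described above.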

The fix introduces gzip_multiple_decoder in vector-common::compression — a thin wrapper that constructs a GzipDecoder with multiple_members enabled — and replaces all bare GzipDecoder::new call sites (file watcher, fingerprinter, aws_s3 source). GzipDecoder::new is now a denied method in clippy.toml to prevent recurrence.

Vector configuration

data_dir: /tmp/vector-test

sources:
  files:
    type: file
    include:
      - /tmp/vector-test/*.gz
    fingerprint:
      strategy: checksum
    read_from: beginning

sinks:
  out:
    type: console
    inputs: [files]
    encoding:
      codec: text

How did you test this PR?

Create a multi-member gzip file and a standard single-member gzip file:

mkdir -p /tmp/vector-test

# multi-stream: two separate gzip members concatenated
echo "multiple_1hello" | gzip -c >  /tmp/vector-test/multiple-stream.gz
echo "multiple_2world" | gzip -c >> /tmp/vector-test/multiple-stream.gz

# single-stream: two lines in one gzip member
printf "single_1hello\nsingle_2world\n" | gzip -c > /tmp/vector-test/single-stream.gz

Run vector with the config above. Expected output (order may vary):

multiple_1hello
multiple_2world
single_1hello
single_2world

Before this fix, multiple_2world was silently dropped because GzipDecoder stopped after the first member. To stress the path further, a third member was appended and all three were read correctly.
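The three-member stress case can be reproduced along these lines. This is a sketch under assumptions: the third member's content (`multiple_3again`) and the temporary directory are hypothetical, and the check uses `gzip -dc` to validate the fixture itself rather than Vector's output.

```shell
set -eu

dir=$(mktemp -d)
fixture="$dir/multiple-stream.gz"

# Build a file made of three concatenated gzip members.
echo "multiple_1hello" | gzip -c >  "$fixture"
echo "multiple_2world" | gzip -c >> "$fixture"
echo "multiple_3again" | gzip -c >> "$fixture"   # hypothetical third member

# Decode every member and count the resulting lines.
decoded=$(gzip -dc "$fixture")
line_count=$(printf '%s\n' "$decoded" | wc -l | tr -d ' ')

echo "$decoded"
echo "lines: $line_count"
```

Copying the resulting file into /tmp/vector-test and running Vector with the config above should then emit one event per line if every member is read.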

Change Type

  • Bug fix
  • New feature
  • Dependencies
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the no-changelog label to this PR.

References

@github-actions github-actions Bot added the domain: sources Anything related to the Vector's sources label Jun 12, 2026
@thomasqueirozb thomasqueirozb marked this pull request as ready for review June 15, 2026 14:16
@thomasqueirozb thomasqueirozb requested a review from a team as a code owner June 15, 2026 14:16

Development

Successfully merging this pull request may close these issues.

File source no longer can decompress Gzip
