Skip to content

Use CRC32C for checksums instead of MD5#6442

Open
nadav-govari wants to merge 7 commits into
mainfrom
nadav/crc32
Open

Use CRC32C for checksums instead of MD5#6442
nadav-govari wants to merge 7 commits into
mainfrom
nadav/crc32

Conversation

@nadav-govari
Copy link
Copy Markdown
Collaborator

@nadav-govari nadav-govari commented May 18, 2026

Description

Spans show that MD5 computations can take on the order of seconds, and were being computed on the non-blocking threads. Luckily we were only doing multipart uploads on massive splits so the vast majority of our uploads were unaffected.

MD5 isn't recommended anymore anyway, and on modern architecture, CRC32C has hardware acceleration. (Theres a software fallback if the architecture doesnt support it: https://docs.rs/crc32c/latest/crc32c/)

As for setting both checksum types on the completed part, this will only return the algorithm that the request was sent with: https://docs.aws.amazon.com/AmazonS3/latest/API/API_UploadPart.html

How was this PR tested?

Local deployment uploads succeed.

@nadav-govari nadav-govari requested a review from a team as a code owner May 18, 2026 20:23
self.s3_client
.create_multipart_upload()
.bucket(self.bucket.clone())
.checksum_algorithm(ChecksumAlgorithm::Crc32C)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah cool ! I did not know it was possible

@fulmicoton-dd
Copy link
Copy Markdown
Collaborator

It is a great idea, but it will break compatibility on most non S3 object storage relying on this client.

@nadav-govari nadav-govari requested a review from a team as a code owner May 19, 2026 15:18
disable_multipart_upload: bool,
// `None` means native AWS S3. Non-S3 flavors (GCS, MinIO, Garage, DigitalOcean) do not
// reliably honor CRC32C, so we only request the SDK to compute & send it for native S3.
flavor: Option<StorageBackendFlavor>,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather have a proper checksum_algorithm: ChecksumAlgorithm on S3StorageConfig defaulting to CRC32C and otherwise MD5 for all other flavors. I'd also like to keep the ability to easily revert to MD5 if something goes wrong. It seems that GCP also supports CRC32C. We have ways to test that.

@trinity-1686a
Copy link
Copy Markdown
Contributor

trinity-1686a commented May 19, 2026

Garage supported CRC32C before its 1.0.0, minio has it since 2015 afaict, i think it's fine to use this algorithm by default for all flavours, and to give a config escape door for any s3 provider that doesn't have this

@nadav-govari
Copy link
Copy Markdown
Collaborator Author

@fulmicoton @trinity-1686a either of you want to take another look?

pub disable_multipart_upload: bool,
#[serde(default)]
pub disable_checksums: bool,
pub checksum_strategy: ChecksumStrategy,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you also update docs/configuration/storage-config.md

Comment on lines +360 to +375
self.checksum_strategy = ChecksumStrategy::Disabled;
}
Some(StorageBackendFlavor::Garage) => {
self.region = Some("garage".to_string());
self.force_path_style_access = true;
self.disable_checksums = true;
self.checksum_strategy = ChecksumStrategy::Disabled;
}
Some(StorageBackendFlavor::Gcs) => {
self.disable_multi_object_delete = true;
self.disable_multipart_upload = true;
self.disable_checksums = true;
self.checksum_strategy = ChecksumStrategy::Disabled;
}
Some(StorageBackendFlavor::MinIO) => {
self.region = Some("minio".to_string());
self.force_path_style_access = true;
self.disable_checksums = true;
self.checksum_strategy = ChecksumStrategy::Disabled;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Garage and MinIO support crc32c (source: git history), Ceph documentation mention crc32c so i assume DO has it too, and guilload said Gcs supports it to, i think we can leave the checksum strategy as default for all flavours unless we end up with reports that it causes issues

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

guilload said Gcs supports it to > I am not sure this is correct, but I think we use a specific client for GS no?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DO support: https://docs.digitalocean.com/reference/api/spaces/ (its supported)
To use CRC32C, you have to supply a GCP-specific header, which we won't be able to do from the S3 SDK. So I'll leave it disabled for GCS and enable it for the other two.

Ok(delete_requests)
}

#[tracing::instrument(skip_all, fields(part_number = part.part_number, num_bytes=part.len()))]
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we really care about tracing here? :-/

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually one of the more useful spans I've used - it's how I found that MD5 was taking a lot of compute. S3 operations can sometimes get rate limited and have variable throughput rates, which becomes more important on a compactor (which is the only work they do)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Debug or trace level is also an option.

#[serde(default)]
pub disable_multipart_upload: bool,
#[serde(default)]
pub disable_checksums: bool,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that does not look backward compatible. I think noone uses it, and they will get a clear error message, so it is probably ok if it is a calculated risk.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nadav, you can make this backward compatible with some serde trick. We can use an alias and then deserialize false to disabled and true to the new default.

@fulmicoton-dd
Copy link
Copy Markdown
Collaborator

Approved, but please update the doc as suggested by @trinity-1686a

Comment thread quickwit/quickwit-config/src/storage_config.rs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants