docs: add per-column Parquet encoding and compression documentation #384
---
title: ALTER TABLE ALTER COLUMN SET/DROP PARQUET
sidebar_label: PARQUET ENCODING/COMPRESSION
description: ALTER TABLE ALTER COLUMN SET/DROP PARQUET SQL keyword reference documentation.
---

Sets or removes per-column Parquet encoding and compression configuration on
existing tables. These settings only affect
[Parquet partitions](/docs/query/export-parquet/#in-place-conversion) and are
ignored for native partitions.

## SET

Override the default Parquet encoding, compression, or both for a column.
The syntax is `SET PARQUET(encoding [, compression[(level)]])`. Use `default`
for the encoding when specifying compression only.

```questdb-sql title="Set encoding only"
ALTER TABLE sensors ALTER COLUMN temperature SET PARQUET(rle_dictionary);
```

```questdb-sql title="Set compression only (with optional level)"
ALTER TABLE sensors ALTER COLUMN temperature SET PARQUET(default, zstd(3));
```

```questdb-sql title="Set both encoding and compression"
ALTER TABLE sensors ALTER COLUMN temperature SET PARQUET(rle_dictionary, zstd(3));
```

## DROP

Reset per-column overrides back to the server defaults.

```questdb-sql title="Reset to defaults"
ALTER TABLE sensors ALTER COLUMN temperature DROP PARQUET;
```

## Supported encodings and codecs

See the [CREATE TABLE](/docs/query/sql/create-table/#supported-encodings)
reference for the full list of supported encodings, compression codecs, and
their valid column types.
### Per-column Parquet encoding and compression

Column definitions may include an optional
`PARQUET(encoding [, compression[(level)]])` clause. These settings only affect
[Parquet partitions](/docs/query/export-parquet/#in-place-conversion) and are
ignored for native partitions. Both encoding and compression are optional — use
`default` for the encoding when specifying compression only.

```questdb-sql title="CREATE TABLE with per-column Parquet config"
CREATE TABLE sensors (
  ts TIMESTAMP,
  temperature DOUBLE PARQUET(rle_dictionary, zstd(3)),
  humidity FLOAT PARQUET(rle_dictionary),
  device_id VARCHAR PARQUET(default, lz4_raw),
  status INT
) TIMESTAMP(ts) PARTITION BY DAY;
```

When omitted, columns use the global defaults: a type-appropriate encoding and
the server-wide compression codec
(`cairo.partition.encoder.parquet.compression.codec`).
#### Supported encodings

| Encoding                | SQL keyword               | Valid column types           |
| ----------------------- | ------------------------- | ---------------------------- |
| Plain                   | `plain`                   | All                          |
| RLE Dictionary          | `rle_dictionary`          | All except BOOLEAN and ARRAY |
| Delta Length Byte Array | `delta_length_byte_array` | STRING, BINARY, VARCHAR      |
| Delta Binary Packed     | `delta_binary_packed`     | INT, LONG, DATE, TIMESTAMP   |

- **Plain** — stores values as-is with no transformation. The simplest
  encoding, with no overhead. Use it as a fallback when data has high
  cardinality and no exploitable patterns (e.g. random floats or UUIDs).
- **RLE Dictionary** — builds a dictionary of unique values and replaces each
  value with a short integer key. The keys are then encoded with a hybrid of
  run-length encoding (for repeated consecutive keys) and bit-packing (for
  non-repeating sequences). Best for low-to-medium cardinality columns (status
  codes, device IDs, symbols). The lower the cardinality, the greater the
  compression.
- **Delta Length Byte Array** — delta-encodes the lengths of consecutive
  string/binary values, then stores the raw bytes back-to-back. This is the
  Parquet-recommended encoding for byte array columns and is always preferred
  over `plain` for STRING, BINARY, and VARCHAR.
- **Delta Binary Packed** — delta-encodes integer values and packs the deltas
  into a compact binary representation. Effective for monotonically increasing
  or slowly changing integer/timestamp columns (e.g. sequential IDs, event
  timestamps).
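The intuition behind `rle_dictionary` and `delta_binary_packed` can be sketched
in a few lines of Python. This is a conceptual toy only, not QuestDB's or
Parquet's actual bit-level format (real RLE/bit-packing and delta headers are
more involved):

```python
def dictionary_rle_encode(values):
    """Toy RLE-dictionary encoder: map values to small integer keys,
    then run-length encode consecutive repeats of each key."""
    dictionary, keys, index = [], [], {}
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        keys.append(index[v])
    runs = []  # list of [key, run_length] pairs
    for k in keys:
        if runs and runs[-1][0] == k:
            runs[-1][1] += 1
        else:
            runs.append([k, 1])
    return dictionary, runs


def delta_encode(values):
    """Toy delta encoder: store the first value, then only the
    differences between consecutive values. Small deltas pack into
    far fewer bits than the original absolute values."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


# Low-cardinality column: long runs collapse to (key, count) pairs.
dictionary, runs = dictionary_rle_encode(["OK", "OK", "OK", "FAIL", "OK"])

# Monotonic timestamps: large absolute values become tiny deltas.
deltas = delta_encode([1_700_000_000, 1_700_000_010, 1_700_000_020])
```

The same pattern explains the guidance above: repeated status codes shrink to a
handful of runs, while sequential timestamps shrink to small, highly
compressible deltas.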
For the full specification of each encoding, see the
[Apache Parquet encodings documentation](https://parquet.apache.org/docs/file-format/data-pages/encodings/).

When no encoding is specified, QuestDB picks a type-appropriate default:
`rle_dictionary` for SYMBOL and VARCHAR, `delta_length_byte_array` for STRING
and BINARY, and `plain` for everything else.

#### Supported compression codecs

> **Contributor:** Same thing as above. We should probably tell the good/bad
> for each method, or link to somewhere where this is explained.
>
> **Author:** Right, 942dfc3 adds this.
| Codec        | SQL keyword    | Level range |
| ------------ | -------------- | ----------- |
| LZ4 Raw      | `lz4_raw`      | --          |
| Zstd         | `zstd`         | 1-22        |
| Snappy       | `snappy`       | --          |
| Gzip         | `gzip`         | 1-9         |
| Brotli       | `brotli`       | 0-11        |
| Uncompressed | `uncompressed` | --          |

- **LZ4 Raw** — extremely fast compression and decompression with a moderate
  ratio. No tunable level. This is the QuestDB default and a good choice for
  most workloads where query throughput matters.
- **Zstd** — excellent balance of compression ratio and speed across its level
  range. Lower levels (1-3) approach LZ4 speed with better ratios; higher
  levels (up to 22) rival Brotli ratios. A strong general-purpose choice when
  storage savings justify slightly slower decompression.
- **Snappy** — very fast compression and decompression with a moderate ratio.
  No tunable level. Similar trade-offs to LZ4 Raw.
- **Gzip** — widely supported, higher compression ratio than Snappy or LZ4 at
  the cost of slower decompression, which reduces query throughput. Higher
  levels (up to 9) improve the ratio but further increase CPU time.
- **Brotli** — achieves some of the highest compression ratios, especially at
  higher levels, but decompression is significantly slower. Best suited for
  cold/archival data where storage savings outweigh query throughput.
- **Uncompressed** — no compression. Fastest decompression (none needed) but
  largest file size. Useful when data is already incompressible.
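The level-vs-CPU trade-off described above can be observed with Python's
stdlib `zlib` (the DEFLATE algorithm behind gzip). This is only an analogy:
QuestDB's zstd/brotli codecs differ in detail, but the pattern of higher
levels buying a better ratio for more CPU time is the same:

```python
import time
import zlib

# Repetitive, log-like data compresses well at any level.
data = ("device=sensor-7 temp=21.5 status=OK\n" * 20_000).encode()

for level in (1, 6, 9):
    start = time.perf_counter()
    compressed = zlib.compress(data, level=level)
    elapsed_ms = (time.perf_counter() - start) * 1000
    ratio = len(data) / len(compressed)
    print(f"level={level} ratio={ratio:.1f}x time={elapsed_ms:.2f}ms")

# Higher levels yield an equal-or-better ratio at a higher CPU cost;
# level 1 is the throughput-friendly end of the dial, level 9 the
# storage-friendly end.
```

The same reasoning applies when choosing between `lz4_raw` (speed), `zstd` at
a low level (balance), and `gzip`/`brotli` at high levels (storage).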
For more details on Parquet compression, see the
[Apache Parquet compression documentation](https://parquet.apache.org/docs/file-format/data-pages/compression/).

To modify encoding or compression on existing tables, see
[ALTER TABLE ALTER COLUMN SET/DROP PARQUET](/docs/query/sql/alter-table-alter-column-parquet-encoding/).

### Casting types

`castDef` - casts the type of a specific column. `columnRef` must reference
> Should we explain what each encoding is good for? Even if just a link to
> authoritative third-party docs. Without that, I have no idea why I would
> choose `delta_binary_packed`, as it is not used by default for numbers, so
> no idea when it can be convenient.

> Yes, definitely! 942dfc3 adds a reference to the official link and a small
> summary per encoding.