refactor: improve execution pipeline and TPCC backends #313

Merged: KKould merged 27 commits into main from refactor--hep_match on Mar 28, 2026

@KKould (Member) commented Mar 27, 2026

What problem does this PR solve?

This branch carries a fairly large refactor, and the old PR body no longer described the actual review surface.

At a high level, this PR improves three areas together:

  • execution/storage paths still had a lot of avoidable cloning, buffering churn, and ownership friction
  • optimizer normalization and binding were still brittle around aliasing, passthrough projections, scalar subqueries, and position-sensitive column matching
  • TPCC benchmarking was hard to rerun consistently across backends, and the docs/results had drifted from the latest measurements

Issue link:

What is changed and how it works?

This PR is best understood as three related tracks.

1. Execution / storage pipeline refactor

  • Introduces the newer ExecArena-style execution model and follows through on it across the executor stack.
  • Unifies executor node APIs and trims a lot of per-node boilerplate.
  • Reuses result tuples and internal buffers more aggressively to reduce allocation churn across the execution pipeline.
  • Pushes more borrowed representations through storage/key/range/column-summary paths instead of repeatedly materializing owned values.
  • Removes the old memo-based optimizer path and moves physical implementation choice toward the direct heuristic pipeline used by the newer rules.
  • Adds LMDB as a first-class storage backend alongside the existing options.
  • Loads histogram sketches together with statistics metadata so optimizer-side statistics are cheaper to consume.

The main effect of this part is lower overhead and cleaner ownership boundaries across planning, execution, and storage.
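As a rough illustration of the tuple and buffer reuse described above, the sketch below has an operator fill a caller-owned tuple instead of returning a freshly allocated one per row. The `Tuple`, `Scan`, and `next_into` names are hypothetical and are not taken from the KiteSQL codebase; the real ExecArena-style API may look quite different.

```rust
/// Hypothetical result tuple: one row of integer values.
#[derive(Debug, Default)]
struct Tuple {
    values: Vec<i64>,
}

/// Hypothetical scan operator over borrowed rows (illustration only).
struct Scan<'a> {
    rows: &'a [Vec<i64>],
    pos: usize,
}

impl<'a> Scan<'a> {
    fn new(rows: &'a [Vec<i64>]) -> Self {
        Scan { rows, pos: 0 }
    }

    /// Writes the next row into `out`, reusing its existing allocation.
    /// Returns `false` once the input is exhausted.
    fn next_into(&mut self, out: &mut Tuple) -> bool {
        match self.rows.get(self.pos) {
            Some(row) => {
                // `clear` keeps the capacity, so steady-state iteration
                // does not allocate once the buffer is large enough.
                out.values.clear();
                out.values.extend_from_slice(row);
                self.pos += 1;
                true
            }
            None => false,
        }
    }
}

/// Consumer that drives the scan with a single reused tuple.
fn sum_first_column(rows: &[Vec<i64>]) -> i64 {
    let mut scan = Scan::new(rows);
    let mut tuple = Tuple::default();
    let mut total = 0;
    while scan.next_into(&mut tuple) {
        total += tuple.values.first().copied().unwrap_or(0);
    }
    total
}
```

The consumer owns exactly one `Tuple` for the whole scan, which is the kind of allocation-churn reduction this track aims for.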

2. Binder / expression / normalization improvements

  • Derives and remaps column positions more directly during binding instead of depending on fragile fallback behavior.
  • Improves scalar-subquery semantics and introduces the corresponding logical/implementation operator support needed by the newer execution path.
  • Makes several normalization rules compare expressions by logical column identity instead of overfitting to transient bound positions.
  • Adds a same_column helper on column refs and extends expression equality helpers so rules can safely ignore harmless column-ref slot differences.
  • Improves collapse rules for adjacent passthrough projections, redundant filters, and group-by-only aggregate layers.
  • Refines normalization scheduling and post-rule application so sort elimination / stream-distinct / column-pruning related rules compose more predictably.
  • Continues the work on predicate pushdown, column pruning, hint propagation, and preindex-aware normalization so more plans can exploit existing order/index information.
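To make the identity-versus-position distinction concrete, here is a minimal sketch of comparing expressions while ignoring column-ref slot differences. Only the `same_column` name comes from this PR; the `ColumnRef` and `Expr` types, their fields, and `eq_ignoring_slots` are assumptions made for illustration and do not mirror the actual definitions in src/catalog/column.rs or src/expression/mod.rs.

```rust
/// Hypothetical column reference: logical identity plus a transient bound slot.
#[derive(Debug, Clone, PartialEq, Eq)]
struct ColumnRef {
    table: String,
    name: String,
    /// Position assigned during binding; may change when plans are rewritten.
    position: usize,
}

impl ColumnRef {
    /// True when both refs name the same logical column, regardless of slot.
    fn same_column(&self, other: &ColumnRef) -> bool {
        self.table == other.table && self.name == other.name
    }
}

/// Hypothetical, heavily simplified expression tree.
#[derive(Debug, Clone, PartialEq, Eq)]
enum Expr {
    Column(ColumnRef),
    Constant(i64),
    Add(Box<Expr>, Box<Expr>),
}

impl Expr {
    /// Structural equality that treats column refs as equal when they
    /// refer to the same logical column, ignoring `position`.
    fn eq_ignoring_slots(&self, other: &Expr) -> bool {
        match (self, other) {
            (Expr::Column(a), Expr::Column(b)) => a.same_column(b),
            (Expr::Constant(a), Expr::Constant(b)) => a == b,
            (Expr::Add(l1, r1), Expr::Add(l2, r2)) => {
                l1.eq_ignoring_slots(l2) && r1.eq_ignoring_slots(r2)
            }
            _ => false,
        }
    }
}

/// Helper for building column expressions in examples.
fn col(table: &str, name: &str, position: usize) -> Expr {
    Expr::Column(ColumnRef {
        table: table.to_string(),
        name: name.to_string(),
        position,
    })
}
```

With derived `PartialEq`, `t.x` bound at slot 0 and `t.x` bound at slot 3 compare as different; the slot-insensitive comparison treats them as the same column, which is what lets a normalization rule match across a rebinding.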

The important reviewer takeaway here is that this is not just code motion: it makes normalization and rebinding materially less brittle for real query shapes, especially around correlated and scalar-subquery cases.
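For the scalar-subquery case, the commit notes in this PR state that an empty scalar subquery returns NULL and a multi-row result is an error, enforced at execution time. A minimal sketch of that cardinality rule, assuming a hypothetical `Value` type and error enum rather than the project's real ones:

```rust
/// Hypothetical value type (illustration only).
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null,
    Int(i64),
}

/// Hypothetical error raised when a scalar subquery yields too many rows.
#[derive(Debug, PartialEq)]
enum ScalarSubqueryError {
    MoreThanOneRow,
}

/// Applies scalar-subquery cardinality rules to a stream of result values:
/// zero rows -> NULL, exactly one row -> that value, more than one -> error.
fn eval_scalar_subquery<I>(rows: I) -> Result<Value, ScalarSubqueryError>
where
    I: IntoIterator<Item = Value>,
{
    let mut iter = rows.into_iter();
    let first = match iter.next() {
        Some(value) => value,
        // Empty result: the subquery evaluates to NULL.
        None => return Ok(Value::Null),
    };
    // Pull at most one extra row; there is no need to drain the input.
    if iter.next().is_some() {
        return Err(ScalarSubqueryError::MoreThanOneRow);
    }
    Ok(first)
}
```

Checking lazily against the iterator means the error surfaces as soon as a second row appears, without buffering the whole subquery result.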

3. TPCC backend split, runner, and docs refresh

  • Splits TPCC backend handling into clearer LMDB / RocksDB / SQLite / dual implementations.
  • Adds SQLite profile handling (balanced and practical) so the benchmark matrix can compare multiple SQLite operating modes directly.
  • Adds scripts/run_tpcc_matrix.sh to run the performance matrix in one shot and write timestamped raw logs plus a summary file under tpcc/results/<timestamp>/.
  • Adds duplicate-key retry handling in the matrix runner for the known TPCC history.h_date collision pattern so a single backend can be rerun from a fresh database without restarting the full matrix.
  • Expands Makefile TPCC targets to cover the new backend/profile combinations.
  • Refreshes README.md and tpcc/README.md with the latest measured results and documents the benchmark runner and the duplicate-key caveat.

Latest 720s comparison currently documented in the branch:

  • KiteSQL LMDB: 53510 TpmC
  • KiteSQL RocksDB: 32248 TpmC
  • SQLite balanced: 36273 TpmC
  • SQLite practical: 35516 TpmC

Code changes

  • Has Rust code change
  • Has CI related scripts change

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Manual test:

  • cargo build -p tpcc --release
  • TPCC_DUPLICATE_RETRY=1 ./scripts/run_tpcc_matrix.sh

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Note for reviewer

This PR is large, but the changes cluster fairly cleanly. A good review order is:

  1. execution/storage infrastructure changes
  2. binder/expression/normalization behavior changes
  3. TPCC backend split, runner, and docs

Representative files for each area:

  • execution/storage: src/execution/**, src/storage/**, src/db.rs
  • optimizer/binder: src/binder/select.rs, src/expression/mod.rs, src/catalog/column.rs, src/optimizer/rule/normalization/**, src/optimizer/heuristic/**
  • TPCC: tpcc/src/backend/**, tpcc/src/main.rs, scripts/run_tpcc_matrix.sh, tpcc/README.md, README.md

KKould added 22 commits March 28, 2026 04:55
- add LMDB storage backend
- restore scalar subqueries in WHERE to join-aware binding
- enforce scalar subquery cardinality at execution time
- return NULL for empty scalar subqueries and error on multi-row results
@KKould force-pushed the refactor--hep_match branch from c9564fa to d4063b0 on March 27, 2026 20:55
@KKould self-assigned this on Mar 28, 2026
@KKould added the enhancement (New feature or request) label on Mar 28, 2026
@KKould merged commit f77e71e into main on Mar 28, 2026
14 checks passed