refactor: improve execution pipeline and TPCC backends by KKould · Pull Request #313 · KipData/KiteSQL

KKould · 2026-03-27T18:38:41Z

What problem does this PR solve?

This branch carries a fairly large refactor, and the old PR body no longer described the actual review surface.

At a high level, this PR improves three areas together:

execution/storage paths still had a lot of avoidable cloning, buffering churn, and ownership friction
optimizer normalization and binding were still brittle around aliasing, passthrough projections, scalar subqueries, and position-sensitive column matching
TPCC benchmarking was hard to rerun consistently across backends, and the docs/results had drifted from the latest measurements

Issue link:

What is changed and how it works?

This PR is best understood as three related tracks.

1. Execution / storage pipeline refactor

Introduces the newer ExecArena-style execution model and follows through on it across the executor stack.
Unifies executor node APIs and trims a lot of per-node boilerplate.
Reuses result tuples and internal buffers more aggressively to reduce allocation churn across the execution pipeline.
Pushes more borrowed representations through storage/key/range/column-summary paths instead of repeatedly materializing owned values.
Removes the old memo-based optimizer path and moves physical implementation choice toward the direct heuristic pipeline used by the newer rules.
Adds LMDB as a first-class storage backend alongside the existing options.
Loads histogram sketches together with statistics metadata so optimizer-side statistics are cheaper to consume.

The main effect of this part is lower overhead and cleaner ownership boundaries across planning, execution, and storage.
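As a rough illustration of the buffer-reuse idea above (not the actual ExecArena code), the sketch below has a row source write into a caller-owned tuple buffer instead of returning a fresh `Vec` per row. `CountingSource`, `next_into`, and `sum_second_column` are hypothetical names made up for this example.

```rust
/// Hypothetical row source used only for illustration; not a KiteSQL type.
/// It produces rows of the form (n, n * 2) for n in next..end.
struct CountingSource {
    next: i64,
    end: i64,
}

impl CountingSource {
    /// Writes the next row into `out`, reusing its existing allocation.
    /// Returns false once the source is exhausted.
    fn next_into(&mut self, out: &mut Vec<i64>) -> bool {
        if self.next >= self.end {
            return false;
        }
        out.clear(); // drops the old values but keeps the capacity
        out.push(self.next);
        out.push(self.next * 2);
        self.next += 1;
        true
    }
}

/// Drains the source through one reusable tuple buffer and sums column 1.
fn sum_second_column(source: &mut CountingSource) -> i64 {
    let mut tuple = Vec::with_capacity(2); // allocated once for the whole scan
    let mut total = 0;
    while source.next_into(&mut tuple) {
        total += tuple[1];
    }
    total
}
```

The trade-off is that the consumer only ever sees the current row, so anything that must outlive the next call has to be copied out explicitly.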

2. Binder / expression / normalization improvements

Derives and remaps column positions more directly during binding instead of depending on fragile fallback behavior.
Improves scalar-subquery semantics and introduces the corresponding logical/implementation operator support needed by the newer execution path.
Makes several normalization rules compare expressions by logical column identity instead of overfitting to transient bound positions.
Adds a same_column helper on column refs and extends expression equality helpers so rules can safely ignore harmless column-ref slot differences.
Improves collapse rules for adjacent passthrough projections, redundant filters, and group-by-only aggregate layers.
Refines normalization scheduling and post-rule application so sort elimination / stream-distinct / column-pruning related rules compose more predictably.
Continues the work on predicate pushdown, column pruning, hint propagation, and preindex-aware normalization so more plans can exploit existing order/index information.
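A minimal sketch of what comparing by logical column identity can look like. Only the `same_column` name comes from this PR; the `ColumnRef` struct and its fields are simplified assumptions for illustration and do not mirror the real type in src/catalog/column.rs.

```rust
/// Hypothetical, simplified column reference; not KiteSQL's actual type.
#[derive(Debug, Clone, PartialEq, Eq)]
struct ColumnRef {
    table: String,
    name: String,
    /// Slot assigned during binding; it can change when a plan is rewritten.
    position: usize,
}

impl ColumnRef {
    /// True when both refs name the same logical column,
    /// regardless of the slot each one is currently bound to.
    fn same_column(&self, other: &ColumnRef) -> bool {
        self.table == other.table && self.name == other.name
    }
}
```

With a helper like this, a rule can treat `t.a` bound at slot 0 and `t.a` bound at slot 3 as the same column, while the derived `==` still distinguishes them when the slot matters.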

The important reviewer takeaway here is that this is not just code motion: it makes normalization and rebinding materially less brittle for real query shapes, especially around correlated and scalar-subquery cases.
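The scalar-subquery rule described in the commit log (NULL for an empty result, an error for more than one row) can be sketched as below. `Value`, `ScalarSubqueryError`, and `scalar_subquery_result` are hypothetical names for this example, not the executor's real types.

```rust
/// Hypothetical value type for illustration only.
#[derive(Debug, PartialEq)]
enum Value {
    Null,
    Int(i64),
}

/// Hypothetical error type for illustration only.
#[derive(Debug, PartialEq)]
enum ScalarSubqueryError {
    MoreThanOneRow,
}

/// Collapses the rows of a scalar subquery into a single value:
/// zero rows yield NULL, one row yields its value, two or more rows are an error.
fn scalar_subquery_result<I>(rows: I) -> Result<Value, ScalarSubqueryError>
where
    I: IntoIterator<Item = Value>,
{
    let mut iter = rows.into_iter();
    // At most two rows are pulled; that is enough to detect a violation.
    match (iter.next(), iter.next()) {
        (None, _) => Ok(Value::Null),
        (Some(value), None) => Ok(value),
        (Some(_), Some(_)) => Err(ScalarSubqueryError::MoreThanOneRow),
    }
}
```

Checking cardinality while pulling rows means the violation surfaces at execution time, which matches the "enforce scalar subquery cardinality at execution time" commit.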

3. TPCC backend split, runner, and docs refresh

Splits TPCC backend handling into clearer LMDB / RocksDB / SQLite / dual implementations.
Adds SQLite profile handling (balanced and practical) so the benchmark matrix can compare multiple SQLite operating modes directly.
Adds scripts/run_tpcc_matrix.sh to run the performance matrix in one shot and write timestamped raw logs plus a summary file under tpcc/results/<timestamp>/.
Adds duplicate-key retry handling in the matrix runner for the known TPCC history.h_date collision pattern so a single backend can be rerun from a fresh database without restarting the full matrix.
Expands Makefile TPCC targets to cover the new backend/profile combinations.
Refreshes README.md and tpcc/README.md with the latest measured results and documents the benchmark runner and the duplicate-key caveat.
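The real retry handling lives in the shell runner, so the following Rust sketch only illustrates the control flow under stated assumptions: a run that fails with a duplicate-key error is retried (the caller is expected to start from a fresh database), and any other failure is reported immediately. `RunError` and `run_with_duplicate_retry` are hypothetical names.

```rust
/// Hypothetical outcome classification for one benchmark run.
#[derive(Debug, PartialEq)]
enum RunError {
    DuplicateKey,
    Other(String),
}

/// Calls `run_once` with the 1-based attempt number until it succeeds,
/// fails with a non-duplicate error, or `max_attempts` is reached.
/// Returns the attempt number that succeeded. The run is always tried at least once.
fn run_with_duplicate_retry<F>(max_attempts: u32, mut run_once: F) -> Result<u32, RunError>
where
    F: FnMut(u32) -> Result<(), RunError>,
{
    let mut attempt = 1;
    loop {
        match run_once(attempt) {
            Ok(()) => return Ok(attempt),
            // Only the known duplicate-key collision is retried.
            Err(RunError::DuplicateKey) if attempt < max_attempts => attempt += 1,
            Err(err) => return Err(err),
        }
    }
}
```

Retrying only the duplicate-key case keeps genuine failures visible instead of hiding them behind repeated reruns.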

Latest 720s comparison currently documented in the branch:

KiteSQL LMDB: 53510 TpmC
KiteSQL RocksDB: 32248 TpmC
SQLite balanced: 36273 TpmC
SQLite practical: 35516 TpmC

Code changes

Has Rust code change
Has CI related scripts change

Check List

Tests

Unit test
Integration test
Manual test (add detailed scripts or steps below)
No code

Manual test:

cargo build -p tpcc --release
TPCC_DUPLICATE_RETRY=1 ./scripts/run_tpcc_matrix.sh

Side effects

Performance regression: Consumes more CPU
Performance regression: Consumes more Memory
Breaking backward compatibility

Note for reviewer

This PR is large, but the changes cluster fairly cleanly. A good review order is:

execution/storage infrastructure changes
binder/expression/normalization behavior changes
TPCC backend split, runner, and docs

Representative files for each area:

execution/storage: src/execution/**, src/storage/**, src/db.rs
optimizer/binder: src/binder/select.rs, src/expression/mod.rs, src/catalog/column.rs, src/optimizer/rule/normalization/**, src/optimizer/heuristic/**
TPCC: tpcc/src/backend/**, tpcc/src/main.rs, scripts/run_tpcc_matrix.sh, tpcc/README.md, README.md

KKould added 22 commits March 28, 2026 04:55

refactor: optimizer normalization passes

b309351

refactor: normalization scheduling to avoid matcher scans

fccaf4c

chore: output_columns reduces the number of Vecs constructed

e7c31ed

refactor: derive column positions during binding instead of BindPosition fallback

92fe1b1

refactor: avoid redundant operator clones in normalization rules

78b9d78

feat: add LMDB storage and fix scalar subquery semantics

3849cf5

- add LMDB storage backend
- restore scalar subqueries in WHERE to join-aware binding
- enforce scalar subquery cardinality at execution time
- return NULL for empty scalar subqueries and error on multi-row results

refactor: borrow range bounds and reuse delete buffers

b86c909

refactor: make key encoding paths borrowed and reusable

da33716

refactor: storage buffers and trim optimizer column-reference overhead

5125703

refactor: column pruning to reuse borrowed column summaries

d122ba8

refactor: remove memo and choose physical impls directly

961e3b5

refactor: optimizer hint propagation and preindex implementation rules

d77f875

refactor: optimizer annotate pass to merge post-rules and restore borrowed scan hints

ec39d35

optimizer: avoid schema diff in column pruning remap

e4fb058

optimizer: avoid recursive expr walks in column pruning

6477e49

perf: prebind range comparison evaluators

fe02eac

refactor: new ExecArena-based executor model

8827abc

fix: tpcc segfault caused by bump-backed executor drop order

90aea99

refactor: reuse arena result tuple across execution pipeline

18aed5f

chore: uncheck for decode tuple primary key when query

f2de8d7

refactor(tpcc): split backends and improve benchmark tooling

7446641

refactor(execution): unify executor node API and trim test helpers

d4063b0

KKould force-pushed the refactor--hep_match branch from c9564fa to d4063b0 on March 27, 2026 20:55

KKould added 5 commits March 28, 2026 16:51

chore: codefmt

a65ce17

refactor(stats): load sketches alongside histograms in StatisticsMeta

0d7e0ef

feat: improve normalization rules and add tpcc benchmark runner

383e565

chore(release): bump kite_sql and macros crate to 0.2.0

e2c4e12

feat(cargo): gate native backends behind rocksdb/lmdb features

55e9ddf

KKould self-assigned this Mar 28, 2026

KKould added the enhancement label Mar 28, 2026

KKould merged commit f77e71e into main Mar 28, 2026
14 checks passed
