much cleaned up antithesis changes #36574
Draft
DAlperin wants to merge 21 commits into
Adds an 'antithesis' build flavor to mzbuild and ci-builder, on the same axis as 'coverage' and 'sanitizer'. When enabled, the cargo build for binaries copied into mzbuild images uses the antithesis-sdk rustflags and ships full debug symbols for symbolization. Pulls antithesis-sdk into the workspace and into the storage, catalog, and persist-client member crates so subsequent commits can wrap panic and invariant sites in SDK assertions.
…is-sdk assertions
Adds SDK assertion hooks at known panic / invariant sites that
Antithesis fault injection can exercise:
- src/storage/src/source/kafka.rs (kafka source startup and
  offset-known invariants)
- src/storage/src/source/reclock.rs (reclock mint progress)
- src/storage/src/source/reclock/compat.rs (frontier shape on the
  remap shard)
- src/storage/src/source/mysql/replication/partitions.rs (GTID
  monotonicity in the mysql source)
- src/storage/src/upsert/types.rs (upsert state-machine invariants on
  ancient/tombstoned keys)
- src/storage/src/upsert_continual_feedback*.rs (assert on tombstone
  removal and key rehydration paths)
- src/persist-client/src/internal/apply.rs (CaS monotonicity)
- src/catalog/src/durable/persist.rs (catalog epoch fencing)
Assertions are no-ops in non-antithesis builds.
Adds a 'pool-backed' execution mode to parallel_workload where the
Database wraps a pre-existing cluster (typically bootstrapped by an
external compose like Antithesis) instead of allocating its own.
- Database/Cluster gain 'existing_cluster_name' and 'is_pool_backed'
so framework-owned actions (CreateCluster, ResizeCluster,
ScaleCluster) skip pool clusters they don't own.
- 'name_scope' lets multiple parallel_workload invocations coexist
against the same Materialize without colliding on object names.
- mzcompose Clusterd treats scratch_directory=None as a real signal:
omit --scratch-directory entirely so clusterd falls back to
RocksDB's mem_env, matching production replica shape.
- Drop a handful of feature flags from the random-LD-flag pool that
parallel_workload no longer exercises cleanly.
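As a rough illustration of two of the points above, the sketch below shows how ownership-aware actions and the scratch_directory=None signal could look. The Cluster dataclass, resizable_clusters and clusterd_args are hypothetical stand-ins, not the framework's actual classes; only 'is_pool_backed' and --scratch-directory come from the description.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cluster:
    # Hypothetical stand-in for the framework's Cluster object.
    name: str
    is_pool_backed: bool = False


def resizable_clusters(clusters: List[Cluster]) -> List[Cluster]:
    """Return only clusters the framework owns; pool clusters are skipped."""
    return [c for c in clusters if not c.is_pool_backed]


def clusterd_args(scratch_directory: Optional[str]) -> List[str]:
    """Build a clusterd argument list (hypothetical helper).

    None means "omit the flag entirely", not "pass an empty value".
    """
    args = ["clusterd"]
    if scratch_directory is not None:
        args.append(f"--scratch-directory={scratch_directory}")
    return args
```

Treating None as "no flag" keeps the decision in one place, so callers cannot accidentally pass an empty path.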
Today, every per-action try/except site under Scenario.Kill /
Scenario.ZeroDowntimeDeploy is an unconditional swallow:
    try:
        source.create(exe)
        …
    except:
        if exe.db.scenario not in (Scenario.Kill, Scenario.ZeroDowntimeDeploy):
            raise
Under those scenarios, *every* exception is silently dropped — bare
`except:` catches AssertionError, KeyError, TypeError, the whole
mess. The intent was 'tolerate connection drops from the kill thread';
the implementation also tolerates real correctness bugs.
This commit adds is_fault_shaped(exc) — a predicate that returns True
only for messages matching connection-drop / DNS / broker-transport /
Mz-restart shapes (the things the kill thread actually produces) — and
threads it through every swallow site. Non-fault-shaped exceptions
re-raise as before.
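A minimal sketch of what such a predicate could look like. The message patterns and the run_tolerating_faults wrapper are illustrative assumptions, not the list or call sites used in this commit; only the name is_fault_shaped and its role come from the description above.

```python
import re

# Hypothetical message patterns; the commit's actual list is not shown here.
_FAULT_PATTERNS = [
    r"connection (refused|reset|closed)",
    r"server closed the connection unexpectedly",
    r"name or service not known",   # DNS failure
    r"broker transport failure",    # Kafka client transport error
    r"terminating connection",      # server restart
]
_FAULT_RE = re.compile("|".join(_FAULT_PATTERNS), re.IGNORECASE)


def is_fault_shaped(exc: BaseException) -> bool:
    """Return True only if the exception message looks like an injected fault."""
    return _FAULT_RE.search(str(exc)) is not None


def run_tolerating_faults(action, tolerate: bool) -> bool:
    """Run action(); swallow only fault-shaped errors when tolerate is set.

    Returns True if the action completed, False if a fault was swallowed.
    Anything that is not fault-shaped is re-raised unchanged.
    """
    try:
        action()
        return True
    except Exception as exc:
        if tolerate and is_fault_shaped(exc):
            return False
        raise
```

Catching Exception rather than using a bare except also lets KeyboardInterrupt and SystemExit propagate, which a bare except would intercept.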
Affected sites (action.py): SQLsmithAction (subprocess fail),
AlterIcebergSinkFromAction, AlterKafkaSinkFromAction, DropRoleAction,
DropClusterAction, DropClusterReplicaAction, GrantPrivilegesAction,
RevokePrivilegesAction, CreateKafkaSourceAction, CreateMySqlSourceAction,
CreatePostgresSourceAction, CreateSqlServerSourceAction, HttpPostAction
(2 sites). Plus executor.py's WS-executor connection-error handler.
Affects every parallel_workload consumer, not just the Antithesis
driver. Existing CI (`test/parallel-workload/mzcompose.py`) that
runs the framework's KillAction worker is the other consumer of these
swallows; the predicate's patterns cover the same shapes the KillAction
produces (connection drop on materialized restart), so behavior is
preserved for them while real exceptions stop being silently dropped.
Adds the build-side scaffolding needed to run an Antithesis test
of Materialize:
- test/antithesis/Makefile             local build + push wrappers
- test/antithesis/mzcompose.py         authoritative service graph
- test/antithesis/export-compose.py    mzcompose.py -> docker-compose.yaml
- test/antithesis/export-env.py        fingerprint .env generator
- test/antithesis/push-antithesis.py   image push to the Antithesis registry
- test/antithesis/config/              mzbuild image carrying the exported
                                       docker-compose.yaml + image refs
- test/antithesis/workload/            mzbuild image carrying the workload
                                       runner, its dependencies, and a
                                       lightweight stub of materialize/
                                       mzcompose so workload code can
                                       import it inside the container
- test/antithesis/fault-orchestrator/  quiet/active window orchestrator
                                       that drives fault injection from
                                       inside the test
- test/antithesis/AGENTS.md            orientation for future contributors
No actual workload drivers, helpers, or properties yet — those come in
subsequent commits.
Wires the Antithesis build into the nightly Buildkite pipeline:
- ci/test/build-antithesis.sh          builds antithesis-flavored
                                       images and pushes them
- ci/nightly/pipeline.template.yml     nightly job entry
- ci/mkpipeline.py / ci/test/build.py  treat CI_ANTITHESIS as a
                                       scoped build flavor on the
                                       same axis as coverage and
                                       sanitizer
- ci/test/lint-main/checks/
    check-antithesis-compose.sh        guards the exported
                                       docker-compose.yaml against
                                       drift from mzcompose.py
    check-pipeline.sh                  adds the new check to lint
Adds the helper library shared by all workload drivers:
- helper_pg / helper_pg_source / helper_pg_upstream      pg-side helpers used both for the Materialize
                                                         SQL client and for the upstream Postgres CDC source
- helper_mysql / helper_mysql_source                     mysql-side upstream helpers and source DDL
- helper_kafka                                           kafka topic / producer helpers
- helper_none_source / helper_upsert_source              envelope-shape source helpers
- helper_testdrive                                       in-container testdrive runner
- helper_random / helper_table_mv / helper_source_stats  small utilities used across drivers
- helper_logging                                         per-invocation correlation IDs and lifecycle lines
- anytime_health_check.sh                                always-on Antithesis health probe
Adds the kafka-source and upsert workload drivers and their properties:
- first_select_upsert_implementation              chooses upsert_v1 or upsert_v2 for the invocation
- parallel_driver_kafka_none_envelope             envelope-NONE no-data-loss / no-duplication
- parallel_driver_upsert_latest_value             key reflects latest value under fault
- singleton_driver_upsert_state_rehydration       state rehydrates correctly after a clusterd restart
- anytime_kafka_frontier_monotonic                source frontier never regresses
- anytime_kafka_offset_known_not_below_committed  upstream offset invariant
- anytime_kafka_source_resumes_after_fault        liveness after broker or clusterd fault
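To make the "frontier never regresses" property concrete, here is a small sketch of the comparison such a check performs. It assumes frontiers are observed as plain integers in polling order; how the driver actually obtains them from Materialize is not shown, and frontier_regressions is a hypothetical name.

```python
from typing import Iterable, List, Optional, Tuple


def frontier_regressions(observations: Iterable[int]) -> List[Tuple[int, int]]:
    """Return (previous, current) pairs where the frontier moved backwards.

    An empty result means the observed sequence was monotonic
    (non-decreasing); equal consecutive values are allowed.
    """
    regressions: List[Tuple[int, int]] = []
    previous: Optional[int] = None
    for current in observations:
        if previous is not None and current < previous:
            regressions.append((previous, current))
        previous = current
    return regressions
```

Returning the offending pairs rather than a boolean gives the property failure enough detail to report which observation regressed.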
Adds mysql-source workload drivers and their properties:
- first_mysql_replica_setup            bootstraps the upstream MySQL primary and replica
- parallel_driver_mysql_cdc            InnoDB CDC source correctness under concurrent DML
- parallel_driver_mysql_myisam         MyISAM table behavior (no-data-loss)
- anytime_mysql_source_no_gtid_errors  GTID monotonicity at the source level
Adds pg-source workload drivers:
- first_pg_cdc_setup                           bootstraps the upstream Postgres + replication slot
- parallel_driver_pg_cdc                       pg-CDC correctness under concurrent upstream writes
- singleton_driver_pg_cdc_testdrive            runs a pg-cdc testdrive suite once per invocation
- parallel_driver_parallel_workload            drives the existing parallel_workload library against
                                               a per-invocation pool-backed cluster
- parallel_driver_upsert_ancient_key_writable  cross-invocation property: ancient keys remain
                                               writable after long quiescence
Drivers that pair workload-side observations with SUT-side assertion
anchors introduced in the storage/persist/catalog instrumentation
commit:
- singleton_driver_catalog_recovery_consistency  catalog recovery under environmentd fault
- parallel_driver_strict_serializable_reads      persist strict-serializable read property
- parallel_driver_mv_reflects_table_updates      materialized views reflect base-table writes eventually
- anytime_fault_recovery_exercised               liveness signal: the SUT is actually faulting and recovering
…lizeInc#11200 / MaterializeInc#11224)
Two parallel_driver shapes that collectively cover the two
peek-sequencing variants of the read-hold downgrade bug:
- first_explicit_txn_setup                         seed-private table bootstrap for the explicit-txn driver
- parallel_driver_explicit_txn_no_since_violation  BEGIN / 8x SELECT alternating table+MV / COMMIT;
                                                   exercises the in_immediate_multi_stmt_txn path
- first_pw_hot_objects_setup                       bootstrap for the hot-objects driver
- helper_pw_hot                                    shared helpers for the hot-objects driver
- parallel_driver_pw_hot_objects                   many drivers racing against a small fixed object pool;
                                                   exercises the single-statement peek-sequencing path
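The explicit-txn shape (BEGIN, eight SELECTs alternating between a table and an MV, COMMIT) can be sketched as a statement generator. The function name, the object names and the `SELECT count(*)` form are assumptions for illustration; the driver's actual queries are not shown in this description.

```python
from typing import List


def explicit_txn_statements(table: str, mv: str, reads: int = 8) -> List[str]:
    """Build one explicit transaction that alternates reads between
    a table and a materialized view (hypothetical query shape)."""
    statements = ["BEGIN"]
    for i in range(reads):
        # Even-numbered reads hit the table, odd-numbered reads hit the MV.
        target = table if i % 2 == 0 else mv
        statements.append(f"SELECT count(*) FROM {target}")
    statements.append("COMMIT")
    return statements
```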
…figs
Splits the single all-services compose into one configurable composition
that emits per-group docker-compose YAMLs:
- kafka              kafka stack + multi-replica clusterd
- pg-cdc             postgres-source + multi-replica clusterd
- mysql-cdc          mysql primary+replica + multi-replica clusterd
- parallel-workload  clusterd pool + multi-replica antithesis_cluster
- catalog            single clusterd, no upstream sources
- combined           kitchen sink (every service, every driver)
Why: Antithesis runs on a single core, so service count is the main
contended resource. Cutting unused upstreams (kafka, mysql, postgres)
out of stacks that don't exercise them gives the relevant workload more
hypervisor time and tightens per-property signal.
How:
- test/antithesis/groups.yaml is the single source of truth for
which services + setup + drivers + anytime scripts belong to each
group. anytime_health_check and anytime_fault_recovery_exercised
are auto-added to every group.
- test/antithesis/export-compose.py takes --group=NAME and filters
services + workload depends_on, injects ANTITHESIS_WORKLOAD_GROUP
on the workload service.
- The workload Dockerfile stages all scripts; the entrypoint reads
ANTITHESIS_WORKLOAD_GROUP and copies only the selected scripts
into /opt/antithesis/test/v1/materialize/ so Test Composer
doesn't see drivers that aren't in scope.
- One mzbuild config image per group (test/antithesis/configs/<group>/);
push-antithesis.py and ci/test/build.py iterate over them.
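The filtering described in the list above can be sketched roughly as follows. This is not the code in export-compose.py: the groups.yaml schema (keys "services", "setup", "drivers", "anytime"), the "workload" service name, and the list-style depends_on are all assumptions made for the example.

```python
from typing import Any, Dict, List

# Scripts the description says are auto-added to every group.
ALWAYS_ON = ["anytime_health_check", "anytime_fault_recovery_exercised"]


def scripts_for_group(groups: Dict[str, Any], name: str) -> List[str]:
    """Collect setup, driver, and anytime scripts for one group."""
    group = groups[name]
    scripts: List[str] = []
    for key in ("setup", "drivers", "anytime"):
        scripts.extend(group.get(key, []))
    for script in ALWAYS_ON:
        if script not in scripts:
            scripts.append(script)
    return scripts


def filter_compose(compose: Dict[str, Any], groups: Dict[str, Any], name: str,
                   workload_service: str = "workload") -> Dict[str, Any]:
    """Return a copy of the compose dict restricted to one group's services."""
    keep = set(groups[name]["services"]) | {workload_service}
    services: Dict[str, Any] = {}
    for svc, spec in compose["services"].items():
        if svc not in keep:
            continue
        spec = dict(spec)  # shallow copy so the input is not mutated
        if svc == workload_service:
            spec["depends_on"] = [d for d in spec.get("depends_on", []) if d in keep]
            env = dict(spec.get("environment", {}))
            env["ANTITHESIS_WORKLOAD_GROUP"] = name
            spec["environment"] = env
        services[svc] = spec
    return {**compose, "services": services}
```

Dropping filtered-out services from depends_on matters because Compose rejects a dependency on a service that is not defined in the file.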
Bookmark dov/antithesis-stack still points at this commit; previous
HEAD becomes commit 12.
Land it; CI nightly schedule (run all groups every night vs. rotate)
is deferred — implementation is configurable via the buildkite step's
ANTITHESIS_WORKLOAD_GROUPS env var.
…ssion-drain signals