
much cleaned up antithesis changes #36574

Draft
DAlperin wants to merge 21 commits into MaterializeInc:main from DAlperin:dov/antithesis-stack



DAlperin added 13 commits May 15, 2026 19:34
Adds an 'antithesis' build flavor to mzbuild and ci-builder, on the same
axis as 'coverage' and 'sanitizer'. When enabled, the cargo build for
binaries copied into mzbuild images uses the antithesis-sdk rustflags
and ships full debug symbols for symbolization.

Pulls antithesis-sdk into the workspace and into the storage, catalog,
and persist-client member crates so subsequent commits can wrap panic
and invariant sites in SDK assertions.
…is-sdk assertions

Adds SDK assertion hooks at known panic / invariant sites that
Antithesis fault injection can exercise:

  - src/storage/src/source/kafka.rs              (kafka source startup
                                                  and offset-known
                                                  invariants)
  - src/storage/src/source/reclock.rs            (reclock mint progress)
  - src/storage/src/source/reclock/compat.rs     (frontier shape on the
                                                  remap shard)
  - src/storage/src/source/mysql/replication/partitions.rs
                                                 (GTID monotonicity in
                                                  the mysql source)
  - src/storage/src/upsert/types.rs              (upsert state-machine
                                                  invariants on
                                                  ancient/tombstoned
                                                  keys)
  - src/storage/src/upsert_continual_feedback*.rs
                                                 (assert on tombstone
                                                  removal and key
                                                  rehydration paths)
  - src/persist-client/src/internal/apply.rs     (CaS monotonicity)
  - src/catalog/src/durable/persist.rs           (catalog epoch fencing)

Assertions are no-ops in non-antithesis builds.

Adds a 'pool-backed' execution mode to parallel_workload where the
Database wraps a pre-existing cluster (typically bootstrapped by an
external compose like Antithesis) instead of allocating its own.

  - Database/Cluster gain 'existing_cluster_name' and 'is_pool_backed'
    so framework-owned actions (CreateCluster, ResizeCluster,
    ScaleCluster) skip pool clusters they don't own.
  - 'name_scope' lets multiple parallel_workload invocations coexist
    against the same Materialize without colliding on object names.
  - mzcompose Clusterd treats scratch_directory=None as a real signal:
    omit --scratch-directory entirely so clusterd falls back to
    RocksDB's mem_env, matching production replica shape.
  - Drop a handful of feature flags from the random-LD-flag pool that
    parallel_workload no longer exercises cleanly.
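
The scratch_directory=None and name_scope behavior described above can be sketched roughly as follows. This is a minimal illustration, not the mzcompose implementation: the function names clusterd_args and scoped_name, the flag formatting, and the underscore-prefix naming scheme are all assumptions made for the example.

```python
from typing import List, Optional


def clusterd_args(scratch_directory: Optional[str]) -> List[str]:
    """Build a hypothetical clusterd argument list.

    None is treated as a real signal: the flag is omitted entirely
    instead of being passed with an empty or default value.
    """
    args = ["clusterd"]
    if scratch_directory is not None:
        args.append(f"--scratch-directory={scratch_directory}")
    return args


def scoped_name(name_scope: Optional[str], base: str) -> str:
    """Prefix an object name with the invocation scope (assumed scheme)."""
    return f"{name_scope}_{base}" if name_scope else base
```

With a scheme like this, two invocations scoped as "inv1" and "inv2" would create "inv1_t0" and "inv2_t0" rather than both racing for "t0".
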
Today, every per-action try/except site under Scenario.Kill /
Scenario.ZeroDowntimeDeploy is an unconditional swallow:

  try:
      source.create(exe)
      …
  except:
      if exe.db.scenario not in (Scenario.Kill, Scenario.ZeroDowntimeDeploy):
          raise

Under those scenarios, *every* exception is silently dropped — bare
`except:` catches AssertionError, KeyError, TypeError, the whole
mess. The intent was 'tolerate connection drops from the kill thread';
the implementation also tolerates real correctness bugs.

This commit adds is_fault_shaped(exc) — a predicate that returns True
only for messages matching connection-drop / DNS / broker-transport /
Mz-restart shapes (the things the kill thread actually produces) — and
threads it through every swallow site. Non-fault-shaped exceptions
re-raise as before.
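
A minimal sketch of what such a predicate and a swallow site might look like. The pattern list here is hypothetical (the real patterns are not shown in this commit message), and run_tolerating_faults is an illustrative wrapper, not a function from action.py.

```python
import re

# Hypothetical message patterns; the real predicate's list may differ.
_FAULT_PATTERNS = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in (
        r"connection (refused|reset|closed)",         # connection drop
        r"server closed the connection unexpectedly",  # Mz restart
        r"name or service not known",                  # DNS
        r"broker transport failure",                   # Kafka broker transport
    )
]


def is_fault_shaped(exc: BaseException) -> bool:
    """Return True only if the exception message looks like an injected fault."""
    message = str(exc)
    return any(pattern.search(message) for pattern in _FAULT_PATTERNS)


def run_tolerating_faults(action, fault_scenario: bool) -> bool:
    """Run action(); swallow only fault-shaped errors under a fault scenario.

    Returns True if the action completed and False if a fault was swallowed.
    Anything that is not fault-shaped is re-raised.
    """
    try:
        action()
        return True
    except Exception as exc:
        if fault_scenario and is_fault_shaped(exc):
            return False
        raise
```

Catching Exception rather than using a bare except also lets KeyboardInterrupt and SystemExit propagate, which is a side benefit of this shape rather than something the commit message claims.
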

Affected sites (action.py): SQLsmithAction (subprocess fail),
AlterIcebergSinkFromAction, AlterKafkaSinkFromAction, DropRoleAction,
DropClusterAction, DropClusterReplicaAction, GrantPrivilegesAction,
RevokePrivilegesAction, CreateKafkaSourceAction, CreateMySqlSourceAction,
CreatePostgresSourceAction, CreateSqlServerSourceAction, HttpPostAction
(2 sites). Plus executor.py's WS-executor connection-error handler.

Affects every parallel_workload consumer, not just the Antithesis
driver. Existing CI (`test/parallel-workload/mzcompose.py`) that
runs the framework's KillAction worker is the other consumer of these
swallows; the predicate's patterns cover the same shapes the KillAction
produces (connection drop on materialized restart), so behavior is
preserved for them while real exceptions stop being silently dropped.

Adds the build-side scaffolding needed to run an Antithesis test
of Materialize:

  - test/antithesis/Makefile               local build + push wrappers
  - test/antithesis/mzcompose.py           authoritative service graph
  - test/antithesis/export-compose.py      mzcompose.py -> docker-compose.yaml
  - test/antithesis/export-env.py          fingerprint .env generator
  - test/antithesis/push-antithesis.py     image push to the Antithesis registry
  - test/antithesis/config/                mzbuild image carrying the exported
                                           docker-compose.yaml + image refs
  - test/antithesis/workload/              mzbuild image carrying the workload
                                           runner, its dependencies, and a
                                           lightweight stub of materialize/
                                           mzcompose so workload code can
                                           import it inside the container
  - test/antithesis/fault-orchestrator/    quiet/active window orchestrator
                                           that drives fault injection from
                                           inside the test
  - test/antithesis/AGENTS.md              orientation for future contributors

No actual workload drivers, helpers, or properties yet — those come in
subsequent commits.

Wires the Antithesis build into the nightly Buildkite pipeline:

  - ci/test/build-antithesis.sh                builds antithesis-flavored
                                               images and pushes them
  - ci/nightly/pipeline.template.yml           nightly job entry
  - ci/mkpipeline.py / ci/test/build.py        treat CI_ANTITHESIS as a
                                               scoped build flavor on the
                                               same axis as coverage and
                                               sanitizer
  - ci/test/lint-main/checks/
      check-antithesis-compose.sh              guards the exported
                                               docker-compose.yaml against
                                               drift from mzcompose.py
      check-pipeline.sh                        adds the new check to lint
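
The drift guard is a shell script in this PR; the idea behind it can be sketched in Python as below. This is an assumption-laden illustration: compose_drift is a hypothetical name, and it presumes the check amounts to regenerating the YAML from mzcompose.py and diffing it against the committed file.

```python
import difflib
from typing import List


def compose_drift(committed: str, regenerated: str) -> List[str]:
    """Return unified-diff lines between committed and regenerated YAML.

    An empty list means the committed file matches what the exporter
    would produce, so there is no drift.
    """
    return list(
        difflib.unified_diff(
            committed.splitlines(),
            regenerated.splitlines(),
            fromfile="docker-compose.yaml (committed)",
            tofile="docker-compose.yaml (regenerated)",
            lineterm="",
        )
    )
```

A lint step built this way would fail whenever the returned list is non-empty and print the diff so the author can re-run the exporter.
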
Adds the helper library shared by all workload drivers:

  - helper_pg / helper_pg_source / helper_pg_upstream  pg-side helpers
                                                       used both for the
                                                       Materialize SQL
                                                       client and for
                                                       the upstream
                                                       Postgres CDC
                                                       source
  - helper_mysql / helper_mysql_source                 mysql-side
                                                       upstream helpers
                                                       and source DDL
  - helper_kafka                                       kafka topic /
                                                       producer helpers
  - helper_none_source / helper_upsert_source          envelope-shape
                                                       source helpers
  - helper_testdrive                                   in-container
                                                       testdrive runner
  - helper_random / helper_table_mv / helper_source_stats
                                                       small utilities
                                                       used across
                                                       drivers
  - helper_logging                                     per-invocation
                                                       correlation IDs
                                                       and lifecycle
                                                       lines
  - anytime_health_check.sh                            always-on
                                                       Antithesis
                                                       health probe

Adds the kafka-source and upsert workload drivers and their properties:

  - first_select_upsert_implementation                   chooses
                                                         upsert_v1
                                                         or upsert_v2
                                                         for the
                                                         invocation
  - parallel_driver_kafka_none_envelope                  envelope-NONE
                                                         no-data-loss /
                                                         no-duplication
  - parallel_driver_upsert_latest_value                  key reflects
                                                         latest value
                                                         under fault
  - singleton_driver_upsert_state_rehydration            state
                                                         rehydrates
                                                         correctly
                                                         after a
                                                         clusterd
                                                         restart
  - anytime_kafka_frontier_monotonic                     source
                                                         frontier
                                                         never
                                                         regresses
  - anytime_kafka_offset_known_not_below_committed       upstream
                                                         offset
                                                         invariant
  - anytime_kafka_source_resumes_after_fault             liveness
                                                         after broker
                                                         or clusterd
                                                         fault
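
As an illustration of the frontier-monotonicity property, a checker over a series of observed frontier values might look like the sketch below. The function name and the use of plain integers for frontiers are assumptions; the actual anytime script is not shown here.

```python
from typing import Iterable, List, Optional, Tuple


def frontier_regressions(observations: Iterable[int]) -> List[Tuple[int, int]]:
    """Return (high_water_mark, observed) pairs where the frontier went backwards.

    Each observation is compared against the highest frontier seen so far,
    so a dip is reported until the frontier catches up again.
    """
    regressions: List[Tuple[int, int]] = []
    high_water: Optional[int] = None
    for observed in observations:
        if high_water is not None and observed < high_water:
            regressions.append((high_water, observed))
        else:
            high_water = observed
    return regressions
```

An empty result means the property held over the sampled window; repeated equal values are allowed because a stalled frontier is a liveness concern, not a regression.
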
Adds mysql-source workload drivers and their properties:

  - first_mysql_replica_setup                  bootstraps the
                                               upstream MySQL primary
                                               and replica
  - parallel_driver_mysql_cdc                  InnoDB CDC source
                                               correctness under
                                               concurrent DML
  - parallel_driver_mysql_myisam               MyISAM table behavior
                                               (no-data-loss)
  - anytime_mysql_source_no_gtid_errors        GTID monotonicity at
                                               the source level

Adds pg-source workload drivers:

  - first_pg_cdc_setup                  bootstraps the upstream
                                        Postgres + replication slot
  - parallel_driver_pg_cdc              pg-CDC correctness under
                                        concurrent upstream writes
  - singleton_driver_pg_cdc_testdrive   runs a pg-cdc testdrive suite
                                        once per invocation
  - parallel_driver_parallel_workload                drives the
                                                     existing
                                                     parallel_workload
                                                     library against
                                                     a per-invocation
                                                     pool-backed
                                                     cluster
  - parallel_driver_upsert_ancient_key_writable      cross-invocation
                                                     property:
                                                     ancient keys
                                                     remain writable
                                                     after long
                                                     quiescence

Drivers that pair workload-side observations with SUT-side assertion
anchors introduced in the storage/persist/catalog instrumentation
commit:

  - singleton_driver_catalog_recovery_consistency  catalog recovery
                                                   under
                                                   environmentd
                                                   fault
  - parallel_driver_strict_serializable_reads      persist
                                                   strict-serializable
                                                   read property
  - parallel_driver_mv_reflects_table_updates      materialized
                                                   views reflect
                                                   base-table writes
                                                   eventually
  - anytime_fault_recovery_exercised               liveness signal:
                                                   the SUT is
                                                   actually faulting
                                                   and recovering

…lizeInc#11200 / MaterializeInc#11224)

Two parallel_driver shapes that collectively cover the two
peek-sequencing variants of the read-hold downgrade bug:

  - first_explicit_txn_setup                              seed-private
                                                          table
                                                          bootstrap
                                                          for the
                                                          explicit-txn
                                                          driver
  - parallel_driver_explicit_txn_no_since_violation       BEGIN / 8x
                                                          SELECT
                                                          alternating
                                                          table+MV /
                                                          COMMIT;
                                                          exercises the
                                                          in_immediate_multi_stmt_txn
                                                          path
  - first_pw_hot_objects_setup                            bootstrap for
                                                          the
                                                          hot-objects
                                                          driver
  - helper_pw_hot                                         shared helpers
                                                          for the
                                                          hot-objects
                                                          driver
  - parallel_driver_pw_hot_objects                        many drivers
                                                          racing against
                                                          a small fixed
                                                          object pool;
                                                          exercises the
                                                          single-statement
                                                          peek-sequencing
                                                          path
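
The explicit-transaction shape described above (BEGIN, eight SELECTs alternating between a table and an MV, COMMIT) can be sketched as a statement generator. The function name and the count(*) query are assumptions for illustration; the real driver's queries are not shown in this commit message.

```python
from typing import List


def explicit_txn_statements(table: str, mv: str, selects: int = 8) -> List[str]:
    """Build BEGIN / N alternating SELECTs / COMMIT for one transaction."""
    statements = ["BEGIN"]
    for i in range(selects):
        # Even positions read the table, odd positions read the MV.
        target = table if i % 2 == 0 else mv
        statements.append(f"SELECT count(*) FROM {target}")
    statements.append("COMMIT")
    return statements
```
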
@DAlperin DAlperin force-pushed the dov/antithesis-stack branch 3 times, most recently from 3594161 to b8c5f80 Compare May 17, 2026 04:12
…figs

Splits the single all-services compose into one configurable composition
that emits per-group docker-compose YAMLs:

  - kafka              kafka stack + multi-replica clusterd
  - pg-cdc             postgres-source + multi-replica clusterd
  - mysql-cdc          mysql primary+replica + multi-replica clusterd
  - parallel-workload  clusterd pool + multi-replica antithesis_cluster
  - catalog            single clusterd, no upstream sources
  - combined           kitchen sink (every service, every driver)

Why: Antithesis runs on a single core, so the number of services is the
main contended resource. Cutting unused upstreams (kafka, mysql, postgres)
out of stacks that don't exercise them gives the relevant workload more
hypervisor time and tightens per-property signal.

How:
  - test/antithesis/groups.yaml is the single source of truth for
    which services + setup + drivers + anytime scripts belong to each
    group. anytime_health_check and anytime_fault_recovery_exercised
    are auto-added to every group.
  - test/antithesis/export-compose.py takes --group=NAME and filters
    services + workload depends_on, injects ANTITHESIS_WORKLOAD_GROUP
    on the workload service.
  - The workload Dockerfile stages all scripts; the entrypoint reads
    ANTITHESIS_WORKLOAD_GROUP and copies only the selected scripts
    into /opt/antithesis/test/v1/materialize/ so Test Composer
    doesn't see drivers that aren't in scope.
  - One mzbuild config image per group (test/antithesis/configs/<group>/);
    push-antithesis.py and ci/test/build.py iterate over them.

Bookmark dov/antithesis-stack still points at this commit; previous
HEAD becomes commit 12.

Land it; CI nightly schedule (run all groups every night vs. rotate)
is deferred — implementation is configurable via the buildkite step's
ANTITHESIS_WORKLOAD_GROUPS env var.
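
A small sketch of how a build step could read that variable. The comma-separated format, the fallback to a default list, and the function name workload_groups are all assumptions; the commit message only names the variable.

```python
from typing import List, Mapping


def workload_groups(env: Mapping[str, str], default: List[str]) -> List[str]:
    """Parse ANTITHESIS_WORKLOAD_GROUPS as a comma-separated list (assumed format).

    Falls back to the given default when the variable is unset or empty.
    """
    raw = env.get("ANTITHESIS_WORKLOAD_GROUPS", "")
    names = [part.strip() for part in raw.split(",") if part.strip()]
    return names or list(default)
```

In a real step this would be called with os.environ as the env argument.
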
@DAlperin DAlperin force-pushed the dov/antithesis-stack branch from b8c5f80 to bb94223 Compare May 17, 2026 15:30
@DAlperin DAlperin force-pushed the dov/antithesis-stack branch 3 times, most recently from 5a11c03 to e4dcb46 Compare May 19, 2026 00:20
@DAlperin DAlperin force-pushed the dov/antithesis-stack branch from e4dcb46 to af3eae7 Compare May 19, 2026 00:39