
add(django-permission-cohort-postgres): minimal sample for keploy/integrations e2e lane #100

Closed
AkashKumar7902 wants to merge 10 commits into main from
add/django-permission-cohort-postgres

Conversation

@AkashKumar7902
Contributor


TL;DR

Adds a tiny Django sample (django-permission-cohort-postgres/) that fires SELECT django_content_type WHERE app_label=$1 AND model=$2 on every request. The keploy/integrations e2e lane in keploy/integrations#167 clones this sample to regress the postgres-v3 session-fallback lifetime-gate fix that surfaced as the doccano CI failure (keploy/enterprise#1952).

What the sample does

Two endpoints, both deliberately minimal:

| Path | Effect |
| --- | --- |
| `GET /health/` | `{"status":"ok"}` — used as wait-for-app gate |
| `GET /lookup/<app_label>/<model>/` | Calls `ContentType.objects.clear_cache()` then `ContentType.objects.get(app_label=..., model=...)` and returns the row |

clear_cache() before the lookup forces every request to actually hit the database, so the SQL — SELECT django_content_type WHERE app_label=$1 AND model=$2 LIMIT 21 — fires on every call. That query is what pgmatch.Classify tags class: APP, which DeriveLifetime then maps to LifetimePerTest. That tagging is the precondition for the bug the integrations PR fixes.

What's deliberately NOT in the sample

  • No admin
  • No DRF / token / session auth
  • No static files, no templates, no signal handlers, no celery
  • No migrations beyond Django's stock contenttypes / auth tables (those are exactly what the lookup hits)
  • CONN_MAX_AGE=0 (default) — fresh connection per request, so the per-request DB call sequence is deterministic across record and replay

The sample is small on purpose so that the failing query is the only DB traffic that matters at replay time. Anything bigger would dilute the regression target.
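The connection-lifetime bullet above can be made concrete as a `settings.py` fragment (a sketch — host and credential values are assumed placeholders, not the sample's exact file):

```python
# settings.py fragment (sketch). CONN_MAX_AGE=0 is Django's default:
# every request opens a fresh connection and closes it afterwards, so
# the per-request wire traffic keploy records is identical at replay.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "HOST": "db",          # assumed compose service name
        "NAME": "postgres",
        "USER": "postgres",
        "PASSWORD": "postgres",
        "CONN_MAX_AGE": 0,     # no persistent connections
    }
}
```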

How to run it standalone

```sh
docker compose up
curl http://localhost:8080/health/                            # → {"status":"ok"}
curl http://localhost:8080/lookup/auth/user/                  # → {"id":1,"app_label":"auth","model":"user"}
curl http://localhost:8080/lookup/contenttypes/contenttype/   # → {"id":...,"app_label":"contenttypes","model":"contenttype"}
```

How keploy/integrations consumes it

The integrations lane (keploy/integrations:.woodpecker/django-permission-cohort-postgres.yml) has a checkout-samples step that runs:

```sh
git clone --depth 1 --branch add/django-permission-cohort-postgres \
  https://github.com/keploy/samples-python.git samples-python
```

The --branch pin will be dropped in a follow-up commit on the integrations branch once this PR merges to main.

The lane's driver script (keploy/integrations:.ci/scripts/python/django-permission-cohort-postgres/docker.sh) reads SAMPLE_SRC_DIR=$WORKSPACE/samples-python/django-permission-cohort-postgres, copies the sample into a per-run staging dir, overlays the agent-side Dockerfile.v3agent + exerciser.sh on top, and runs keploy record + keploy test across the three-binary matrix (record-build-replay-build / record-latest-replay-build / record-build-replay-latest).

Why a new sample rather than reusing django-postgres/

django-postgres/ is a CRUD-shaped user-data app that exercises Django's ORM but doesn't naturally hit django_content_type (no permission framework, no generic relations, no admin in its request path). The bug being regressed is specifically about the ContentType lookup pattern — the same shape as the doccano failure that motivated it — so the sample is purpose-built for that. Adding it as a sibling rather than re-fitting django-postgres/ keeps both samples narrowly scoped and avoids regressing whatever django-postgres/ is currently used for.

Test plan

  • Standalone docker compose up boots cleanly and responds 200 to /health/ and /lookup/auth/user/
  • keploy/integrations#167 lane (depends on this PR's branch) goes green for the record-build-replay-build cell
  • record-build-replay-latest cell stays red until the integrations fix releases downstream — that asymmetry is the bug-existence signal, called out in the lane's .yml header comment

References

…egrations e2e lane

This sample exists to support the keploy/integrations e2e lane that
regresses the postgres-v3 session-fallback lifetime-gate fix
(keploy/enterprise#1952). Each request to
/lookup/<app_label>/<model>/ clears Django's in-process ContentType
cache and forces the same `SELECT django_content_type WHERE
app_label=$1 AND model=$2` query whose `class: APP` →
`LifetimePerTest` tagging surfaced the bug end-to-end.

The sample is intentionally tiny:

  - No admin, no DRF, no token auth — the bug shape doesn't need them.
  - No migrations beyond Django's stock contenttypes/auth tables —
    those are exactly what the lookup hits.
  - CONN_MAX_AGE=0 (a fresh connection per request) keeps the per-
    request DB call sequence deterministic across record/replay.

Endpoints:

  GET /health/                          → wait-for-app gate
  GET /lookup/<app_label>/<model>/      → ContentType.objects.get(...)
                                          + clear_cache to force the
                                          SQL on every request

The keploy/integrations lane that drives this sample lives at
.woodpecker/django-permission-cohort-postgres.yml — clones this repo
on the corresponding branch, copies the sample dir into the run
workspace, overlays the agent-side Dockerfile, and runs keploy record
+ replay across the three-binary matrix
(record-build-replay-build / record-latest-replay-build /
record-build-replay-latest).

Refs:
  - keploy/enterprise#1952  RCA + reproduction
  - keploy/integrations#167 fix PR
Signed-off-by: Akash Kumar <meakash7902@gmail.com>
@AkashKumar7902
Contributor Author

Sibling PR: keploy/integrations#167 (the postgres-v3 session-fallback fix that needs this sample for its e2e lane). The lane in that PR clones this branch verbatim; once this merges to main, a follow-up commit on the integrations branch will drop the --branch pin.

Pipeline runs of the keploy/integrations e2e lane that consumes this
sample showed transient hangs during replay: gunicorn's default
worker timeout (--timeout 60 in the prior version) was triggering
SIGKILL on workers mid-request when keploy proxy roundtrips spiked
under CI load, and the test that landed on the killed worker would
time out client-side with `context deadline exceeded`. Subsequent
tests after the worker recycle ran fine — classic shape of "matcher
fast enough on average, but a single slow burst breaches the worker
timeout".

Two changes:

1. --timeout 300 (was 60). Keploy's matcher under CI agent load can
   take several seconds per query in pathological cases (cold cohort
   build, Track Y session-fallback probe, etc.); 300s gives the
   proxy headroom without tampering with the test gate
   (--api-timeout in the driver script remains the client-side cap
   on per-test wall time).

2. --workers 4 / --threads 2 (was 2/1). Two workers with one thread
   each meant only two concurrent requests; a single slow request
   blocked the next one in queue and the queue itself hit timeout.
   4×2 gives 8 concurrent slots — enough that the lane's
   single-request-at-a-time replay never queues against itself.

Also adds --graceful-timeout 30 so worker recycle on shutdown stays
quick.
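The same tuning can equivalently be written as a `gunicorn.conf.py` (a sketch — per the message above, the sample actually passes these as CLI flags):

```python
# gunicorn.conf.py equivalent of the CLI flags described above (sketch).
timeout = 300           # worker SIGKILL deadline; headroom for slow proxy roundtrips
workers = 4             # 4 workers x 2 threads = 8 concurrent request slots
threads = 2
graceful_timeout = 30   # bound how long shutdown waits on in-flight requests
bind = "0.0.0.0:8080"   # assumed port, matching the sample's curl examples
```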

Refs:
  - keploy/integrations#167
  - keploy/integrations pipeline #839 (build-build cell, test-8/9 hang
    correlated with `[CRITICAL] WORKER TIMEOUT (pid:39)` gunicorn log)
Signed-off-by: Akash Kumar <meakash7902@gmail.com>
…SLRequest preamble

Pipeline runs of the keploy/integrations e2e lane that consumes this
sample showed a deterministic shape: every FIRST request to a new
endpoint hung 120s, then timed out client-side. Subsequent requests
to the same endpoint passed in <40ms. Across multiple new endpoints
in one replay, that compounds — three new endpoints means three
120s timeouts (#846 build-build cell: 6.54 min total runtime, 3 of
10 tests timed out).

The recorder log gave the lead:

  v3 recorder V2: unexpected SSL response
    response_byte: 82 (ASCII 'R')

psycopg2 (libpq under the hood) sends the SSLRequest preamble on
every fresh connection by default — even for clear-text postgres.
Postgres responds with 'R' (auth request) instead of 'S'/'N',
skipping SSL negotiation entirely. The keploy v3 recorder logs the
mismatch but accepts the recording. At replay, the proxy on the
first fresh connection stalls waiting for the SSL-handshake bytes
that aren't coming, the client request hangs, the test times out.

Setting sslmode=disable in DATABASES.OPTIONS makes psycopg2 skip
the preamble entirely. Connection flow becomes a clean cleartext
path the proxy already handles correctly across record + replay.
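The fix itself is a single OPTIONS entry (sketch — the surrounding connection values are assumed placeholders):

```python
# settings.py fragment (sketch): sslmode=disable makes psycopg2/libpq
# skip the SSLRequest preamble on every fresh connection, so the proxy
# sees a clean cleartext startup at both record and replay.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "HOST": "db",      # assumed compose service name
        "NAME": "postgres",
        "USER": "postgres",
        "PASSWORD": "postgres",
        "OPTIONS": {"sslmode": "disable"},
    }
}
```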

Refs:
  - keploy/integrations#167
  - keploy/integrations pipeline #846 (build-build cell, three
    sequential 120s hangs all on first-request-to-new-endpoint)
Signed-off-by: Akash Kumar <meakash7902@gmail.com>
…p thread

The keploy/integrations e2e lane that consumes this sample runs
clean across record + replay but never exercises the postgres-v3
session-fallback path the lane is meant to regress against —
0 'session-tier read-only fallback served' lines in pipeline #850's
green run. Cause: every DB query in the previous shape was fired
inside an HTTP request, so every capture's request-timestamp landed
inside its test's window and got attributed to that test's perTest
cohort. At replay the perTest path served everything; the lane
never reached the gate the lifetime fix is for.

To produce the bug shape end-to-end the recording must capture
*some* `SELECT django_content_type` invocations *between* HTTP test
windows so the agent's lax-promotion path
(FilterPerTestAndLaxPromotedTierAware in keploy/keploy:pkg/util.go)
routes them into the session pool — that's the on-disk shape
post-fix replay must serve via session-fallback. Anything fired
from inside an HTTP request can't satisfy that constraint by
construction.

Solution: a Django AppConfig.ready() that spawns one daemon thread
per gunicorn worker. The thread loops every BACKGROUND_LOOKUPS_INTERVAL_S
seconds (default 3) firing ContentType.clear_cache() +
ContentType.objects.get(...) against a rotating set of
(app_label, model) targets — same SQL hash as the HTTP-driven
lookups, different binds. The cadence is slow enough not to
saturate gunicorn workers and fast enough that the exerciser's 6s
inter-round pause captures multiple thread-fired lookups *between*
HTTP test windows.

Gating: BACKGROUND_LOOKUPS=1 in docker-compose.yml. Off by default
so the sample stays usable for ad-hoc local Django testing.
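The gating plus loop can be sketched as follows, with the actual ORM call abstracted behind a `do_lookup` callable so the shape is visible without a configured Django. Names like `start_background_lookups` are illustrative, not the sample's apps.py:

```python
# Sketch of the AppConfig.ready() background thread described above.
import itertools
import os
import threading
import time

# Rotating (app_label, model) targets -- same SQL skeleton as the
# HTTP-driven lookups, different bind values.
TARGETS = itertools.cycle([
    ("auth", "user"),
    ("auth", "group"),
    ("contenttypes", "contenttype"),
])


def _lookup_loop(interval_s, do_lookup):
    # do_lookup stands in for ContentType.objects.clear_cache() +
    # ContentType.objects.get(app_label=..., model=...).
    while True:
        app_label, model = next(TARGETS)
        do_lookup(app_label, model)
        time.sleep(interval_s)


def start_background_lookups(do_lookup):
    # Called from AppConfig.ready(); one daemon thread per gunicorn worker.
    if os.environ.get("BACKGROUND_LOOKUPS") != "1":
        return None  # off by default, per the gating note above
    interval = float(os.environ.get("BACKGROUND_LOOKUPS_INTERVAL_S", "3"))
    t = threading.Thread(target=_lookup_loop, args=(interval, do_lookup),
                         daemon=True)
    t.start()
    return t
```

The daemon flag matters: the thread must never block worker shutdown, it exists only to scatter captures between HTTP test windows.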

Refs:
  - keploy/integrations#167
  - keploy/integrations pipeline #850 (lane green but 0 session-fallback
    served — lane was no longer falsifying)
Signed-off-by: Akash Kumar <meakash7902@gmail.com>
Pipeline #863's same-binary cells passed all 10 tests and surfaced
no matcher misses (fix's behavior is correct), but the assertion
that the lane exercises the lifetime-gate fix's specific
PerTest→SessionFallback path was zero-hit. Cause:

  - keploy `test` replays all captured HTTP cases in ~1 second.
  - The previous BACKGROUND_LOOKUPS_INTERVAL_S=3 means at most ONE
    thread-fire lands during the replay-window phase.
  - Even when that lone fire lands inside a test window, the
    dispatcher routes it via the engine's WindowSnapshot tier — and
    most thread fires hit between-test or post-test phases, where
    SessionTransactional serves directly from the session pool
    without needing the PerTest engine's SessionFallback path.

Bumping the cadence to 0.3s gives ~3-5 thread-fires per replay
that land inside an HTTP test window, increasing the probability
that at least one of them hits a perTest cohort empty for its
hash and falls through to the SessionFallback gate the fix
unblocks.

The faster cadence is fine for the gunicorn worker pool —
ContentType.objects.get() is a single-row read on a four-row
table, sub-millisecond. Recording side: thread captures still
land between HTTP test windows during the exerciser's 6s pause,
so the lax-promotion shape that got captures into the session
pool is unaffected.

Refs:
  - keploy/integrations#167
  - keploy/integrations pipeline #863 (build-build cell green on
    tests but check_lane_exercises_session_fallback fires)
Signed-off-by: Akash Kumar <meakash7902@gmail.com>
…tainer

Move `python manage.py migrate` out of the api container's entrypoint
into a one-shot `migrator` service. The api service now starts with
the schema and ContentType rows already present, so the only DB
traffic captured by keploy is the runtime /lookup/ query path.

Why: Django's `post_migrate` signal fires `create_contenttypes`,
which bulk-inserts ContentType rows in a single simple-query INSERT
whose row-tuple ordering depends on model-registration timing during
app boot. When the keploy proxy captures that INSERT, the recording
is fragile: small boot-time shifts between record and replay (the
BACKGROUND_LOOKUPS thread firing during migrate, GIL scheduling
differences, etc) produce a hash-equal but value-divergent INSERT at
replay time, and the matcher correctly rejects it. The keploy/
integrations build-build cell was halting at boot with this exact
shape (`bind values diverged from every recorded invocation`,
candidates=2, sessionFallback probed-empty).

Migrations only need to populate static schema; running them in a
sibling container that bypasses the keploy proxy (keploy intercepts
api's traffic only) keeps the boot path out of mocks.yaml entirely.
The `service_completed_successfully` dependency guarantees api
doesn't start until migrate is done.
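In compose terms the split looks roughly like this (a sketch — service names, the image tag, and the gunicorn command line are assumptions, not the sample's exact file):

```yaml
services:
  db:
    image: postgres:15            # assumed tag
  migrator:
    build: .
    command: python manage.py migrate
    depends_on:
      - db
  api:
    build: .
    command: gunicorn config.wsgi   # assumed module path
    depends_on:
      migrator:
        condition: service_completed_successfully   # api waits for migrate
```

Only `api` sits behind the keploy proxy, so the migrator's boot-time INSERTs never reach mocks.yaml.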

Signed-off-by: Akash Kumar <officialasishkumar@gmail.com>
…inistic migration order

Django's `post_migrate` signal fires `create_contenttypes`, which
bulk-inserts ContentType rows from a model collection iterated in
hash-seed-dependent order on some Django versions. Python's per-
process random hash seed (default behavior since 3.3) means the
INSERT row tuples can land in different orders across two runs of
the same code, even on identical machines and identical container
images.

Under keploy record/replay this surfaces as: the recorder captures
one row order; the replay-time live driver fires a different row
order; SQL skeleton hashes match (same column count, same value
arity) but the inline value tuples diverge, and the matcher
correctly rejects the candidates as `bind values diverged from
every recorded invocation`. The keploy/integrations build-build
cell halted at boot with that exact shape.

PYTHONHASHSEED=0 fixes the seed across runs so iteration order is
deterministic. Affects only this sample app — the lookup endpoints
themselves are already deterministic.
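The mechanism can be demonstrated in isolation — string hashes (and therefore hash-based iteration order) are randomized per interpreter process unless the seed is pinned. This snippet is illustrative, not part of the sample; `hash_in_fresh_interpreter` is an invented helper:

```python
# Each call starts a fresh interpreter, so the hash seed is applied
# at startup, exactly as it would be for a newly booted container.
import subprocess
import sys


def hash_in_fresh_interpreter(seed):
    out = subprocess.run(
        [sys.executable, "-c", "print(hash('django_content_type'))"],
        env={"PYTHONHASHSEED": seed},
        capture_output=True, text=True,
    )
    return out.stdout.strip()

# With PYTHONHASHSEED=0 the hash is identical across runs, so any
# hash-order-dependent iteration (like the contenttypes bulk INSERT's
# row order) becomes reproducible between record and replay.
```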

Signed-off-by: Akash Kumar <officialasishkumar@gmail.com>
Signed-off-by: Akash Kumar <meakash7902@gmail.com>
…pe-lookup thread

The thread was added to scatter `SELECT django_content_type` captures
across HTTP test windows so the keploy/integrations regression-
coverage lane would hit the postgres-v3 session-fallback path. In
practice the path's per-test cohort vs session-pool coverage was
timing-dependent: a thread-fired capture for, say, (auth, user) may
or may not land in the session pool on any given run. When the lane
*did* hit session-fallback for a test whose required bind wasn't in
the pool, pickSessionFallback's FIFO fallback served a wrong-bind
candidate and the test failed on body diff — a separate matcher bug
unrelated to the lifetime-gate fix this sample is paired with.

The deterministic falsifying e2e for the lifetime-gate bug is the
doccano lane in keploy/enterprise; this sample's job is straight
Django+Postgres regression coverage. Drop the thread to keep the
recordings deterministic. apps.py already gates the spawn on
BACKGROUND_LOOKUPS=1 so removing the env var is a clean disable;
the code stays in case the thread is ever needed again.

Signed-off-by: Akash Kumar <officialasishkumar@gmail.com>
Signed-off-by: Akash Kumar <meakash7902@gmail.com>
…falsifier via post-response side query

Adds a daemon thread to the /lookup/ view that fires `SELECT
django_content_type WHERE app_label='auth' AND model='user'` 100ms
after the response is sent. This deliberately creates a record-vs-
replay timing asymmetry: at record (exerciser pacing ~1s) the call
lands between HTTP test windows and is captured into the session
pool with LifetimePerTest (lax-promoted); at replay (keploy `test`
compresses pacing to tens of ms) the same 100ms-delayed call lands
INSIDE a later test's window where the perTest cohort holds a
different bind, so the dispatcher routes PerTest -> SessionFallback.

This is the path the lifetime-gate fix in keploy/integrations#167
unblocks. With the fix in the matcher, the gate accepts the lax-
promoted candidate and the response is served. Without it, the gate
rejects every PerTest-tagged candidate and the matcher logs
`transactional: no invocation matched`.

Used by the keploy/integrations django-permission-cohort-postgres
lane to falsify the bug end-to-end on the cross-version build-latest
cell — without this mechanism the lane is regression coverage but
not falsifying.
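The delayed side query can be sketched with a timer thread; `fire_side_lookup` and `do_lookup` are illustrative names, and the ORM call is stubbed so the shape runs without Django:

```python
# Sketch of the post-response side query described above: fire the
# fixed-bind lookup 100ms after the response is sent, decoupling its
# capture timestamp from the HTTP test window that triggered it.
import threading


def fire_side_lookup(do_lookup, delay_s=0.1):
    # do_lookup stands in for:
    #   ContentType.objects.clear_cache()
    #   ContentType.objects.get(app_label="auth", model="user")
    t = threading.Timer(delay_s, do_lookup)
    t.daemon = True  # never block worker shutdown
    t.start()
    return t
```

The fixed 100ms delay is the whole trick: it is long relative to replay's compressed pacing but short relative to the exerciser's ~1s record-time pacing, which is what creates the record-vs-replay window asymmetry.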

Signed-off-by: Akash Kumar <officialasishkumar@gmail.com>
Signed-off-by: Akash Kumar <meakash7902@gmail.com>