add(django-permission-cohort-postgres): minimal sample for keploy/integrations e2e lane #100
Closed
AkashKumar7902 wants to merge 10 commits into main from
Conversation
…egrations e2e lane

This sample exists to support the keploy/integrations e2e lane that regresses the postgres-v3 session-fallback lifetime-gate fix (keploy/enterprise#1952). Each request to /lookup/<app_label>/<model>/ clears Django's in-process ContentType cache and forces the same `SELECT django_content_type WHERE app_label=$1 AND model=$2` query whose `class: APP` → `LifetimePerTest` tagging surfaced the bug end-to-end.

The sample is intentionally tiny:

- No admin, no DRF, no token auth — the bug shape doesn't need them.
- No migrations beyond Django's stock contenttypes/auth tables — those are exactly what the lookup hits.
- CONN_MAX_AGE=0 (a fresh connection per request) keeps the per-request DB call sequence deterministic across record/replay.

Endpoints:

- GET /health/ → wait-for-app gate
- GET /lookup/<app_label>/<model>/ → ContentType.objects.get(...) + clear_cache to force the SQL on every request

The keploy/integrations lane that drives this sample lives at .woodpecker/django-permission-cohort-postgres.yml — it clones this repo on the corresponding branch, copies the sample dir into the run workspace, overlays the agent-side Dockerfile, and runs keploy record + replay across the three-binary matrix (record-build-replay-build / record-latest-replay-build / record-build-replay-latest).

Refs:
- keploy/enterprise#1952 RCA + reproduction
- keploy/integrations#167 fix PR

Signed-off-by: Akash Kumar <meakash7902@gmail.com>
Sibling PR: keploy/integrations#167 (the postgres-v3 session-fallback fix that needs this sample for its e2e lane). The lane in that PR clones this branch verbatim; once this merges to main, a follow-up commit on the integrations branch will drop the `--branch` pin.
Pipeline runs of the keploy/integrations e2e lane that consumes this sample showed transient hangs during replay: gunicorn's default worker timeout (--timeout 60 in the prior version) was triggering SIGKILL on workers mid-request when keploy proxy roundtrips spiked under CI load, and the test that landed on the killed worker would time out client-side with `context deadline exceeded`. Subsequent tests after the worker recycle ran fine — the classic shape of "matcher fast enough on average, but a single slow burst breaches the worker timeout".

Two changes:

1. --timeout 300 (was 60). Keploy's matcher under CI agent load can take several seconds per query in pathological cases (cold cohort build, Track Y session-fallback probe, etc.); 300s gives the proxy headroom without tampering with the test gate (--api-timeout in the driver script remains the client-side cap on per-test wall time).
2. --workers 4 / --threads 2 (was 2/1). Two workers with one thread each meant only two concurrent requests; a single slow request blocked the next one in the queue, and the queue itself hit the timeout. 4×2 gives 8 concurrent slots — enough that the lane's single-request-at-a-time replay never queues against itself.

Also adds --graceful-timeout 30 so worker recycle on shutdown stays quick.

Refs:
- keploy/integrations#167
- keploy/integrations pipeline #839 (build-build cell, test-8/9 hang correlated with `[CRITICAL] WORKER TIMEOUT (pid:39)` gunicorn log)

Signed-off-by: Akash Kumar <meakash7902@gmail.com>
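As a sketch, the gunicorn settings the message describes would look like this in a `gunicorn.conf.py` (the config-file form is an assumption; the sample may pass them as CLI flags instead):

```python
# gunicorn.conf.py — sketch of the settings described in the commit.
# gunicorn config files are plain Python modules; each name below
# corresponds to the CLI flag of the same name.
timeout = 300           # was 60: headroom for slow keploy proxy roundtrips under CI load
graceful_timeout = 30   # keep worker recycle on shutdown quick
workers = 4             # was 2
threads = 2             # was 1: 4 workers x 2 threads = 8 concurrent slots
```

Loaded with `gunicorn -c gunicorn.conf.py ...`, this is equivalent to the flag set described above.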
…SLRequest preamble
Pipeline runs of the keploy/integrations e2e lane that consumes this
sample showed a deterministic shape: every FIRST request to a new
endpoint hung 120s, then timed out client-side. Subsequent requests
to the same endpoint passed in <40ms. Across multiple new endpoints
in one replay, that compounds — three new endpoints means three
120s timeouts (#846 build-build cell: 6.54 min total runtime, 3 of
10 tests timed out).
The recorder log gave the lead:
v3 recorder V2: unexpected SSL response
response_byte: 82 (ASCII 'R')
psycopg2 (libpq under the hood) sends the SSLRequest preamble on
every fresh connection by default — even for clear-text postgres.
Postgres responds with 'R' (auth request) instead of 'S'/'N',
skipping SSL negotiation entirely. The keploy v3 recorder logs the
mismatch but accepts the recording. At replay, the proxy on the
first fresh connection stalls waiting for the SSL-handshake bytes
that aren't coming, the client request hangs, the test times out.
Setting sslmode=disable in DATABASES.OPTIONS makes psycopg2 skip
the preamble entirely. Connection flow becomes a clean cleartext
path the proxy already handles correctly across record + replay.
Refs:
- keploy/integrations#167
- keploy/integrations pipeline #846 (build-build cell, three
sequential 120s hangs all on first-request-to-new-endpoint)
Signed-off-by: Akash Kumar <meakash7902@gmail.com>
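The fix the message describes would land in the Django settings roughly as follows. The connection values shown are illustrative placeholders; only the `OPTIONS` / `CONN_MAX_AGE` keys come from the source:

```python
# settings.py fragment — sketch of the sslmode fix described above.
# With sslmode=disable, psycopg2/libpq never sends the SSLRequest
# preamble, so the connection opens as plain cleartext — the flow the
# keploy proxy already handles cleanly across record + replay.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "postgres",   # illustrative; the sample likely reads
        "USER": "postgres",   # these from environment variables
        "HOST": "db",
        "PORT": 5432,
        "CONN_MAX_AGE": 0,    # fresh connection per request
        "OPTIONS": {"sslmode": "disable"},
    }
}
```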
…p thread

The keploy/integrations e2e lane that consumes this sample runs clean across record + replay but never exercises the postgres-v3 session-fallback path the lane is meant to regress against — 0 'session-tier read-only fallback served' lines in pipeline #850's green run.

Cause: every DB query in the previous shape was fired inside an HTTP request, so every capture's request-timestamp landed inside its test's window and got attributed to that test's perTest cohort. At replay the perTest path served everything; the lane never reached the gate the lifetime fix is for.

To produce the bug shape end-to-end the recording must capture *some* `SELECT django_content_type` invocations *between* HTTP test windows so the agent's lax-promotion path (FilterPerTestAndLaxPromotedTierAware in keploy/keploy:pkg/util.go) routes them into the session pool — that's the on-disk shape post-fix replay must serve via session-fallback. Anything fired from inside an HTTP request can't satisfy that constraint by construction.

Solution: a Django AppConfig.ready() that spawns one daemon thread per gunicorn worker. The thread loops every BACKGROUND_LOOKUPS_INTERVAL_S seconds (default 3), firing ContentType.clear_cache() + ContentType.objects.get(...) against a rotating set of (app_label, model) targets — same SQL hash as the HTTP-driven lookups, different binds. The cadence is slow enough not to saturate gunicorn workers and fast enough that the exerciser's 6s inter-round pause captures multiple thread-fired lookups *between* HTTP test windows.

Gating: BACKGROUND_LOOKUPS=1 in docker-compose.yml. Off by default so the sample stays usable for ad-hoc local Django testing.

Refs:
- keploy/integrations#167
- keploy/integrations pipeline #850 (lane green but 0 session-fallback served — lane was no longer falsifying)

Signed-off-by: Akash Kumar <meakash7902@gmail.com>
Pipeline #863's same-binary cells passed all 10 tests and surfaced
no matcher misses (fix's behavior is correct), but the assertion
that the lane exercises the lifetime-gate fix's specific
PerTest→SessionFallback path was zero-hit. Cause:
- keploy `test` replays all captured HTTP cases in ~1 second.
- The previous BACKGROUND_LOOKUPS_INTERVAL_S=3 means at most ONE
thread-fire lands during the replay-window phase.
- Even when that lone fire lands inside a test window, the
dispatcher routes it via the engine's WindowSnapshot tier — and
most thread fires hit between-test or post-test phases, where
SessionTransactional serves directly from the session pool
without needing the PerTest engine's SessionFallback path.
Bumping the cadence to 0.3s gives ~3-5 thread-fires per replay
that land inside an HTTP test window, increasing the probability
that at least one of them hits a perTest cohort empty for its
hash and falls through to the SessionFallback gate the fix
unblocks.
The faster cadence is fine for the gunicorn worker pool —
ContentType.objects.get() is a single-row read on a four-row
table, sub-millisecond. Recording side: thread captures still
land between HTTP test windows during the exerciser's 6s pause,
so the lax-promotion shape that got captures into the session
pool is unaffected.
Refs:
- keploy/integrations#167
- keploy/integrations pipeline #863 (build-build cell green on
tests but check_lane_exercises_session_fallback fires)
Signed-off-by: Akash Kumar <meakash7902@gmail.com>
…tainer

Move `python manage.py migrate` out of the api container's entrypoint into a one-shot `migrator` service. The api service now starts with the schema and ContentType rows already present, so the only DB traffic captured by keploy is the runtime /lookup/ query path.

Why: Django's `post_migrate` signal fires `create_contenttypes`, which bulk-inserts ContentType rows in a single simple-query INSERT whose row-tuple ordering depends on model-registration timing during app boot. When the keploy proxy captures that INSERT, the recording is fragile: small boot-time shifts between record and replay (the BACKGROUND_LOOKUPS thread firing during migrate, GIL scheduling differences, etc.) produce a hash-equal but value-divergent INSERT at replay time, and the matcher correctly rejects it. The keploy/integrations build-build cell was halting at boot with this exact shape (`bind values diverged from every recorded invocation`, candidates=2, sessionFallback probed-empty).

Migrations only need to populate static schema; running them in a sibling container that bypasses the keploy proxy (keploy intercepts api's traffic only) keeps the boot path out of mocks.yaml entirely. The `service_completed_successfully` dependency guarantees api doesn't start until migrate is done.

Signed-off-by: Akash Kumar <officialasishkumar@gmail.com>
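In docker-compose terms, the one-shot migrator shape would look roughly like this — a hedged sketch, with image tags and the wsgi module path as illustrative assumptions, not values from the sample:

```yaml
services:
  db:
    image: postgres:15            # illustrative tag

  migrator:                       # one-shot: runs migrate, then exits 0
    build: .
    command: python manage.py migrate
    depends_on:
      - db

  api:
    build: .
    command: gunicorn config.wsgi # illustrative module path
    depends_on:
      migrator:
        condition: service_completed_successfully  # api waits for migrate
```

Only `api` sits behind the keploy proxy, so the migrate-time INSERTs never reach mocks.yaml.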
…ling container" This reverts commit 3dbe4dd.
…inistic migration order

Django's `post_migrate` signal fires `create_contenttypes`, which bulk-inserts ContentType rows from a model collection iterated in hash-seed-dependent order on some Django versions. Python's per-process random hash seed (default behavior since 3.3) means the INSERT row tuples can land in different orders across two runs of the same code, even on identical machines and identical container images.

Under keploy record/replay this surfaces as: the recorder captures one row order; the replay-time live driver fires a different row order; SQL skeleton hashes match (same column count, same value arity) but the inline value tuples diverge, and the matcher correctly rejects the candidates as `bind values diverged from every recorded invocation`. The keploy/integrations build-build cell halted at boot with that exact shape.

PYTHONHASHSEED=0 fixes the seed across runs so iteration order is deterministic. Affects only this sample app — the lookup endpoints themselves are already deterministic.

Signed-off-by: Akash Kumar <officialasishkumar@gmail.com>
Signed-off-by: Akash Kumar <meakash7902@gmail.com>
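The hash-seed effect is easy to observe with a stdlib-only sketch: the same set literal printed from child interpreters keeps a stable iteration order only when PYTHONHASHSEED is pinned.

```python
# Demonstrates why PYTHONHASHSEED=0 makes iteration order reproducible:
# string hashing (and thus set/dict-collection order in hash-ordered
# containers) is seeded per process unless PYTHONHASHSEED is fixed.
import os
import subprocess
import sys


def set_order(seed):
    """Print a set of app labels from a child interpreter under the
    given hash seed and return the rendered order."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    out = subprocess.run(
        [sys.executable, "-c",
         "print(list({'auth', 'admin', 'contenttypes', 'sessions'}))"],
        capture_output=True, text=True, env=env,
    )
    return out.stdout.strip()
```

With `PYTHONHASHSEED=0`, every child process prints the same order; with the default random seed, two runs are free to differ, which is exactly the record-vs-replay divergence the commit fixes.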
…pe-lookup thread

The thread was added to scatter `SELECT django_content_type` captures across HTTP test windows so the keploy/integrations regression-coverage lane would hit the postgres-v3 session-fallback path. In practice the path's per-test cohort vs session-pool coverage was timing-dependent: a thread-fired capture for, say, (auth, user) may or may not land in the session pool on any given run. When the lane *did* hit session-fallback for a test whose required bind wasn't in the pool, pickSessionFallback's FIFO fallback served a wrong-bind candidate and the test failed on body diff — a separate matcher bug unrelated to the lifetime-gate fix this sample is paired with.

The deterministic falsifying e2e for the lifetime-gate bug is the doccano lane in keploy/enterprise; this sample's job is straight Django+Postgres regression coverage. Drop the thread to keep the recordings deterministic. apps.py already gates the spawn on BACKGROUND_LOOKUPS=1, so removing the env var is a clean disable; the code stays in case the thread is ever needed again.

Signed-off-by: Akash Kumar <officialasishkumar@gmail.com>
Signed-off-by: Akash Kumar <meakash7902@gmail.com>
…falsifier via post-response side query

Adds a daemon thread to the /lookup/ view that fires `SELECT django_content_type WHERE app_label='auth' AND model='user'` 100ms after the response is sent. This deliberately creates a record-vs-replay timing asymmetry: at record (exerciser pacing ~1s) the call lands between HTTP test windows and is captured into the session pool with LifetimePerTest (lax-promoted); at replay (keploy `test` compresses pacing to tens of ms) the same 100ms-delayed call lands INSIDE a later test's window where the perTest cohort holds a different bind, so the dispatcher routes PerTest -> SessionFallback.

This is the path the lifetime-gate fix in keploy/integrations#167 unblocks. With the fix in the matcher, the gate accepts the lax-promoted candidate and the response is served. Without it, the gate rejects every PerTest-tagged candidate and the matcher logs `transactional: no invocation matched`.

Used by the keploy/integrations django-permission-cohort-postgres lane to falsify the bug end-to-end on the cross-version build-latest cell — without this mechanism the lane is regression coverage but not falsifying.

Signed-off-by: Akash Kumar <officialasishkumar@gmail.com>
Signed-off-by: Akash Kumar <meakash7902@gmail.com>
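The delayed side-query mechanic can be sketched stdlib-only with `threading.Timer`. The view wiring and the injected `do_lookup` callable are illustrative; the 100ms delay and the fixed `(auth, user)` bind come from the message:

```python
# Sketch of the post-response side query described in the commit: one
# fixed-bind lookup scheduled shortly AFTER the HTTP response is sent.
import threading

SIDE_QUERY_DELAY_S = 0.1  # 100ms after the response, per the commit


def fire_side_query_later(do_lookup, delay=SIDE_QUERY_DELAY_S):
    """Schedule do_lookup('auth', 'user') on a daemon timer thread.
    At record pacing (~1s between requests) the fire lands between HTTP
    test windows; at replay's compressed pacing it lands inside a later
    test's window — the asymmetry that routes PerTest -> SessionFallback."""
    t = threading.Timer(delay, do_lookup, args=("auth", "user"))
    t.daemon = True  # never blocks worker shutdown
    t.start()
    return t
```

In the sample, `do_lookup` would be the `ContentType.clear_cache()` + `ContentType.objects.get(...)` pair and the call would sit at the end of the /lookup/ view.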
add(django-permission-cohort-postgres): minimal sample for keploy/integrations e2e lane
TL;DR
Adds a tiny Django sample (`django-permission-cohort-postgres/`) that fires `SELECT django_content_type WHERE app_label=$1 AND model=$2` on every request. The keploy/integrations e2e lane in keploy/integrations#167 clones this sample to regress the postgres-v3 session-fallback lifetime-gate fix that surfaced as the doccano CI failure (keploy/enterprise#1952).

What the sample does
Two endpoints, both deliberately minimal:
- `GET /health/` → `{"status":"ok"}` — used as the wait-for-app gate
- `GET /lookup/<app_label>/<model>/` → `ContentType.objects.clear_cache()` then `ContentType.objects.get(app_label=..., model=...)`, and returns the row

`clear_cache()` before the lookup forces every request to actually hit the database, so the SQL — `SELECT django_content_type WHERE app_label=$1 AND model=$2 LIMIT 21` — fires on every call. That query is what `pgmatch.Classify` tags `class: APP`, which `DeriveLifetime` then maps to `LifetimePerTest`. That tagging is the precondition for the bug the integrations PR fixes.

What's deliberately NOT in the sample
- `CONN_MAX_AGE=0` (default) — fresh connection per request, so the per-request DB call sequence is deterministic across record and replay

The sample is small on purpose so that the failing query is the only DB traffic that matters at replay time. Anything bigger would dilute the regression target.
How to run it standalone
How keploy/integrations consumes it
The integrations lane (`keploy/integrations:.woodpecker/django-permission-cohort-postgres.yml`) has a `checkout-samples` step that runs the clone. The `--branch` pin will be dropped in a follow-up commit on the integrations branch once this PR merges to main.

The lane's driver script (`keploy/integrations:.ci/scripts/python/django-permission-cohort-postgres/docker.sh`) reads `SAMPLE_SRC_DIR=$WORKSPACE/samples-python/django-permission-cohort-postgres`, copies the sample into a per-run staging dir, overlays the agent-side `Dockerfile.v3agent` + `exerciser.sh` on top, and runs `keploy record` + `keploy test` across the three-binary matrix (`record-build-replay-build` / `record-latest-replay-build` / `record-build-replay-latest`).

Why a new sample rather than reusing `django-postgres/`

`django-postgres/` is a CRUD-shaped user-data app that exercises Django's ORM but doesn't naturally hit `django_content_type` (no permission framework, no generic relations, no admin in its request path). The bug being regressed is specifically about the `ContentType` lookup pattern — same shape as the doccano failure that motivated it — so the sample is purpose-built for that. Adding it as a sibling rather than re-fitting `django-postgres/` keeps both samples narrowly scoped and avoids regressing whatever `django-postgres/` is currently used for.

Test plan
- `docker compose up` boots cleanly and responds 200 to `/health/` and `/lookup/auth/user/`
- `record-build-replay-build` cell
- `record-build-replay-latest` cell stays red until the integrations fix releases downstream — that asymmetry is the bug-existence signal, called out in the lane's `.yml` header comment

References