
add(django-permission-cohort-postgres): minimal sample for keploy/integrations e2e lane #100

Closed
AkashKumar7902 wants to merge 10 commits into main from
add/django-permission-cohort-postgres

Conversation

@AkashKumar7902
Contributor


TL;DR

Adds a tiny Django sample (django-permission-cohort-postgres/) that fires SELECT django_content_type WHERE app_label=$1 AND model=$2 on every request. The keploy/integrations e2e lane in keploy/integrations#167 clones this sample to regress the postgres-v3 session-fallback lifetime-gate fix that surfaced as the doccano CI failure (keploy/enterprise#1952).

What the sample does

Two endpoints, both deliberately minimal:

| Path | Effect |
| --- | --- |
| `GET /health/` | `{"status":"ok"}` — used as wait-for-app gate |
| `GET /lookup/<app_label>/<model>/` | Calls `ContentType.objects.clear_cache()` then `ContentType.objects.get(app_label=..., model=...)` and returns the row |

clear_cache() before the lookup forces every request to actually hit the database, so the SQL — SELECT django_content_type WHERE app_label=$1 AND model=$2 LIMIT 21 — fires on every call. That query is what pgmatch.Classify tags class: APP, which DeriveLifetime then maps to LifetimePerTest. That tagging is the precondition for the bug the integrations PR fixes.

What's deliberately NOT in the sample

  • No admin
  • No DRF / token / session auth
  • No static files, no templates, no signal handlers, no celery
  • No migrations beyond Django's stock contenttypes / auth tables (those are exactly what the lookup hits)
  • CONN_MAX_AGE=0 (default) — fresh connection per request, so the per-request DB call sequence is deterministic across record and replay

The sample is small on purpose so that the failing query is the only DB traffic that matters at replay time. Anything bigger would dilute the regression target.
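The connection-lifetime bullet above can be made concrete as a `settings.py` fragment (a sketch — host and credential values are assumed placeholders, not the sample's exact file):

```python
# settings.py fragment (sketch). CONN_MAX_AGE=0 is Django's default:
# every request opens a fresh connection and closes it afterwards, so
# the per-request wire traffic keploy records is identical at replay.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "HOST": "db",          # assumed compose service name
        "NAME": "postgres",
        "USER": "postgres",
        "PASSWORD": "postgres",
        "CONN_MAX_AGE": 0,     # no persistent connections
    }
}
```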

How to run it standalone

```sh
docker compose up
curl http://localhost:8080/health/                            # → {"status":"ok"}
curl http://localhost:8080/lookup/auth/user/                  # → {"id":1,"app_label":"auth","model":"user"}
curl http://localhost:8080/lookup/contenttypes/contenttype/   # → {"id":...,"app_label":"contenttypes","model":"contenttype"}
```

How keploy/integrations consumes it

The integrations lane (keploy/integrations:.woodpecker/django-permission-cohort-postgres.yml) has a checkout-samples step that runs:

```sh
git clone --depth 1 --branch add/django-permission-cohort-postgres \
  https://github.com/keploy/samples-python.git samples-python
```

The --branch pin will be dropped in a follow-up commit on the integrations branch once this PR merges to main.

The lane's driver script (keploy/integrations:.ci/scripts/python/django-permission-cohort-postgres/docker.sh) reads SAMPLE_SRC_DIR=$WORKSPACE/samples-python/django-permission-cohort-postgres, copies the sample into a per-run staging dir, overlays the agent-side Dockerfile.v3agent + exerciser.sh on top, and runs keploy record + keploy test across the three-binary matrix (record-build-replay-build / record-latest-replay-build / record-build-replay-latest).

Why a new sample rather than reusing django-postgres/

django-postgres/ is a CRUD-shaped user-data app that exercises Django's ORM but doesn't naturally hit django_content_type (no permission framework, no generic relations, no admin in its request path). The bug being regressed is specifically about the ContentType lookup pattern — the same shape as the doccano failure that motivated it — so the sample is purpose-built for that. Adding it as a sibling rather than re-fitting django-postgres/ keeps both samples narrowly scoped and avoids regressing whatever django-postgres/ is currently used for.

Test plan

  • Standalone docker compose up boots cleanly and responds 200 to /health/ and /lookup/auth/user/
  • keploy/integrations#167 lane (depends on this PR's branch) goes green for the record-build-replay-build cell
  • record-build-replay-latest cell stays red until the integrations fix releases downstream — that asymmetry is the bug-existence signal, called out in the lane's .yml header comment

References

…egrations e2e lane

This sample exists to support the keploy/integrations e2e lane that
regresses the postgres-v3 session-fallback lifetime-gate fix
(keploy/enterprise#1952). Each request to
/lookup/<app_label>/<model>/ clears Django's in-process ContentType
cache and forces the same `SELECT django_content_type WHERE
app_label=$1 AND model=$2` query whose `class: APP` →
`LifetimePerTest` tagging surfaced the bug end-to-end.

The sample is intentionally tiny:

  - No admin, no DRF, no token auth — the bug shape doesn't need them.
  - No migrations beyond Django's stock contenttypes/auth tables —
    those are exactly what the lookup hits.
  - CONN_MAX_AGE=0 (a fresh connection per request) keeps the per-
    request DB call sequence deterministic across record/replay.

Endpoints:

  GET /health/                          → wait-for-app gate
  GET /lookup/<app_label>/<model>/      → ContentType.objects.get(...)
                                          + clear_cache to force the
                                          SQL on every request

The keploy/integrations lane that drives this sample lives at
.woodpecker/django-permission-cohort-postgres.yml — clones this repo
on the corresponding branch, copies the sample dir into the run
workspace, overlays the agent-side Dockerfile, and runs keploy record
+ replay across the three-binary matrix
(record-build-replay-build / record-latest-replay-build /
record-build-replay-latest).

Refs:
  - keploy/enterprise#1952  RCA + reproduction
  - keploy/integrations#167 fix PR
Signed-off-by: Akash Kumar <meakash7902@gmail.com>
@AkashKumar7902
Contributor Author

Sibling PR: keploy/integrations#167 (the postgres-v3 session-fallback fix that needs this sample for its e2e lane). The lane in that PR clones this branch verbatim; once this merges to main, a follow-up commit on the integrations branch will drop the --branch pin.

Pipeline runs of the keploy/integrations e2e lane that consumes this
sample showed transient hangs during replay: gunicorn's default
worker timeout (--timeout 60 in the prior version) was triggering
SIGKILL on workers mid-request when keploy proxy roundtrips spiked
under CI load, and the test that landed on the killed worker would
time out client-side with `context deadline exceeded`. Subsequent
tests after the worker recycle ran fine — classic shape of "matcher
fast enough on average, but a single slow burst breaches the worker
timeout".

Two changes:

1. --timeout 300 (was 60). Keploy's matcher under CI agent load can
   take several seconds per query in pathological cases (cold cohort
   build, Track Y session-fallback probe, etc.); 300s gives the
   proxy headroom without tampering with the test gate
   (--api-timeout in the driver script remains the client-side cap
   on per-test wall time).

2. --workers 4 / --threads 2 (was 2/1). Two workers with one thread
   each meant only two concurrent requests; a single slow request
   blocked the next one in queue and the queue itself hit timeout.
   4×2 gives 8 concurrent slots — enough that the lane's
   single-request-at-a-time replay never queues against itself.

Also adds --graceful-timeout 30 so worker recycle on shutdown stays
quick.
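The same tuning can equivalently be written as a `gunicorn.conf.py` (a sketch — per the message above, the sample actually passes these as CLI flags):

```python
# gunicorn.conf.py equivalent of the CLI flags described above (sketch).
timeout = 300           # worker SIGKILL deadline; headroom for slow proxy roundtrips
workers = 4             # 4 workers x 2 threads = 8 concurrent request slots
threads = 2
graceful_timeout = 30   # bound how long shutdown waits on in-flight requests
bind = "0.0.0.0:8080"   # assumed port, matching the sample's curl examples
```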

Refs:
  - keploy/integrations#167
  - keploy/integrations pipeline #839 (build-build cell, test-8/9 hang
    correlated with `[CRITICAL] WORKER TIMEOUT (pid:39)` gunicorn log)
Signed-off-by: Akash Kumar <meakash7902@gmail.com>
…SLRequest preamble

Pipeline runs of the keploy/integrations e2e lane that consumes this
sample showed a deterministic shape: every FIRST request to a new
endpoint hung 120s, then timed out client-side. Subsequent requests
to the same endpoint passed in <40ms. Across multiple new endpoints
in one replay, that compounds — three new endpoints means three
120s timeouts (#846 build-build cell: 6.54 min total runtime, 3 of
10 tests timed out).

The recorder log gave the lead:

  v3 recorder V2: unexpected SSL response
    response_byte: 82 (ASCII 'R')

psycopg2 (libpq under the hood) sends the SSLRequest preamble on
every fresh connection by default — even for clear-text postgres.
Postgres responds with 'R' (auth request) instead of 'S'/'N',
skipping SSL negotiation entirely. The keploy v3 recorder logs the
mismatch but accepts the recording. At replay, the proxy on the
first fresh connection stalls waiting for the SSL-handshake bytes
that aren't coming, the client request hangs, the test times out.

Setting sslmode=disable in DATABASES.OPTIONS makes psycopg2 skip
the preamble entirely. Connection flow becomes a clean cleartext
path the proxy already handles correctly across record + replay.
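The fix itself is a single OPTIONS entry (sketch — the surrounding connection values are assumed placeholders):

```python
# settings.py fragment (sketch): sslmode=disable makes psycopg2/libpq
# skip the SSLRequest preamble on every fresh connection, so the proxy
# sees a clean cleartext startup at both record and replay.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "HOST": "db",      # assumed compose service name
        "NAME": "postgres",
        "USER": "postgres",
        "PASSWORD": "postgres",
        "OPTIONS": {"sslmode": "disable"},
    }
}
```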

Refs:
  - keploy/integrations#167
  - keploy/integrations pipeline #846 (build-build cell, three
    sequential 120s hangs all on first-request-to-new-endpoint)
Signed-off-by: Akash Kumar <meakash7902@gmail.com>
…p thread

The keploy/integrations e2e lane that consumes this sample runs
clean across record + replay but never exercises the postgres-v3
session-fallback path the lane is meant to regress against —
0 'session-tier read-only fallback served' lines in pipeline #850's
green run. Cause: every DB query in the previous shape was fired
inside an HTTP request, so every capture's request-timestamp landed
inside its test's window and got attributed to that test's perTest
cohort. At replay the perTest path served everything; the lane
never reached the gate the lifetime fix is for.

To produce the bug shape end-to-end the recording must capture
*some* `SELECT django_content_type` invocations *between* HTTP test
windows so the agent's lax-promotion path
(FilterPerTestAndLaxPromotedTierAware in keploy/keploy:pkg/util.go)
routes them into the session pool — that's the on-disk shape
post-fix replay must serve via session-fallback. Anything fired
from inside an HTTP request can't satisfy that constraint by
construction.

Solution: a Django AppConfig.ready() that spawns one daemon thread
per gunicorn worker. The thread loops every BACKGROUND_LOOKUPS_INTERVAL_S
seconds (default 3) firing ContentType.clear_cache() +
ContentType.objects.get(...) against a rotating set of
(app_label, model) targets — same SQL hash as the HTTP-driven
lookups, different binds. The cadence is slow enough not to
saturate gunicorn workers and fast enough that the exerciser's 6s
inter-round pause captures multiple thread-fired lookups *between*
HTTP test windows.

Gating: BACKGROUND_LOOKUPS=1 in docker-compose.yml. Off by default
so the sample stays usable for ad-hoc local Django testing.
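The gating plus loop can be sketched as follows, with the actual ORM call abstracted behind a `do_lookup` callable so the shape is visible without a configured Django. Names like `start_background_lookups` are illustrative, not the sample's apps.py:

```python
# Sketch of the AppConfig.ready() background thread described above.
import itertools
import os
import threading
import time

# Rotating (app_label, model) targets -- same SQL skeleton as the
# HTTP-driven lookups, different bind values.
TARGETS = itertools.cycle([
    ("auth", "user"),
    ("auth", "group"),
    ("contenttypes", "contenttype"),
])


def _lookup_loop(interval_s, do_lookup):
    # do_lookup stands in for ContentType.objects.clear_cache() +
    # ContentType.objects.get(app_label=..., model=...).
    while True:
        app_label, model = next(TARGETS)
        do_lookup(app_label, model)
        time.sleep(interval_s)


def start_background_lookups(do_lookup):
    # Called from AppConfig.ready(); one daemon thread per gunicorn worker.
    if os.environ.get("BACKGROUND_LOOKUPS") != "1":
        return None  # off by default, per the gating note above
    interval = float(os.environ.get("BACKGROUND_LOOKUPS_INTERVAL_S", "3"))
    t = threading.Thread(target=_lookup_loop, args=(interval, do_lookup),
                         daemon=True)
    t.start()
    return t
```

The daemon flag matters: the thread must never block worker shutdown, it exists only to scatter captures between HTTP test windows.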

Refs:
  - keploy/integrations#167
  - keploy/integrations pipeline #850 (lane green but 0 session-fallback
    served — lane was no longer falsifying)
Signed-off-by: Akash Kumar <meakash7902@gmail.com>
Pipeline #863's same-binary cells passed all 10 tests and surfaced
no matcher misses (fix's behavior is correct), but the assertion
that the lane exercises the lifetime-gate fix's specific
PerTest→SessionFallback path was zero-hit. Cause:

  - keploy `test` replays all captured HTTP cases in ~1 second.
  - The previous BACKGROUND_LOOKUPS_INTERVAL_S=3 means at most ONE
    thread-fire lands during the replay-window phase.
  - Even when that lone fire lands inside a test window, the
    dispatcher routes it via the engine's WindowSnapshot tier — and
    most thread fires hit between-test or post-test phases, where
    SessionTransactional serves directly from the session pool
    without needing the PerTest engine's SessionFallback path.

Bumping the cadence to 0.3s gives ~3-5 thread-fires per replay
that land inside an HTTP test window, increasing the probability
that at least one of them hits a perTest cohort empty for its
hash and falls through to the SessionFallback gate the fix
unblocks.

The faster cadence is fine for the gunicorn worker pool —
ContentType.objects.get() is a single-row read on a four-row
table, sub-millisecond. Recording side: thread captures still
land between HTTP test windows during the exerciser's 6s pause,
so the lax-promotion shape that got captures into the session
pool is unaffected.

Refs:
  - keploy/integrations#167
  - keploy/integrations pipeline #863 (build-build cell green on
    tests but check_lane_exercises_session_fallback fires)
Signed-off-by: Akash Kumar <meakash7902@gmail.com>
…tainer

Move `python manage.py migrate` out of the api container's entrypoint
into a one-shot `migrator` service. The api service now starts with
the schema and ContentType rows already present, so the only DB
traffic captured by keploy is the runtime /lookup/ query path.

Why: Django's `post_migrate` signal fires `create_contenttypes`,
which bulk-inserts ContentType rows in a single simple-query INSERT
whose row-tuple ordering depends on model-registration timing during
app boot. When the keploy proxy captures that INSERT, the recording
is fragile: small boot-time shifts between record and replay (the
BACKGROUND_LOOKUPS thread firing during migrate, GIL scheduling
differences, etc) produce a hash-equal but value-divergent INSERT at
replay time, and the matcher correctly rejects it. The keploy/
integrations build-build cell was halting at boot with this exact
shape (`bind values diverged from every recorded invocation`,
candidates=2, sessionFallback probed-empty).

Migrations only need to populate static schema; running them in a
sibling container that bypasses the keploy proxy (keploy intercepts
api's traffic only) keeps the boot path out of mocks.yaml entirely.
The `service_completed_successfully` dependency guarantees api
doesn't start until migrate is done.
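In compose terms the split looks roughly like this (a sketch — service names, the image tag, and the gunicorn command line are assumptions, not the sample's exact file):

```yaml
services:
  db:
    image: postgres:15            # assumed tag
  migrator:
    build: .
    command: python manage.py migrate
    depends_on:
      - db
  api:
    build: .
    command: gunicorn config.wsgi   # assumed module path
    depends_on:
      migrator:
        condition: service_completed_successfully   # api waits for migrate
```

Only `api` sits behind the keploy proxy, so the migrator's boot-time INSERTs never reach mocks.yaml.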

Signed-off-by: Akash Kumar <officialasishkumar@gmail.com>
…inistic migration order

Django's `post_migrate` signal fires `create_contenttypes`, which
bulk-inserts ContentType rows from a model collection iterated in
hash-seed-dependent order on some Django versions. Python's per-
process random hash seed (default behavior since 3.3) means the
INSERT row tuples can land in different orders across two runs of
the same code, even on identical machines and identical container
images.

Under keploy record/replay this surfaces as: the recorder captures
one row order; the replay-time live driver fires a different row
order; SQL skeleton hashes match (same column count, same value
arity) but the inline value tuples diverge, and the matcher
correctly rejects the candidates as `bind values diverged from
every recorded invocation`. The keploy/integrations build-build
cell halted at boot with that exact shape.

PYTHONHASHSEED=0 fixes the seed across runs so iteration order is
deterministic. Affects only this sample app — the lookup endpoints
themselves are already deterministic.
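The mechanism can be demonstrated in isolation — string hashes (and therefore hash-based iteration order) are randomized per interpreter process unless the seed is pinned. This snippet is illustrative, not part of the sample; `hash_in_fresh_interpreter` is an invented helper:

```python
# Each call starts a fresh interpreter, so the hash seed is applied
# at startup, exactly as it would be for a newly booted container.
import subprocess
import sys


def hash_in_fresh_interpreter(seed):
    out = subprocess.run(
        [sys.executable, "-c", "print(hash('django_content_type'))"],
        env={"PYTHONHASHSEED": seed},
        capture_output=True, text=True,
    )
    return out.stdout.strip()

# With PYTHONHASHSEED=0 the hash is identical across runs, so any
# hash-order-dependent iteration (like the contenttypes bulk INSERT's
# row order) becomes reproducible between record and replay.
```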

Signed-off-by: Akash Kumar <officialasishkumar@gmail.com>
Signed-off-by: Akash Kumar <meakash7902@gmail.com>
…pe-lookup thread

The thread was added to scatter `SELECT django_content_type` captures
across HTTP test windows so the keploy/integrations regression-
coverage lane would hit the postgres-v3 session-fallback path. In
practice the path's per-test cohort vs session-pool coverage was
timing-dependent: a thread-fired capture for, say, (auth, user) may
or may not land in the session pool on any given run. When the lane
*did* hit session-fallback for a test whose required bind wasn't in
the pool, pickSessionFallback's FIFO fallback served a wrong-bind
candidate and the test failed on body diff — a separate matcher bug
unrelated to the lifetime-gate fix this sample is paired with.

The deterministic falsifying e2e for the lifetime-gate bug is the
doccano lane in keploy/enterprise; this sample's job is straight
Django+Postgres regression coverage. Drop the thread to keep the
recordings deterministic. apps.py already gates the spawn on
BACKGROUND_LOOKUPS=1 so removing the env var is a clean disable;
the code stays in case the thread is ever needed again.

Signed-off-by: Akash Kumar <officialasishkumar@gmail.com>
Signed-off-by: Akash Kumar <meakash7902@gmail.com>
…falsifier via post-response side query

Adds a daemon thread to the /lookup/ view that fires `SELECT
django_content_type WHERE app_label='auth' AND model='user'` 100ms
after the response is sent. This deliberately creates a record-vs-
replay timing asymmetry: at record (exerciser pacing ~1s) the call
lands between HTTP test windows and is captured into the session
pool with LifetimePerTest (lax-promoted); at replay (keploy `test`
compresses pacing to tens of ms) the same 100ms-delayed call lands
INSIDE a later test's window where the perTest cohort holds a
different bind, so the dispatcher routes PerTest -> SessionFallback.

This is the path the lifetime-gate fix in keploy/integrations#167
unblocks. With the fix in the matcher, the gate accepts the lax-
promoted candidate and the response is served. Without it, the gate
rejects every PerTest-tagged candidate and the matcher logs
`transactional: no invocation matched`.

Used by the keploy/integrations django-permission-cohort-postgres
lane to falsify the bug end-to-end on the cross-version build-latest
cell — without this mechanism the lane is regression coverage but
not falsifying.
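The delayed side query can be sketched with a timer thread; `fire_side_lookup` and `do_lookup` are illustrative names, and the ORM call is stubbed so the shape runs without Django:

```python
# Sketch of the post-response side query described above: fire the
# fixed-bind lookup 100ms after the response is sent, decoupling its
# capture timestamp from the HTTP test window that triggered it.
import threading


def fire_side_lookup(do_lookup, delay_s=0.1):
    # do_lookup stands in for:
    #   ContentType.objects.clear_cache()
    #   ContentType.objects.get(app_label="auth", model="user")
    t = threading.Timer(delay_s, do_lookup)
    t.daemon = True  # never block worker shutdown
    t.start()
    return t
```

The fixed 100ms delay is the whole trick: it is long relative to replay's compressed pacing but short relative to the exerciser's ~1s record-time pacing, which is what creates the record-vs-replay window asymmetry.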

Signed-off-by: Akash Kumar <officialasishkumar@gmail.com>
Signed-off-by: Akash Kumar <meakash7902@gmail.com>