
Simplify execution model to worker + owngil modes (v3.0.0)#61

Merged
benoitc merged 17 commits into main from feature/simplify-execution-model
May 1, 2026
Conversation


@benoitc benoitc commented Apr 8, 2026

Summary

This is the preparation for the v3.0.0 release. The public surface gets smaller, while the internals are reorganized around per-context worker pthreads.

The set of public modes is now worker | owngil. Other atoms, such as subinterp, auto, multi_executor, or free_threaded, are no longer accepted. The NIF boundary validates the mode strictly and returns {error, {invalid_mode, _}} for anything else, which closes a small loophole where py_reactor_context used to silently fall back to worker.

A few APIs that did not really earn their keep have been removed: py:num_executors/0, py:async_stream/3,4, and the entire py:subinterp_* family of explicit handles. For OWN_GIL parallelism, the path is now py_context:new(#{mode => owngil}), which gives the same isolation with proper OTP supervision. The num_executors and num_async_workers config keys are gone too (they were already no-ops), and the legacy py_async_pool gen_server goes with them.

On the async side, two things were genuinely broken and are now fixed. py:async_call/3,4 followed by py:async_await/1,2 used to time out silently, because the await was listening for the wrong message tag. And py:async_gather/1,2 simply returned gather_not_implemented. Both now round-trip correctly on top of py_event_loop.

The architectural change underneath is that each context owns its dedicated pthread, with stable thread affinity. This is what we needed to make numpy, torch and tensorflow happy with their thread-local state. The multi-executor pool, the subinterpreter pool, and the various legacy scaffolds that used to live alongside are gone.

For Python 3.14+, we no longer install the deprecated global asyncio.set_event_loop_policy. The two recommended entry points, erlang.run(main) and asyncio.Runner(loop_factory=erlang.new_event_loop), do not need it and work on every supported version from 3.9 to 3.14+. The legacy erlang.install() helper now raises a RuntimeError on 3.14+ with a small message pointing at those alternatives. On 3.12-3.13 it still works, with a DeprecationWarning you can suppress per call via erlang.install(silent=True) if you knowingly stick to the legacy pattern.

Review-driven fixes

A second round of review surfaced four issues that have all been addressed:

  • A potential use-after-free on shutdown timeout: when a worker pthread was stuck in long-running Python, the 30-second join would fail, the helper returned, and the BEAM later freed the resource memory under the still-running thread. We now pin the resource via enif_keep_resource on the leak path so its memory survives the stuck thread; the callback pipes are kept open with it for the same reason. We also dropped the CTX_REQ_SHUTDOWN sentinel (which could be orphaned in the queue if the worker exited mid-loop) in favor of a direct pthread_cond_broadcast.
  • Stale {py_result, _, _} messages were piling up on the context process whenever wait_for_async_result/2 timed out (the worker can still finish later and deliver the result). A small drain now runs before the matching receive, keeping the mailbox bounded. Pinned by a new CT case that injects a fake stale result and checks the mailbox is empty after the next call.
  • ctx_request_create did not check the returns of pthread_mutex_init, pthread_cond_init and enif_alloc_env. Under resource pressure that could segfault on the first enif_make_copy of a NULL request env. The function now rolls back partial state on any init failure; all 14 call sites already test the result.
  • A small bit of dead code went out: the _ErlangChildWatcher class on the Python side, which was never instantiated, and a couple of stale "for now" comments on the SharedDict path that no longer reflected the design.

Long-running env calls

The env-bearing call, eval, and exec paths used a synchronous dispatch with a 30-second pthread_cond_timedwait deadline. Long Python work would return {error, worker_timeout} while the worker kept going, and the eventual result would be lost. The non-env paths had already moved to async dispatch, so the env variants just had to be added on top. Three new NIFs wrap the existing async dispatch with the local env threaded through, and the Erlang side falls back to the blocking NIF only when the worker thread isn't available. The worker_timeout error is no longer reachable on the env path.

Tests and verification

The test suite now has dedicated coverage for thread affinity, the asyncio policy gate on 3.14+, the stale-result drain, and the env-async dispatch path. On Python 3.14 the final tally is 529 passed, 5 user-skipped, 0 failed, with py_async_e2e_SUITE and py_udp_e2e_SUITE both at 10/10 and no deprecation warnings during init.

The user-facing docs (README, getting-started, scalability, migration, owngil_internals, process-bound-envs, reactor, preload, testing-free-threading, asyncio) have been refreshed to remove dead snippets and stale mode names. The erlang.sleep behavior table has been rewritten to match the v3.0 worker-pthread architecture rather than the older "dirty scheduler released in all sync contexts" framing. The migration guide intentionally keeps the historical references, so users coming from v1.8 or v2.0 can still find their way.

Net delta versus v2.3.1: about 1,000 lines removed.

benoitc added 17 commits April 8, 2026 18:17
Breaking changes for v3.0.0:
- py:execution_mode/0 now returns worker | owngil only
- Removed py:num_executors/0 (each context has own worker thread)
- Removed multi_executor pool and dead dispatch code

Architecture changes:
- Per-context worker threads with stable thread affinity
- Async NIF dispatch with message passing (no dirty scheduler blocking)
- Request queue per context (replaces single-slot pattern)

Fixes numpy/torch/tensorflow thread-local state issues.
Async API:
- Reimplement py:async_gather/1 over py_event_loop (concurrent submit,
  sequential await); add async_gather/2 with explicit timeout.
- Remove py:async_stream/3,4 from public API (was always returning
  {error, stream_not_implemented}).
- Fix py:async_call/3,4 + async_await/1,2 round-trip: use
  py_event_loop:create_task and py_event_loop:await directly so the
  receive matches the {async_result, _, _} message the NIF sends.
- Delete the legacy py_async_pool gen_server (the async_call call site
  now goes through py_event_loop directly).

C-side dead code:
- Remove multi_executor_* functions, g_executors[], multi_executor_thread_main,
  multi_executor_enqueue, MAX/MIN_EXECUTORS macros and the executor_t struct.
- Remove the PY_MODE_MULTI_EXECUTOR case in executor_enqueue and the unused
  tl_current_req_* thread-locals from the legacy compatibility layer.
- Remove unused shared_dict_down (resource type was registered without
  process monitoring).
- Remove context_dispatch_call/eval/exec (declared but never called).
- Drop executor_id field from py_worker_t and py_context_t.

Internal mode enum collapse:
- py_execution_mode_t reduced to {PY_MODE_FREE_THREADED, PY_MODE_GIL}.
  Both retired values (PY_MODE_SUBINTERP, PY_MODE_MULTI_EXECUTOR) had
  identical behavior in every switch.
- Replace runtime numpy-cache check with precise compile-time gate
  (#if defined(HAVE_SUBINTERPRETERS) && !defined(HAVE_FREE_THREADED));
  preserves today's truth table on every supported build.
- nif_execution_mode now returns free_threaded | gil; update spec on
  py_nif:execution_mode/0.

Configuration cleanup:
- Remove num_executors / num_async_workers env keys (no-ops post-rework).
- Reset .app.src env to [] (the prior keys were never read).

Doc + example sync:
- README.md: drop SHARED_GIL bullet; py:execution_mode/0 returns
  worker | owngil; configuration block uses num_contexts / context_mode
  / max_concurrent.
- docs/getting-started.md, scalability.md, migration.md, owngil_internals.md,
  preload.md, process-bound-envs.md, reactor.md, testing-free-threading.md:
  drop subinterp as a context mode; correct py_event_loop:run signature.
- src/py.erl, src/py_context.erl, src/py_context_router.erl,
  src/py_context_sup.erl, src/py_reactor_context.erl: drop subinterp
  from @doc mode lists.
- c_src/py_nif.c nif_context_create header doc: same scrub.
- examples/bench_owngil.erl: rename labels (worker baseline vs OWN_GIL),
  gate on py_nif:owngil_supported (3.14+).
- examples/bench_reactor_modes.erl: convert subinterp arm to OWN_GIL
  (run_reactor_owngil_bench, Reactor/OG legend).
- examples/reactor_subinterp_example.erl: deleted (reactor_owngil_example
  covers OWN_GIL; reactor_echo covers worker).
- examples/benchmark.erl: replace py:num_executors() with
  py_context_router:num_contexts().

Tests:
- New test/py_thread_affinity_SUITE.erl asserts:
  exec/eval/call on one context share threading.get_native_id;
  N processes targeting one context converge on its worker thread;
  distinct contexts use distinct threads;
  same invariants hold under owngil mode.
  Replaces 8 untracked diagnostic escripts (deleted).
- test/py_SUITE.erl: tighten test_asyncio_call and test_asyncio_gather
  to assert real success values instead of swallowing failures.

CHANGELOG entries describe each breaking change and added behavior.

30 files changed, +244/-1199 LOC, plus the new affinity suite.
py_context:new(#{mode => owngil}) covers the same OWN_GIL parallelism
with OTP supervision and automatic cleanup, so the older handle-based
surface (subinterp_create/destroy/call/eval/exec/cast/async_call/await
+ subinterp_pool_*) is redundant. Drop the public API, the wrapper NIFs,
and the dead reactor example that called nonexistent subinterp_reactor_*
functions. Internal subinterp_thread_pool_* (used by py_event_loop_pool's
OWN_GIL backend) and the capability probe py:subinterp_supported/0 stay.

-1390/+27 LOC across 8 files; py_subinterp_SUITE deleted.
- inspiration.md: 'auto' mode -> 'worker'; py_pool:start_link -> py_context_router:start_pool
- asyncio.md: paired async_call with async_await (not await)
- py.erl: bind_context lives in py_context_router, not py
- reactor_echo example: 'auto' mode -> 'worker'
- bench_async_task header: spawn -> spawn_task
- .app.src: drop stale py_pool from registered list
py_context already validates worker | owngil via pattern matching, but
py_reactor_context calls py_nif:context_create/1 directly, which silently
mapped any non-owngil atom to worker. That hid genuine misuse — including
the auto atom that lingered in a doc snippet and one CT test.

Tighten nif_context_create to return {error, {invalid_mode, Atom}} for
anything other than worker | owngil, and update the one test that relied
on the silent-accept behavior.
- py_actor_SUITE: prefix unused Pid variable with underscore
- py_async_task_SUITE: match +0.0 instead of 0.0 (OTP 27+ no longer
  matches -0.0 to 0.0)
- py_reentrant_SUITE, py_pid_send_SUITE: replace deprecated
  code:lib_dir/2 with filename:join(code:lib_dir/1, "test")
3.14 deprecated the call; 3.16 removes it. Our run path
(erlang.run / asyncio.Runner with loop_factory=) doesn't need the
global policy — it was only convenience for bare asyncio.run()
inside py:exec. Gate both Erlang-side helpers and the Python
erlang.install() entry point on sys.version_info < (3, 14); the
latter now raises with a migration message on 3.14+.

Behavior on Python 3.9-3.13 is unchanged. Adds
test/py_asyncio_policy_SUITE.erl (7 self-skipping cases) pinning the
gate, the run-path-without-policy invariant, and a sentinel asserting
no set_event_loop_policy DeprecationWarning fires on 3.14+ init.
EDoc expects code-quote spans as `text', not `text`; the markdown-style
backticks I added in the v3.14 deprecation note crashed edoc_doclet_chunks.
The handle-level entry points (subinterp_thread_handle_create/destroy,
subinterp_thread_call/eval/exec/cast/async_call) along with the
py_subinterp_handle_t struct and PY_SUBINTERP_HANDLE_RESOURCE_TYPE
became unreachable when the public py:subinterp_* API was removed.
Only the OWN_GIL session backend (subinterp_thread_pool_*) is still
used by py_event_loop_pool, so keep that path and drop the rest.
has_result and is_error in suspended_state_t / suspended_context_state_t
were declared volatile, which doesn't guarantee atomicity or memory
ordering on weakly-ordered architectures (ARM). Upgrade to _Atomic bool
so the read in the resumer thread sees the write from the callback
thread without relying on incidental sequencing.
If a worker pthread is stuck in long-running Python, the 30s shutdown
join times out, ctx->leaked is set, and the helper returns. Today the
BEAM later runs context_destructor (which sees ctx->destroyed and
returns early) and frees the resource memory — under a thread that is
still dereferencing it. Call enif_keep_resource(ctx) on the leak path
so the refcount stays above zero forever and the memory survives the
stuck thread.

Replace the CTX_REQ_SHUTDOWN sentinel enqueue with a direct
pthread_cond_broadcast under the queue mutex. The sentinel could be
orphaned (worker mid-request returns to top of loop, sees
shutdown_requested, exits without dequeuing) and now allocations can
fail more often after ctx_request_create() honors init failures.
The broadcast wakes any parked worker so it observes the predicate
and exits cleanly.

Gate the callback_pipe close in nif_context_destroy on !ctx->leaked.
A leaked pthread inside Python that triggers erlang.call() reads from
callback_pipe[0] and writes to callback_pipe[1]; closing those fds
lets the kernel reissue the numbers to unrelated files and silently
corrupt them. The leaked pipes stay alive with the pinned resource
until VM exit.
wait_for_async_result/2 returns {error, async_timeout} after 5
minutes, but the C worker can still finish later and deliver the
result. Without a drain those messages would accumulate on the
context process's mailbox forever. Drain stale ids before the
matching receive — safe because the context process is the sole
receiver and only one wait is in flight at a time.

New test/py_context_async_drain_SUITE.erl pins the behavior by
injecting a stale {py_result, FakeRef, _} into Ctx's mailbox and
asserting it's gone after the next async-dispatched call.
pthread_mutex_init, pthread_cond_init, and enif_alloc_env can each
fail under resource pressure. Today their returns are ignored, so
mutex/cond init failures leave the request in an unusable state and
a NULL request_env causes the next enif_make_copy to segfault.

Check each return and free what was allocated on partial failure.
All 14 callers already test the result for NULL.
_ErlangChildWatcher in priv/_erlang_impl/_policy.py was never
instantiated (the actual watcher is asyncio.ThreadedChildWatcher /
SafeChildWatcher); delete the class along with its TODO.

Replace "process monitoring disabled for now to debug crash" /
"Using simple resource type without process monitoring for now" in
the SharedDict path with positive-form notes describing the actual
GC-scoped lifecycle.
dispatch_to_worker_thread_impl blocked on a 30s pthread_cond_timedwait
in the env path. ML inference and other long-running calls returned
{error, worker_timeout} while the worker kept going. Add three
async-with-env NIFs that wrap the existing dispatch_to_worker_thread_async
(which already takes a local_env), and rewire handle_*_with_suspension_and_env
plus the {exec,_,_,_,EnvRef} loop arm to async-first / sync-fallback.

The Erlang side waits in wait_for_async_result/2, which has the
stale-result drain from ef31f76, so the env path now matches the
non-env path's behavior.

Adds test/py_context_async_env_SUITE (2 cases).
The asyncio.md and migration.md tables claimed sync erlang.sleep()
always releases the dirty scheduler. In v3.0 the dirty scheduler isn't
held during sync calls anyway (async NIF dispatch returns immediately);
what blocks is the worker pthread for py:call, the Erlang process for
py:exec/py:eval, or no thread at all for awaited async sleep. Update
the table and the sleep() docstring accordingly.

erlang.install() emits a DeprecationWarning on 3.12-3.13 by design,
but users who knowingly use the legacy pattern had no clean local
opt-out. Add a keyword-only silent=False; passing silent=True
suppresses the warning without disabling DeprecationWarning globally.
The 3.14+ RuntimeError stays unconditional.
@benoitc benoitc merged commit 2075f2d into main May 1, 2026
14 checks passed