[EventPipe] Add Non-Lossy EventPipe Mode for gcdump#129457
Open
mdh1418 wants to merge 3 commits into
Open
Conversation
Introduce EventPipeBufferingMode (Drop/Block) and carry it as a per-session opt-in on EventPipeSessionOptions (default Drop), plumbed through enable() and ep_session_alloc onto the buffer manager. Block is wired up here but not yet acted on; producers start parking on full buffers in a later change. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
In Block mode a producer that finds the buffers full parks until the reader frees capacity and then retries, instead of dropping the event. The reader signals a per-buffer-manager auto-reset event on each capacity release; a parked producer clears the WRITE_BUFFER_IN_USE bit of its session_use_in_progress (keeping the session index) so the reader can drain its full buffer while the retained index pins the buffer manager. Teardown raises an abort flag and reuses the existing in-flight-write wait to release parked producers, so no separate parked-writer count is needed. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
enable()/ep_enable_3 only sets the session up and collects its provider-enable callbacks onto the session; ep_start_streaming is now the single site that invokes them, for every session. This is required for a blocking GCHeapSnapshot session: its provider-enable callback triggers a GC heap walk that parks producers when the buffer fills, so it must run only after the drain thread is live to consume - otherwise the walk parks with no reader and deadlocks. ep_start_streaming waits for the drain thread to reach its preemptive drain loop before invoking the callbacks; when streaming is parked for ep_finish_init at early startup it invokes inline, which is safe because the heap walk no-ops until the GC is fully initialized. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor
There was a problem hiding this comment.
Pull request overview
This PR extends native EventPipe to support an opt-in non-lossy (“blocking”) buffering mode intended for GC heap snapshot (gcdump) scenarios, and reworks provider-enable callback dispatch to avoid blocking producers before a drain thread exists.
Changes:
- Introduces
EventPipeBufferingMode(DROP/BLOCK) andEventPipeWriteEventResult(WRITTEN/DROPPED/BLOCKED), plumbing the buffering mode through session options to the buffer manager. - Adds a block-and-retry path in the write pipeline: writers may park on buffer exhaustion in BLOCK mode, and teardown can abort/wake parked writers.
- Defers provider enable-callback invocation to
ep_start_streaming, storing callbacks on the session until streaming begins.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| src/native/eventpipe/ep.h | Adds buffering_mode to EventPipeSessionOptions. |
| src/native/eventpipe/ep.c | Implements blocking retry loop, abort-on-disable, and deferred provider callback dispatch. |
| src/native/eventpipe/ep-types-forward.h | Adds new enums for buffering mode and write results. |
| src/native/eventpipe/ep-thread.h | Adds EP_SESSION_USE_WRITE_BUFFER_IN_USE bit for writer/reader coordination. |
| src/native/eventpipe/ep-thread.c | Adjusts assertions to account for the new session-use bit. |
| src/native/eventpipe/ep-session.h | Stores provider callback queue on session; changes ep_session_write_event return type. |
| src/native/eventpipe/ep-session.c | Plumbs buffering mode into buffer manager; updates inflight-wait logic; updates write path to return EventPipeWriteEventResult. |
| src/native/eventpipe/ep-buffer-manager.h | Adds BLOCK-mode fields/APIs; changes write API to return EventPipeWriteEventResult. |
| src/native/eventpipe/ep-buffer-manager.c | Implements BLOCK-mode signaling/aborting and returns BLOCKED on buffer exhaustion when appropriate. |
| src/mono/mono/eventpipe/test/ep-tests.c | Updates tests for new ep_session_write_event return type (currently has incorrect enum constant usages). |
| src/mono/mono/eventpipe/test/ep-buffer-manager-tests.c | Updates tests for new ep_buffer_manager_write_event return type (currently has incorrect enum constant usage). |
| ep_event_payload_fini (&payload); | ||
|
|
||
| ep_raise_error_if_nok (write_result == true); | ||
| ep_raise_error_if_nok (write_result == EP_WRITE_EVENT_RESULT_SUCCESS); |
| ep_event_payload_fini (&payload); | ||
|
|
||
| ep_raise_error_if_nok (write_result == true); | ||
| ep_raise_error_if_nok (write_result == EP_WRITE_EVENT_RESULT_SUCCESS); |
| ep_event_payload_fini (&payload); | ||
|
|
||
| ep_raise_error_if_nok (write_result == true); | ||
| ep_raise_error_if_nok (write_result == EP_WRITE_EVENT_RESULT_SUCCESS); |
| ep_event_payload_fini (&payload); | ||
|
|
||
| ep_raise_error_if_nok (write_result == true); | ||
| ep_raise_error_if_nok (write_result == EP_WRITE_EVENT_RESULT_SUCCESS); |
| ep_event_payload_fini (&payload); | ||
|
|
||
| ep_raise_error_if_nok (write_result == true); | ||
| ep_raise_error_if_nok (write_result == EP_WRITE_EVENT_RESULT_SUCCESS); |
| EventPipeEventPayload payload; | ||
| ep_event_payload_init (&payload, (uint8_t *)TEST_EVENT_DATA, ARRAY_SIZE (TEST_EVENT_DATA)); | ||
| result = ep_buffer_manager_write_event (buffer_manager, thread, session, ep_event, &payload, NULL, NULL, thread, NULL); | ||
| result = ep_buffer_manager_write_event (buffer_manager, thread, session, ep_event, &payload, NULL, NULL, thread, NULL) == EP_WRITE_EVENT_RESULT_SUCCESS; |
Comment on lines
+1075
to
+1079
| EventPipeProviderCallbackDataQueue *provider_callback_data_queue = ep_session_get_provider_callbacks (session); | ||
| if (provider_callback_data_queue == NULL) | ||
| return; | ||
|
|
||
| ep_session_set_provider_callbacks (session, NULL); |
This was referenced Jun 16, 2026
Open
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Runtime side to fix dotnet/diagnostics#2404
Will add the new IPC command and corresponding changes once the IPC Protocol spec is updated in the diagnostics repo.
EventPipe's buffer manager is lossy by design: when a producer's per-thread buffer is full and the circular pool is exhausted, the event is dropped rather than blocking the writer.
This corrupts dotnet-gcdump. A heap snapshot is rebuilt from the GCBulkNode/GCBulkEdge stream emitted during a stop-the-world heap walk; on a large heap that burst overruns the buffer, edges are dropped, and the graph cannot be reconstructed, as illustrated in the diagnostics issue.
This PR introduces a Blocking/Non-Lossy mode specifically for the gcdump enabled EventPipe session, so that producers may block until buffer space is available to write events instead of the standard drop behavior.
Parking producers instead of dropping
The non-lossy guarantee comes down to one change in the write path: when a producer in Block mode has filled its buffer and the pool cannot hand it another, instead of dropping the event it parks until the reader frees capacity and then retries the same write. The reader sets a Block-only buffer_available_event every time it returns a drained buffer to the pool, so a parked producer is woken exactly when space may have appeared; it re-attempts the write and only makes forward progress once the event actually lands.
Parking has to cooperate with the existing buffer hand-off, which is where a new pin flag comes in. Each producer advertises, in its per-thread session_use_in_progress slot, the index of the session it is writing to; teardown spins on that slot to keep the session and its buffer manager alive until every in-flight writer is done. We widen that slot with a high EP_SESSION_USE_WRITE_BUFFER_IN_USE bit meaning "I currently own a write buffer," and the reader drains a producer's buffer only once that bit is clear. A parked producer therefore clears the bit but keeps the index: clearing the bit lets the reader drain the very buffer the producer just filled — the only way capacity is ever freed — while keeping the index leaves the session pinned so it cannot be torn down from under the sleeping thread. On wake the producer re-sets the bit, reloads the session pointer (disable may have cleared it), and retries.
Blocking is only ever allowed when it is actually safe to wait for a reader: the session is in Block mode, it is not tearing down, and the caller is not the rundown thread. Teardown is the first escape hatch — disable raises an aborting flag and sets the capacity event, and the parked producer, which re-checks aborting immediately before and after each wait, gives up and drops rather than waiting on a reader that is going away. Rundown is the second — rundown events are emitted from the drain side during disable, so a rundown writer that blocked would be waiting on itself; it is excluded from blocking and falls back to the normal drop path.
Expressing these three outcomes cleanly is why the write path's old success/fail boolean became an EventPipeWriteEventResult enum — WRITTEN (the event landed), DROPPED (discarded: oversized, provider disabled, suspended, or a genuine Drop-mode overflow, all already tracked by sequence numbers), and BLOCKED (Block mode, buffer full, session still live — park and retry). Only BLOCKED drives the park loop; the boolean could not carry that third state without an out-parameter.
Deferring provider-enable callbacks
Enabling a session invokes each provider's enable callback. For a gcdump (GCHeapSnapshot) session that callback synchronously forces a full, stop-the-world heap walk that emits the entire GCBulkNode/GCBulkEdge stream — which in Block mode is exactly what fills the buffer and parks the producer, and the producer can only drain if a reader thread already exists. Originally the callbacks fired inline inside ep_enable_3, before streaming starts, so the walk ran during SuspendEE with no drainer in existence: the buffer filled, the heap-walk thread parked forever, and the target deadlocked (gcdump hung right after "Requesting a .NET Heap Dump," and killing it did not recover the process).
The fix makes ep_enable_3 only set the session up and moves callback dispatch to a single site in ep_start_streaming, after the drain thread exists. When a drain thread was started we first wait for it to be running (EP_YIELD_WHILE (!ep_session_has_started)) before dispatching, so a consumer is guaranteed to be in place before the walk can fill the buffer; that drain loop runs in preemptive GC mode, so it keeps draining even while the heap walk holds the EE suspended. (Startup sessions enabled before threads can run have no drain thread and dispatch inline as before — the heap walk no-ops until the GC is fully initialized, so it cannot deadlock there.)
Testing
With no client opt-in yet, Block was forced via a temporary, uncommitted DOTNET_EventPipeBufferingMode knob (not part of this PR)
Reproducing the loss — a target allocates k = 2,000,000 objects each with r = 64 references. The dense edges inflate GCBulkEdge volume during the heap walk without inflating object count, overrunning the session buffer; the gcdump session's circular buffer is pinned to 1 MB to force that overflow on a small, fast-to-dump heap (a production heap overflows the default buffer on its own). The signal is the reconstructed BlockModeValidationNode count.
gcdump (2M nodes, 1 MB buffer)
Drop: exit -1, RootIndex not set, no dump
Block: exit 0, dump written, report reconstructs 2M nodes