🚀 The feature, motivation and pitch
Status
Draft.
Summary
Add EAGLE-3 speculative decoding to the ExecuTorch LLM runtime as a new token generation strategy under TextLLMRunner. Do not duplicate tokenization, prompt prefill, callbacks, stats, or runner lifecycle code.
P0 targets:
- text-only decoder models
- Gemma4 31B support
- batch size 1
- chain speculation with fixed num_speculative_tokens
- greedy verification and non-greedy target-compatible rejection sampling
- one target .pte and one EAGLE-3 draft .pte
- CUDA and MLX backend support
- validated target KV strategy so rejected tokens cannot affect future generation
- a draft-method interface that can later support MTP
P1 starts with batch size greater than 1. Tree speculation, continuous batching, MTP implementation, and backend-specific optimizations are later extensions.
Motivation
Current ExecuTorch LLM generation performs one target decode per emitted token in the common path. EAGLE-3 can reduce target invocations by using a smaller draft model to propose several future tokens, then verifying them with the target model in one window.
The design should fit the existing ExecuTorch runner stack and keep speculation as a generation policy, not a second LLM runner.
Prior Art
vLLM is the closest P0 reference: simple EAGLE-3 chain speculation, separate draft model, and target auxiliary hidden states from selected layers.
SGLang is the reference for full top-k tree speculation and advanced KV movement. That is intentionally not P0.
llama.cpp is useful architecturally: speculation is a generation-time policy with multiple proposal sources. Its EAGLE-3 path is a stub, but its MTP path shows why the runtime should separate proposal generation from target verification and KV commit.
Goals
- Reuse TextLLMRunner and the existing public runner shape.
- Add a small generation-strategy seam below the runner.
- Support EAGLE-3 P0 without baking EAGLE-3 assumptions into the whole speculative runtime.
- Preserve exact target-model semantics for greedy and supported non-greedy sampling.
- Make target KV correctness explicit.
- Leave a clear path for MTP.
Non-Goals
- Batch size greater than 1 in P0.
- Tree speculation in P0.
- Continuous batching in P0.
- Implementing MTP in P0.
- Multimodal speculative decoding.
- Model families beyond Gemma4 31B on day one.
- Backends other than CUDA and MLX on day one.
Key Design Choices
1. Do not duplicate the LLM runner
Keep TextLLMRunner responsible for:
- tokenizer ownership
- prompt prefill
- callbacks
- stats
- stop handling
- metadata
- lifecycle
Refactor token generation behind a small TokenGenerator interface. The current TextTokenGenerator remains the default implementation. Speculative decoding adds SpeculativeTokenGenerator.
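A minimal sketch of what the TokenGenerator seam could look like. The method signature, the DecodeFn callable, and the OneTokenPerStepGenerator stand-in are assumptions made for illustration; they are not the existing ExecuTorch API, and the real TextTokenGenerator has its own shape.

```cpp
#include <cstdint>
#include <functional>
#include <utility>

// Hypothetical seam: name and signature are illustrative, not the ExecuTorch API.
class TokenGenerator {
 public:
  virtual ~TokenGenerator() = default;
  // Emits up to max_new_tokens tokens after `last_token`, invoking on_token
  // for each one. Returns the number of tokens emitted.
  virtual int64_t generate(
      uint64_t last_token,
      int64_t start_pos,
      int64_t max_new_tokens,
      const std::function<void(uint64_t)>& on_token) = 0;
};

// Stand-in for the default one-decode-per-token path. DecodeFn is a
// hypothetical callable wrapping a single target decode.
class OneTokenPerStepGenerator : public TokenGenerator {
 public:
  using DecodeFn = std::function<uint64_t(uint64_t token, int64_t pos)>;

  OneTokenPerStepGenerator(DecodeFn decode, uint64_t eos_token)
      : decode_(std::move(decode)), eos_token_(eos_token) {}

  int64_t generate(
      uint64_t last_token,
      int64_t start_pos,
      int64_t max_new_tokens,
      const std::function<void(uint64_t)>& on_token) override {
    int64_t emitted = 0;
    uint64_t current = last_token;
    while (emitted < max_new_tokens) {
      current = decode_(current, start_pos + emitted);
      on_token(current);
      ++emitted;
      if (current == eos_token_) {
        break;  // The EOS token is emitted, then generation stops.
      }
    }
    return emitted;
  }

 private:
  DecodeFn decode_;
  uint64_t eos_token_;
};
```

With a seam like this, a speculative implementation could emit several tokens per internal step while the runner keeps seeing the same per-token callback.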
2. Use a method-neutral draft interface
The speculative loop should not know whether proposals come from EAGLE-3, MTP, or another draft method.
Introduce a SpeculativeDraftRunner abstraction with responsibilities:
- load draft resources
- propose up to num_speculative_tokens
- return draft probabilities needed for sampling correction
- accept or roll back draft state
- declare required target hidden states
P0 implements Eagle3DraftRunner. A future MtpDraftRunner should plug into the same interface without changing TextLLMRunner, target verification, sampling correction, KV commit, or stats.
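The responsibilities listed above might map onto an interface along these lines. Every method name, the DraftProposal struct, and the toy FixedDraftRunner are hypothetical; in particular, returning full per-position distributions is only one of the options raised under Open Questions.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical proposal payload.
struct DraftProposal {
  std::vector<uint64_t> tokens;           // at most the requested token count
  std::vector<std::vector<float>> probs;  // draft distribution per position
};

// Hypothetical method-neutral draft interface.
class SpeculativeDraftRunner {
 public:
  virtual ~SpeculativeDraftRunner() = default;
  virtual bool load() = 0;
  // Names of the target hidden-state outputs this draft method needs.
  virtual std::vector<std::string> required_hidden_states() const = 0;
  virtual DraftProposal propose(
      uint64_t last_token,
      int64_t pos,
      const std::vector<float>& target_hidden,
      int32_t max_tokens) = 0;
  // Keeps draft state for the first num_accepted proposals, drops the rest.
  virtual void finish_step(int32_t num_accepted) = 0;
};

// Toy implementation used only to exercise the interface.
class FixedDraftRunner : public SpeculativeDraftRunner {
 public:
  bool load() override { return true; }
  std::vector<std::string> required_hidden_states() const override {
    return {};
  }
  DraftProposal propose(
      uint64_t last_token,
      int64_t /*pos*/,
      const std::vector<float>& /*target_hidden*/,
      int32_t max_tokens) override {
    DraftProposal proposal;
    for (int32_t i = 0; i < max_tokens; ++i) {
      proposal.tokens.push_back(last_token + 1 + i);
    }
    return proposal;  // probs left empty in this toy
  }
  void finish_step(int32_t num_accepted) override {
    committed_tokens_ += num_accepted;
  }
  int64_t committed_tokens() const { return committed_tokens_; }

 private:
  int64_t committed_tokens_ = 0;
};
```

The speculative loop would depend only on the base class, so an MTP-style runner could be swapped in without touching verification or KV commit.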
3. Keep target hidden-state export general
EAGLE-3 needs selected target layer hidden states, commonly:
(2, n_layers // 2, n_layers - 3)
MTP typically needs the final pre-norm hidden state. The target export should therefore expose a general hidden-state descriptor rather than EAGLE-3-only output names.
Hidden-state selection is fixed at export time. Runtime config should not choose arbitrary layer IDs unless the exported target contains them. The target .pte should carry metadata describing available hidden-state outputs, and the draft runner should validate that metadata at load time.
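As an illustration of load-time validation, the sketch below checks that every hidden state a draft runner requires is present in the target metadata. HiddenStateDescriptor, its fields, and both helper functions are assumptions; the actual descriptor format is an open question in this proposal.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical descriptor for one exported hidden-state output.
struct HiddenStateDescriptor {
  int32_t layer_index;  // target layer the state is taken from
  int32_t hidden_dim;   // width of the exported state
};

// The commonly used EAGLE-3 selection: (2, n_layers / 2, n_layers - 3).
std::vector<int32_t> common_eagle3_layers(int32_t n_layers) {
  return {2, n_layers / 2, n_layers - 3};
}

// True only if every required descriptor has an exact match in `available`.
bool target_provides(
    const std::vector<HiddenStateDescriptor>& available,
    const std::vector<HiddenStateDescriptor>& required) {
  for (const HiddenStateDescriptor& need : required) {
    const bool found = std::any_of(
        available.begin(),
        available.end(),
        [&](const HiddenStateDescriptor& have) {
          return have.layer_index == need.layer_index &&
              have.hidden_dim == need.hidden_dim;
        });
    if (!found) {
      return false;
    }
  }
  return true;
}
```

A draft runner could call a check like this during load and reject the speculative config on mismatch.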
P0 target export should use multiple methods in the same target .pte:
target_decode(token, position) -> last_position_logits, requested_hidden_states
target_verify(tokens[K], positions[K]) -> logits[K], requested_hidden_states[K]
target_prefill(tokens[N], positions[N]) -> last_position_logits, requested_hidden_states_for_last_token
target_verify needs logits and hidden states for all K positions. After accepting tokens and selecting the fallback or bonus token, the runtime should select the corresponding hidden state from the verify output for the next draft step, avoiding an extra target decode. target_decode is only needed when speculation is disabled or when speculative generation intentionally falls back to target-only decoding. target_prefill bootstraps the first speculative draft from the final prompt token.
KV handling is not part of the conceptual return signature. In position-overwrite mode, KV writes happen in-place during forward. In explicit scratch/commit mode, SpecKVCacheManager owns scratch buffers and commit side effects.
target_verify should use generate_full_logits or an equivalent full-logits output. target_decode and target_prefill should stay last-logit oriented for efficiency. Always using full logits for all methods is simpler but should not be the P0 default because it wastes decode work; a third .pte for verify is unnecessary unless multimethod export proves insufficient.
4. Use separate target and draft .pte files in P0
P0 loads one Gemma4 31B target .pte and one matching EAGLE-3 draft .pte. The draft .pte should be produced by a dedicated EAGLE-3 draft export path or config, because its input signature differs from a standard decoder.
The EAGLE-3 draft model consumes:
- last token
- position
- target auxiliary hidden state
- draft KV state
It returns draft logits and updated draft state. For non-greedy P0, the runtime must have enough draft probability information to perform rejection sampling.
Speculative decoding is opt-in because the draft model and draft KV cache increase memory use. P0 validation should report target memory, draft model memory, draft KV memory, and peak memory on CUDA and MLX.
For planning, assume the EAGLE-3 draft head is much smaller than the 31B target but still material: on the order of 100M-300M parameters until the Gemma4 31B artifact is measured. P0 should record the actual exported draft parameter count and memory footprint; deployments should expect draft weights plus draft KV to be the main incremental memory cost.
5. Define the target KV strategy
Target verification must not let rejected tokens affect future generation.
P0 can use either path, depending on the exported cache style:
- Position-overwrite for position-indexed implicit KV: verify writes positions [pos, pos + K), then roll the logical position forward only by the accepted count. Stale entries after the accepted prefix are harmless if masks only attend to positions before the current logical position, and later verification overwrites them before attention can read them.
- Explicit scratch/commit: verify into scratch KV/update buffers, then commit only accepted positions.
The runtime invariant is:
committed target KV is valid through the last emitted token
rejected KV entries are either unreachable by mask or never committed
P0 must pick and validate the strategy per target export/backend combination. If a cache implementation cannot guarantee this invariant, speculative mode must be rejected for that artifact.
For fused cache-update attention paths such as sdpa_with_kv_cache, position-overwrite is valid only if the forward writes new K/V before attention reads the cache. A non-fused path that reads cache before updating it can leak stale rejected entries and must use explicit scratch/commit or be rejected.
SpecKVCacheManager should own this policy decision and the logical position bookkeeping. In position-overwrite mode, it tracks committed position and validates the mask/update ordering invariant. In explicit mode, it owns scratch buffers and commits accepted updates.
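The position bookkeeping for position-overwrite mode can be sketched as follows. PositionOverwriteTracker and its methods are hypothetical names, and the sketch covers only the logical position; it does not validate mask or update ordering, which depends on the exported attention path.

```cpp
#include <cstdint>

// Hypothetical bookkeeping for position-overwrite mode.
class PositionOverwriteTracker {
 public:
  explicit PositionOverwriteTracker(int64_t max_len) : max_len_(max_len) {}

  // Number of positions whose target KV is committed.
  int64_t committed() const { return committed_; }

  // Reserves [committed, committed + k) for a verify pass.
  // Returns false if the window is empty or does not fit in the cache.
  bool begin_verify(int32_t k) {
    if (k <= 0 || committed_ + k > max_len_) {
      return false;
    }
    window_ = k;
    return true;
  }

  // Keeps the first `kept` positions of the open window. The remaining
  // entries stay in memory but sit at or beyond committed(), so the next
  // verify window overwrites them before any query can attend to them.
  bool commit(int32_t kept) {
    if (kept < 0 || kept > window_) {
      return false;
    }
    committed_ += kept;
    window_ = 0;
    return true;
  }

  bool is_committed(int64_t pos) const {
    return pos >= 0 && pos < committed_;
  }

 private:
  int64_t max_len_;
  int64_t committed_ = 0;
  int32_t window_ = 0;
};
```

An explicit scratch/commit variant would keep the same external contract but copy accepted entries out of scratch buffers in commit().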
6. Support greedy and non-greedy semantics in P0
Greedy acceptance:
accept draft_token[i] iff draft_token[i] == argmax(target_logits[i])
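A small sketch of the greedy rule, assuming target_logits[i] is the non-empty target logit vector against which draft_tokens[i] is checked. The function names are illustrative.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <vector>

// Index of the largest logit. Assumes `logits` is non-empty.
int64_t argmax(const std::vector<float>& logits) {
  return std::distance(
      logits.begin(), std::max_element(logits.begin(), logits.end()));
}

// Returns the length of the accepted prefix of draft_tokens.
// Acceptance stops at the first position where the draft token differs
// from the target argmax.
std::size_t greedy_accept(
    const std::vector<int64_t>& draft_tokens,
    const std::vector<std::vector<float>>& target_logits) {
  std::size_t accepted = 0;
  while (accepted < draft_tokens.size() &&
         accepted < target_logits.size() &&
         draft_tokens[accepted] == argmax(target_logits[accepted])) {
    ++accepted;
  }
  return accepted;
}
```

At the first mismatch, the target argmax at that position is the natural fallback token, which keeps the output identical to target-only greedy decoding.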
Non-greedy acceptance should use standard speculative decoding correction:
accept x with probability min(1, p_target(x) / p_draft(x))
on rejection, sample from normalize(max(p_target - p_draft, 0))
on full acceptance, sample a bonus token from the final target distribution when available
Only support logit processors that can be replayed consistently across speculative positions. Unsupported processors should fail clearly or intentionally fall back.
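The per-token correction rule above can be sketched as below. It assumes p_target and p_draft are full, already-normalized distributions over the same vocabulary after any logit processing; Verdict, residual, and verify_token are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

struct Verdict {
  std::size_t token;  // token to emit at this position
  bool accepted;      // true if the draft token was kept
};

// normalize(max(p_target - p_draft, 0)).
std::vector<double> residual(
    const std::vector<double>& p_target,
    const std::vector<double>& p_draft) {
  std::vector<double> r(p_target.size());
  double sum = 0.0;
  for (std::size_t i = 0; i < r.size(); ++i) {
    r[i] = std::max(p_target[i] - p_draft[i], 0.0);
    sum += r[i];
  }
  if (sum <= 0.0) {
    return p_target;  // identical distributions: rejection cannot occur
  }
  for (double& v : r) {
    v /= sum;
  }
  return r;
}

// Accepts draft token x with probability min(1, p_target(x) / p_draft(x));
// on rejection, samples a replacement from the residual distribution.
Verdict verify_token(
    std::size_t x,
    const std::vector<double>& p_target,
    const std::vector<double>& p_draft,
    std::mt19937& rng) {
  // p_draft[x] should be positive because x was sampled from p_draft;
  // the guard only avoids dividing by zero.
  const double ratio =
      p_draft[x] > 0.0 ? std::min(1.0, p_target[x] / p_draft[x]) : 1.0;
  std::uniform_real_distribution<double> uniform(0.0, 1.0);
  if (uniform(rng) < ratio) {
    return {x, true};
  }
  const std::vector<double> r = residual(p_target, p_draft);
  std::discrete_distribution<std::size_t> pick(r.begin(), r.end());
  return {pick(rng), false};
}
```

A caller would apply this position by position and stop at the first rejection, since later draft tokens were conditioned on the rejected one.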
7. Target CUDA and MLX first
CUDA and MLX are P0 backends. The runtime interfaces should stay backend-neutral, but validation and performance work should prove the path on both backends before calling P0 complete.
8. Target Gemma4 31B first
Gemma4 31B is the P0 model target. The implementation should avoid Gemma-specific runtime assumptions where practical, but P0 validation and performance work should focus on this model.
Runtime Flow
Before the first generation step:
- Prefill returns the hidden state for the final prompt token to bootstrap the first draft.
Then, for each generation step:
- Draft proposes K tokens from the last emitted token and required target hidden state.
- Target verifies the K proposed tokens and returns logits for all K positions.
- Runtime applies greedy or non-greedy acceptance.
- Accepted tokens are emitted.
- A fallback or bonus token is emitted when needed.
- Target KV advances only to the emitted-token position.
- Draft state accepts or rolls back to match the committed sequence.
- Required target hidden state for the next draft step is selected from the verify output.
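The emission part of a step needs care because several tokens can be emitted at once. The sketch below shows one way to truncate a step at EOS or at the remaining token budget; StepResult and emit_step are hypothetical names, and extra_token stands for the fallback or bonus token.

```cpp
#include <cstdint>
#include <vector>

struct StepResult {
  std::vector<int64_t> emitted;  // tokens actually emitted this step
  bool finished;                 // true if generation should stop
};

// Emits accepted tokens followed by the fallback or bonus token, stopping
// early at EOS or when the remaining token budget is used up.
StepResult emit_step(
    const std::vector<int64_t>& accepted,
    int64_t extra_token,
    int64_t eos_token,
    int64_t remaining_budget) {
  StepResult result{{}, false};
  std::vector<int64_t> candidates = accepted;
  candidates.push_back(extra_token);
  for (int64_t token : candidates) {
    if (static_cast<int64_t>(result.emitted.size()) >= remaining_budget) {
      break;
    }
    result.emitted.push_back(token);
    if (token == eos_token) {
      result.finished = true;  // tokens after EOS are never emitted
      return result;
    }
  }
  result.finished =
      static_cast<int64_t>(result.emitted.size()) >= remaining_budget;
  return result;
}
```

The size of emitted, not the number of accepted draft tokens, is what target KV and draft state would then be reconciled against.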
Configuration Sketch
    struct SpeculativeConfig {
      bool enabled = false;
      SpeculativeMethod method = SpeculativeMethod::Eagle3;
      std::string draft_model_path;
      std::vector<std::string> draft_data_files;
      int32_t num_speculative_tokens = 3;
      SamplingMode sampling_mode = SamplingMode::TargetCompatible;
    };
Target hidden-state IDs should come from exported model metadata, not from runtime-only configuration.
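One possible shape for early config checks is sketched below. The enum definitions, the validate function, and its error strings are assumptions; whether an unsupported config should fail or fall back is left open in this proposal.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical enums; only the values named in the sketch are shown.
enum class SpeculativeMethod { Eagle3 };
enum class SamplingMode { TargetCompatible };

struct SpeculativeConfig {
  bool enabled = false;
  SpeculativeMethod method = SpeculativeMethod::Eagle3;
  std::string draft_model_path;
  std::vector<std::string> draft_data_files;
  int32_t num_speculative_tokens = 3;
  SamplingMode sampling_mode = SamplingMode::TargetCompatible;
};

// Returns an empty string if the config is usable, otherwise a reason.
std::string validate(const SpeculativeConfig& config, int32_t batch_size) {
  if (!config.enabled) {
    return "";  // nothing to check when speculation is off
  }
  if (config.draft_model_path.empty()) {
    return "speculative decoding requires draft_model_path";
  }
  if (config.num_speculative_tokens < 1) {
    return "num_speculative_tokens must be at least 1";
  }
  if (batch_size != 1) {
    return "P0 speculative decoding supports batch size 1 only";
  }
  return "";
}
```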
Validation
Correctness:
- Greedy speculative output exactly matches greedy target-only output.
- Reject, accept, and mixed accept/reject cases do not corrupt target KV.
- Position-overwrite KV mode proves stale rejected entries are unreachable, or the target uses explicit scratch/commit.
- Gemma4 31B export on CUDA and MLX verifies cache update happens before attention reads when using position-overwrite.
- EOS and max_new_tokens are handled correctly across multi-token emission.
- Seeded non-greedy tests validate target-compatible sampling correction.
- Unsupported logit processors fail clearly or fall back intentionally.
Runtime:
- Draft load failure reports a clear error.
- Missing target hidden-state methods reject speculative config.
- Hidden-state metadata mismatch between target and draft rejects speculative config.
- Target verify method returns full logits and requested hidden states for all verification positions.
- Target decode and prefill methods avoid full-logits output unless needed.
- Prefill returns the hidden state needed to bootstrap the first draft.
- Gemma4 31B target and draft artifacts load and run on both P0 backends.
- Batch size greater than 1 rejects or falls back in P0.
- CUDA and MLX both pass the P0 correctness suite.
Performance:
- Report draft calls, target verify calls, target decode calls, accepted tokens, rejected tokens, and acceptance rate.
- Report target memory, draft parameter count, draft model memory, draft KV memory, and peak memory.
- Compare tokens/sec against target-only generation on CUDA and MLX.
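The counters in the first performance item could be collected in a small struct like the one below. SpeculativeStats and record_step are hypothetical, and two assumptions are baked in: chain drafting runs one draft forward pass per proposed token, and every proposed token that is not accepted counts as rejected.

```cpp
#include <cstdint>

// Hypothetical counters for speculative generation.
struct SpeculativeStats {
  int64_t draft_calls = 0;
  int64_t target_verify_calls = 0;
  int64_t target_decode_calls = 0;
  int64_t accepted_tokens = 0;
  int64_t rejected_tokens = 0;

  // Records one draft-plus-verify step.
  // Assumes one draft forward pass per proposed token (chain drafting).
  void record_step(int64_t proposed, int64_t accepted) {
    draft_calls += proposed;
    ++target_verify_calls;
    accepted_tokens += accepted;
    rejected_tokens += proposed - accepted;
  }

  // Fraction of proposed tokens that were accepted; 0 when nothing was
  // proposed yet.
  double acceptance_rate() const {
    const int64_t total = accepted_tokens + rejected_tokens;
    return total == 0 ? 0.0 : static_cast<double>(accepted_tokens) / total;
  }
};
```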
Rollout
- Add TokenGenerator; keep existing TextTokenGenerator behavior unchanged.
- Decide and validate the target KV strategy for Gemma4 31B on CUDA and MLX.
- Add target hidden-state export, prefill hidden-state bootstrap, and full-logits verify method.
- Add EAGLE-3 draft export.
- Add SpeculativeDraftRunner, Eagle3DraftRunner, Eagle3Verifier, SpecKVCacheManager, and SpeculativeTokenGenerator.
- Add greedy verification.
- Add non-greedy rejection sampling.
- Validate memory and performance for Gemma4 31B on CUDA.
- Validate memory and performance for Gemma4 31B on MLX.
- Prove the draft interface can support future MTP.
- Add batch size greater than 1 in P1.
Open Questions
- Should Gemma4 31B P0 use position-overwrite implicit KV, explicit scratch/commit KV, or support both?
- What hidden-state descriptor should be standardized for EAGLE-3 and MTP?
- What probability data should draft runners return: full logits, post-processor logits, or sampled-token probabilities?
- Should unsupported configs fall back or return InvalidArgument by default?
- Where should speculative stats live?
P0 Acceptance Criteria
- TextLLMRunner can be configured with an EAGLE-3 draft .pte.
- Gemma4 31B target and matching EAGLE-3 draft run on CUDA and MLX.
- Existing non-speculative generation is behaviorally unchanged.
- Greedy speculative output matches target-only greedy output.
- Non-greedy speculative sampling uses target-compatible rejection sampling.
- Rejected speculative tokens do not corrupt target KV.
- Target prefill bootstraps the first draft with the required hidden state.
- Target verify returns full logits and requested hidden states for all verified positions.
- Memory overhead is reported for the target, draft model, draft KV, and peak runtime.
- CUDA and MLX pass correctness and basic performance validation.
- The draft-runner interface can accommodate MTP without changing runner or verifier contracts.
- Acceptance stats are visible.