RFC: EAGLE-3 speculative decoding for ExecuTorch LLM runner

### 🚀 The feature, motivation and pitch

## Status

Draft.

## Summary

Add EAGLE-3 speculative decoding to the ExecuTorch LLM runtime as a new token generation strategy under `TextLLMRunner`, without duplicating tokenization, prompt prefill, callbacks, stats, or runner lifecycle code.

P0 targets:

- text-only decoder models
- Gemma4 31B support
- batch size 1
- chain speculation with fixed `num_speculative_tokens`
- greedy verification and non-greedy target-compatible rejection sampling
- one target `.pte` and one EAGLE-3 draft `.pte`
- CUDA and MLX backend support
- validated target KV strategy so rejected tokens cannot affect future generation
- a draft-method interface that can later support MTP

P1 starts with batch size greater than 1. Tree speculation, continuous batching, MTP implementation, and backend-specific optimizations are later extensions.

## Motivation

Current ExecuTorch LLM generation performs one target decode per emitted token in the common path. EAGLE-3 can reduce target invocations by using a smaller draft model to propose several future tokens, which the target model then verifies in a single forward pass over the proposed window.

The design should fit the existing ExecuTorch runner stack and keep speculation as a generation policy, not a second LLM runner.

## Prior Art

vLLM is the closest P0 reference: simple EAGLE-3 chain speculation, separate draft model, and target auxiliary hidden states from selected layers.

SGLang is the reference for full top-k tree speculation and advanced KV movement. That is intentionally not P0.

llama.cpp is useful architecturally: speculation is a generation-time policy with multiple proposal sources. Its EAGLE-3 path is a stub, but its MTP path shows why the runtime should separate proposal generation from target verification and KV commit.

## Goals

- Reuse `TextLLMRunner` and the existing public runner shape.
- Add a small generation-strategy seam below the runner.
- Support EAGLE-3 P0 without baking EAGLE-3 assumptions into the whole speculative runtime.
- Preserve exact target-model semantics for greedy and supported non-greedy sampling.
- Make target KV correctness explicit.
- Leave a clear path for MTP.

## Non-Goals

- Batch size greater than 1 in P0.
- Tree speculation in P0.
- Continuous batching in P0.
- Implementing MTP in P0.
- Multimodal speculative decoding.
- Model families beyond Gemma4 31B on day one.
- Backends other than CUDA and MLX on day one.

## Key Design Choices

### 1. Do not duplicate the LLM runner

Keep `TextLLMRunner` responsible for:

- tokenizer ownership
- prompt prefill
- callbacks
- stats
- stop handling
- metadata
- lifecycle

Refactor token generation behind a small `TokenGenerator` interface. The current `TextTokenGenerator` remains the default implementation. Speculative decoding adds `SpeculativeTokenGenerator`.
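
A minimal sketch of what this seam could look like. The `generate` signature, the `TokenCallback` alias, and the toy generator are assumptions made for illustration; they are not the existing `TextTokenGenerator` API.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical seam: names and signatures are illustrative only.
using TokenCallback = std::function<void(uint64_t)>;

class TokenGenerator {
 public:
  virtual ~TokenGenerator() = default;
  // Emits at most max_new_tokens tokens that follow last_token and
  // returns how many were emitted.
  virtual int64_t generate(uint64_t last_token, int64_t max_new_tokens,
                           const TokenCallback& on_token) = 0;
};

// Toy stand-in for the default one-token-per-step path. The "model" just
// returns last_token + 1 so the example runs without a .pte file.
class ToySequentialGenerator : public TokenGenerator {
 public:
  int64_t generate(uint64_t last_token, int64_t max_new_tokens,
                   const TokenCallback& on_token) override {
    assert(max_new_tokens >= 0);
    for (int64_t i = 0; i < max_new_tokens; ++i) {
      last_token += 1;  // one "target decode" per emitted token
      on_token(last_token);
    }
    return max_new_tokens;
  }
};

// Runner-side code depends only on the interface, so a speculative
// generator could be swapped in without touching this function.
std::vector<uint64_t> run_generation(TokenGenerator& gen, uint64_t last_token,
                                     int64_t max_new_tokens) {
  std::vector<uint64_t> out;
  gen.generate(last_token, max_new_tokens,
               [&out](uint64_t t) { out.push_back(t); });
  return out;
}
```

With this shape, `run_generation(gen, 5, 3)` on the toy generator yields tokens 6, 7, 8, and a `SpeculativeTokenGenerator` would only need to honor the same contract while emitting several tokens per target call.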

### 2. Use a method-neutral draft interface

The speculative loop should not know whether proposals come from EAGLE-3, MTP, or another draft method.

Introduce a `SpeculativeDraftRunner` abstraction with responsibilities:

- load draft resources
- propose up to `num_speculative_tokens`
- return draft probabilities needed for sampling correction
- accept or roll back draft state
- declare required target hidden states

P0 implements `Eagle3DraftRunner`. A future `MtpDraftRunner` should plug into the same interface without changing `TextLLMRunner`, target verification, sampling correction, KV commit, or stats.
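
One possible shape for this abstraction, as a sketch only. The method names, `DraftProposal`, and the toy runner are hypothetical; what probability data a draft runner should return is still an open question in this RFC, so full per-token distributions are used here purely as an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical types: illustrative names, not an existing ExecuTorch API.
struct DraftProposal {
  std::vector<uint64_t> tokens;           // at most num_speculative_tokens
  std::vector<std::vector<float>> probs;  // one draft distribution per token
};

class SpeculativeDraftRunner {
 public:
  virtual ~SpeculativeDraftRunner() = default;
  virtual bool load() = 0;
  // Target layers whose hidden states this draft method needs.
  virtual std::vector<int32_t> required_hidden_layers() const = 0;
  virtual DraftProposal propose(uint64_t last_token,
                                const std::vector<float>& target_hidden,
                                int32_t max_tokens) = 0;
  // Keeps state for the first num_accepted proposals, rolls back the rest.
  virtual void commit(int32_t num_accepted) = 0;
};

// Toy draft runner: proposes last_token + 1, + 2, ... with a one-hot
// distribution over a tiny vocabulary, and tracks a draft position.
class ToyDraftRunner : public SpeculativeDraftRunner {
 public:
  bool load() override { return true; }
  std::vector<int32_t> required_hidden_layers() const override { return {}; }
  DraftProposal propose(uint64_t last_token, const std::vector<float>&,
                        int32_t max_tokens) override {
    assert(max_tokens >= 0);
    DraftProposal p;
    for (int32_t i = 0; i < max_tokens; ++i) {
      uint64_t tok = (last_token + 1 + static_cast<uint64_t>(i)) % kVocab;
      std::vector<float> dist(kVocab, 0.0f);
      dist[tok] = 1.0f;
      p.tokens.push_back(tok);
      p.probs.push_back(dist);
    }
    pending_ = max_tokens;
    return p;
  }
  void commit(int32_t num_accepted) override {
    assert(num_accepted >= 0 && num_accepted <= pending_);
    position_ += num_accepted;  // rejected proposals leave no draft state
    pending_ = 0;
  }
  int64_t position() const { return position_; }

 private:
  static constexpr uint64_t kVocab = 8;
  int64_t position_ = 0;
  int32_t pending_ = 0;
};
```

The speculative loop would call only `propose` and `commit`, so an MTP-backed runner could replace the toy one without the loop changing.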

### 3. Keep target hidden-state export general

EAGLE-3 needs selected target layer hidden states, commonly:

```text
(2, n_layers // 2, n_layers - 3)
```

MTP typically needs the final pre-norm hidden state. The target export should therefore expose a general hidden-state descriptor rather than EAGLE-3-only output names.

Hidden-state selection is fixed at export time. Runtime config should not choose arbitrary layer IDs unless the exported target contains them. The target `.pte` should carry metadata describing available hidden-state outputs, and the draft runner should validate that metadata at load time.
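
As an illustration of the load-time check, the sketch below computes the common EAGLE-3 layer triple and tests it against a list of exported layers. Both function names are hypothetical, the metadata is modeled as a plain list of layer indices, and the 32-layer count in the usage note is an arbitrary example, not a Gemma4 31B figure.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Commonly used EAGLE-3 auxiliary layers for a target with n_layers
// decoder layers: (2, n_layers / 2, n_layers - 3).
std::vector<int32_t> default_eagle3_layers(int32_t n_layers) {
  assert(n_layers >= 7);  // keeps the three indices distinct and increasing
  return {2, n_layers / 2, n_layers - 3};
}

// Hypothetical load-time check: every layer the draft requires must be
// among the hidden-state outputs recorded in the target .pte metadata.
bool hidden_states_available(const std::vector<int32_t>& exported,
                             const std::vector<int32_t>& required) {
  return std::all_of(required.begin(), required.end(), [&](int32_t layer) {
    return std::find(exported.begin(), exported.end(), layer) !=
           exported.end();
  });
}
```

For a 32-layer target, `default_eagle3_layers(32)` gives layers 2, 16, and 29; a target that exports only layer 31 would fail `hidden_states_available` and the speculative config would be rejected at load.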

P0 target export should use multiple methods in the same target `.pte`:

```text
target_decode(token, position) -> last_position_logits, requested_hidden_states
target_verify(tokens[K], positions[K]) -> logits[K], requested_hidden_states[K]
target_prefill(tokens[N], positions[N]) -> last_position_logits, requested_hidden_states_for_last_token
```

`target_verify` needs logits and hidden states for all `K` positions. After accepting tokens and selecting the fallback or bonus token, the runtime should select the corresponding hidden state from the verify output for the next draft step, avoiding an extra target decode. `target_decode` is only needed when speculation is disabled or when speculative generation intentionally falls back to target-only decoding. `target_prefill` bootstraps the first speculative draft from the final prompt token.

KV handling is not part of the conceptual return signature. In position-overwrite mode, KV writes happen in place during the forward pass. In explicit scratch/commit mode, `SpecKVCacheManager` owns scratch buffers and commit side effects.

`target_verify` should use `generate_full_logits` or an equivalent full-logits output. `target_decode` and `target_prefill` should stay last-logit oriented for efficiency. Always using full logits for all methods is simpler but should not be the P0 default because it wastes decode work; a third `.pte` for verify is unnecessary unless multimethod export proves insufficient.

### 4. Use separate target and draft `.pte` files in P0

P0 loads one Gemma4 31B target `.pte` and one matching EAGLE-3 draft `.pte`. The draft `.pte` should be produced by a dedicated EAGLE-3 draft export path or config, because its input signature differs from a standard decoder.

The EAGLE-3 draft model consumes:

- last token
- position
- target auxiliary hidden state
- draft KV state

It returns draft logits and updated draft state. For non-greedy P0, the runtime must have enough draft probability information to perform rejection sampling.

Speculative decoding is opt-in because the draft model and draft KV cache increase memory use. P0 validation should report target memory, draft model memory, draft KV memory, and peak memory on CUDA and MLX.

For planning, assume the EAGLE-3 draft head is much smaller than the 31B target but still material: on the order of 100M to 300M parameters, a placeholder estimate until the Gemma4 31B artifact is measured. P0 should record the actual exported draft parameter count and memory footprint; deployments should expect draft weights plus draft KV to be the main incremental cost.

### 5. Define the target KV strategy

Target verification must not let rejected tokens affect future generation.

P0 can use either path, depending on the exported cache style:

1. Position-overwrite for position-indexed implicit KV: verify writes positions `[pos, pos + K)`, then roll the logical position forward only by the accepted count. Stale entries after the accepted prefix are harmless if masks only attend to positions before the current logical position, and later verification overwrites them before attention can read them.
2. Explicit scratch/commit: verify into scratch KV/update buffers, then commit only accepted positions.

The runtime invariant is:

```text
committed target KV is valid through the last emitted token
rejected KV entries are either unreachable by mask or never committed
```

P0 must pick and validate the strategy per target export/backend combination. If a cache implementation cannot guarantee this invariant, speculative mode must be rejected for that artifact.

For fused cache-update attention paths such as `sdpa_with_kv_cache`, position-overwrite is valid only if the forward pass writes new K/V before attention reads the cache. A non-fused path that reads the cache before updating it can leak stale rejected entries and must use explicit scratch/commit or be rejected.

`SpecKVCacheManager` should own this policy decision and the logical position bookkeeping. In position-overwrite mode, it tracks committed position and validates the mask/update ordering invariant. In explicit mode, it owns scratch buffers and commits accepted updates.
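
The position-overwrite bookkeeping can be illustrated with a toy cache. `ToyPositionOverwriteCache` is a hypothetical stand-in that stores one integer per position instead of K/V tensors, and it models the mask as "attention reads only positions before the logical position".

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model of position-overwrite mode: one integer per KV position.
class ToyPositionOverwriteCache {
 public:
  explicit ToyPositionOverwriteCache(std::size_t capacity)
      : slots_(capacity, -1) {}

  // Verify writes its window in place at [committed, committed + K).
  bool write_window(const std::vector<int64_t>& entries) {
    if (committed_ + entries.size() > slots_.size()) return false;
    for (std::size_t i = 0; i < entries.size(); ++i) {
      slots_[committed_ + i] = entries[i];
    }
    window_ = entries.size();
    return true;
  }

  // Advances the logical position by the accepted count only. Entries
  // past it stay in memory but are never visible.
  void commit(std::size_t accepted) {
    assert(accepted <= window_);
    committed_ += accepted;
    window_ = 0;
  }

  std::size_t committed() const { return committed_; }

  // What a mask bounded by the logical position lets attention read.
  std::vector<int64_t> visible() const {
    return std::vector<int64_t>(
        slots_.begin(),
        slots_.begin() + static_cast<std::ptrdiff_t>(committed_));
  }

 private:
  std::vector<int64_t> slots_;
  std::size_t committed_ = 0;
  std::size_t window_ = 0;
};
```

Writing `{10, 11, 12}` and committing one entry leaves only `10` visible; the next window starts at position 1 and overwrites the stale `11` and `12` before anything can read them.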

### 6. Support greedy and non-greedy semantics in P0

Greedy acceptance:

```text
accept draft_token[i] iff draft_token[i] == argmax(target_logits[i])
```
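
A sketch of the greedy rule, assuming `target_logits[i]` is the target's distribution for the position of `draft[i]`, and that an optional extra row after the last draft position supplies a bonus token. The function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <vector>

uint64_t argmax(const std::vector<float>& row) {
  assert(!row.empty());
  return static_cast<uint64_t>(
      std::distance(row.begin(), std::max_element(row.begin(), row.end())));
}

// Returns the tokens to emit: the accepted draft prefix, then either the
// target's correction at the first mismatch or, on full acceptance, a
// bonus token if an extra logits row is present.
std::vector<uint64_t> greedy_verify(
    const std::vector<uint64_t>& draft,
    const std::vector<std::vector<float>>& target_logits) {
  assert(target_logits.size() >= draft.size());
  std::vector<uint64_t> emitted;
  for (std::size_t i = 0; i < draft.size(); ++i) {
    const uint64_t best = argmax(target_logits[i]);
    emitted.push_back(best);  // equals draft[i] on accept, else correction
    if (best != draft[i]) return emitted;
  }
  if (target_logits.size() > draft.size()) {
    emitted.push_back(argmax(target_logits[draft.size()]));  // bonus token
  }
  return emitted;
}
```

Every emitted token is the target argmax at its position, which is why greedy speculative output can match target-only greedy output exactly.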

Non-greedy acceptance should use standard speculative decoding correction:

```text
accept x with probability min(1, p_target(x) / p_draft(x))
on rejection, sample from normalize(max(p_target - p_draft, 0))
on full acceptance, sample a bonus token from the final target distribution when available
```
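
The correction rule above, sketched for a single speculative position. `verify_one`, `residual`, and `sample` are hypothetical helper names, and the probabilities are assumed to be the already-processed target and draft distributions over the same vocabulary.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// normalize(max(p_target - p_draft, 0)); falls back to p_target when the
// two distributions are identical and the residual mass is zero.
std::vector<double> residual(const std::vector<double>& p_target,
                             const std::vector<double>& p_draft) {
  assert(p_target.size() == p_draft.size());
  std::vector<double> r(p_target.size());
  double sum = 0.0;
  for (std::size_t i = 0; i < r.size(); ++i) {
    r[i] = std::max(p_target[i] - p_draft[i], 0.0);
    sum += r[i];
  }
  if (sum <= 0.0) return p_target;
  for (double& v : r) v /= sum;
  return r;
}

// Inverse-CDF sampling with a uniform draw u in [0, 1).
std::size_t sample(const std::vector<double>& probs, double u) {
  assert(!probs.empty());
  double cdf = 0.0;
  for (std::size_t i = 0; i < probs.size(); ++i) {
    cdf += probs[i];
    if (u < cdf) return i;
  }
  return probs.size() - 1;  // guards against rounding in the last bucket
}

struct StepResult {
  bool accepted;
  std::size_t token;
};

// Accepts draft token x with probability min(1, p_target(x) / p_draft(x));
// on rejection, samples the replacement from the residual distribution.
StepResult verify_one(std::size_t x, const std::vector<double>& p_target,
                      const std::vector<double>& p_draft, std::mt19937& rng) {
  assert(x < p_draft.size() && p_draft[x] > 0.0);
  std::uniform_real_distribution<double> unif(0.0, 1.0);
  const double ratio = std::min(1.0, p_target[x] / p_draft[x]);
  if (unif(rng) < ratio) return {true, x};
  return {false, sample(residual(p_target, p_draft), unif(rng))};
}
```

A chain verifier would call `verify_one` position by position and stop at the first rejection; seeding the generator keeps the non-greedy tests reproducible.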

Only support logit processors that can be replayed consistently across speculative positions. Unsupported processors should fail clearly or intentionally fall back.

### 7. Target CUDA and MLX first

CUDA and MLX are P0 backends. The runtime interfaces should stay backend-neutral, but validation and performance work should prove the path on both backends before calling P0 complete.

### 8. Target Gemma4 31B first

Gemma4 31B is the P0 model target. The implementation should avoid Gemma-specific runtime assumptions where practical, but P0 validation and performance work should focus on this model.

## Runtime Flow

Step 1 runs once, after prompt prefill. Steps 2 through 9 repeat for each speculative generation step:

1. Prefill returns the hidden state for the final prompt token to bootstrap the first draft.
2. Draft proposes `K` tokens from the last emitted token and required target hidden state.
3. Target verifies the `K` proposed tokens and returns logits and requested hidden states for all `K` positions.
4. Runtime applies greedy or non-greedy acceptance.
5. Accepted tokens are emitted.
6. A fallback or bonus token is emitted when needed.
7. Target KV advances only to the emitted-token position.
8. Draft state accepts or rolls back to match the committed sequence.
9. Required target hidden state for the next draft step is selected from the verify output.
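
The loop above can be exercised end to end with toy models. In this sketch both "models" are hypothetical arithmetic functions over a 7-token vocabulary, hidden states and KV are omitted, acceptance is greedy, and a bonus token is assumed to be available on full acceptance.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr uint64_t kVocab = 7;

// Toy target: deterministic next-token function.
uint64_t target_next(uint64_t t) { return (t * 3 + 1) % kVocab; }
// Toy draft: agrees with the target except after tokens divisible by 3.
uint64_t draft_next(uint64_t t) {
  return t % 3 == 0 ? (t + 1) % kVocab : target_next(t);
}

std::vector<uint64_t> generate_target_only(uint64_t start, std::size_t n) {
  std::vector<uint64_t> out;
  uint64_t cur = start;
  while (out.size() < n) {
    cur = target_next(cur);
    out.push_back(cur);
  }
  return out;
}

struct SpecStats {
  int target_verify_calls = 0;
  int accepted = 0;
  int rejected = 0;
};

std::vector<uint64_t> generate_speculative(uint64_t start, std::size_t n,
                                           int k, SpecStats& stats) {
  assert(k > 0);
  std::vector<uint64_t> out;
  uint64_t cur = start;
  while (out.size() < n) {
    // Draft proposes a chain of k tokens from the last emitted token.
    std::vector<uint64_t> draft;
    uint64_t d = cur;
    for (int i = 0; i < k; ++i) {
      d = draft_next(d);
      draft.push_back(d);
    }
    // One verify call scores every proposed position.
    ++stats.target_verify_calls;
    uint64_t prev = cur;
    bool rejected = false;
    for (int i = 0; i < k && out.size() < n; ++i) {
      const uint64_t want = target_next(prev);
      out.push_back(want);  // accepted draft token or target correction
      prev = want;
      if (want != draft[static_cast<std::size_t>(i)]) {
        ++stats.rejected;
        rejected = true;
        break;
      }
      ++stats.accepted;
    }
    // Full acceptance: emit a bonus token if budget remains.
    if (!rejected && out.size() < n) {
      prev = target_next(prev);
      out.push_back(prev);
    }
    cur = prev;  // next step continues from the last emitted token
  }
  return out;
}
```

Because every emitted token comes from the target function, the speculative output equals the target-only output for any start token; only the number of verify calls changes with the draft's accuracy.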

## Configuration Sketch

```cpp
struct SpeculativeConfig {
  bool enabled = false;
  SpeculativeMethod method = SpeculativeMethod::Eagle3;
  std::string draft_model_path;
  std::vector<std::string> draft_data_files;
  int32_t num_speculative_tokens = 3;
  SamplingMode sampling_mode = SamplingMode::TargetCompatible;
};
```

Target hidden-state IDs should come from exported model metadata, not from runtime-only configuration.
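
A sketch of load-time validation for this config. The enum members other than `Eagle3` and `TargetCompatible`, the `validate` function, and its error strings are assumptions; whether an invalid config falls back or returns an error is still an open question, so the sketch only reports a reason and leaves the policy to the caller.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

enum class SpeculativeMethod { Eagle3 };
enum class SamplingMode { Greedy, TargetCompatible };  // Greedy is assumed

struct SpeculativeConfig {
  bool enabled = false;
  SpeculativeMethod method = SpeculativeMethod::Eagle3;
  std::string draft_model_path;
  std::vector<std::string> draft_data_files;
  int32_t num_speculative_tokens = 3;
  SamplingMode sampling_mode = SamplingMode::TargetCompatible;
};

// Hypothetical check: returns an empty string when the config is usable,
// otherwise a human-readable reason.
std::string validate(const SpeculativeConfig& cfg, int32_t batch_size) {
  if (!cfg.enabled) return "";  // target-only generation needs no draft
  if (cfg.draft_model_path.empty()) return "draft_model_path is required";
  if (cfg.num_speculative_tokens < 1) {
    return "num_speculative_tokens must be >= 1";
  }
  if (batch_size != 1) return "P0 supports batch size 1 only";
  return "";
}
```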

## Validation

Correctness:

- Greedy speculative output exactly matches greedy target-only output.
- Reject, accept, and mixed accept/reject cases do not corrupt target KV.
- In position-overwrite KV mode, tests show that stale rejected entries are unreachable; otherwise the target uses explicit scratch/commit.
- Gemma4 31B export on CUDA and MLX verifies cache update happens before attention reads when using position-overwrite.
- EOS and `max_new_tokens` are handled correctly across multi-token emission.
- Seeded non-greedy tests validate target-compatible sampling correction.
- Unsupported logit processors fail clearly or fall back intentionally.
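
The EOS and `max_new_tokens` case deserves care because one verify step can produce several tokens at once. A minimal sketch of clipping a multi-token emission follows; `clip_emission` and `ClippedEmission` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct ClippedEmission {
  std::vector<uint64_t> tokens;
  bool hit_eos = false;
};

// Keeps tokens from one speculative step up to and including the first
// EOS, and never more than the remaining max_new_tokens budget.
ClippedEmission clip_emission(const std::vector<uint64_t>& window,
                              uint64_t eos_id, std::size_t remaining_budget) {
  ClippedEmission out;
  for (uint64_t tok : window) {
    if (out.tokens.size() == remaining_budget) break;
    out.tokens.push_back(tok);
    if (tok == eos_id) {
      out.hit_eos = true;
      break;
    }
  }
  return out;
}
```

Tokens dropped by the clip must also be excluded from the committed KV position and from acceptance stats, otherwise the counters would drift from what the caller actually received.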

Runtime:

- Draft load failure reports a clear error.
- A target `.pte` without the required hidden-state methods causes the speculative config to be rejected.
- A hidden-state metadata mismatch between target and draft causes the speculative config to be rejected.
- Target verify method returns full logits and requested hidden states for all verification positions.
- Target decode and prefill methods avoid full-logits output unless needed.
- Prefill returns the hidden state needed to bootstrap the first draft.
- Gemma4 31B target and draft artifacts load and run on both P0 backends.
- Batch size greater than 1 is rejected or falls back to target-only decoding in P0.
- CUDA and MLX both pass the P0 correctness suite.

Performance:

- Report draft calls, target verify calls, target decode calls, accepted tokens, rejected tokens, and acceptance rate.
- Report target memory, draft parameter count, draft model memory, draft KV memory, and peak memory.
- Compare tokens/sec against target-only generation on CUDA and MLX.
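
One way the counters could be grouped, as a sketch. `SpeculativeStats` and `record_verify` are hypothetical, and the definitions used here are assumptions: a rejected token is any drafted token that was not accepted, and acceptance rate is accepted divided by drafted.

```cpp
#include <cassert>
#include <cstdint>

struct SpeculativeStats {
  int64_t draft_calls = 0;
  int64_t target_verify_calls = 0;
  int64_t target_decode_calls = 0;
  int64_t accepted_tokens = 0;
  int64_t rejected_tokens = 0;

  // Records the outcome of one verify call over `proposed` draft tokens.
  void record_verify(int64_t proposed, int64_t accepted) {
    assert(accepted >= 0 && accepted <= proposed);
    ++target_verify_calls;
    accepted_tokens += accepted;
    rejected_tokens += proposed - accepted;
  }

  // Fraction of drafted tokens that the target accepted.
  double acceptance_rate() const {
    const int64_t drafted = accepted_tokens + rejected_tokens;
    return drafted == 0
               ? 0.0
               : static_cast<double>(accepted_tokens) /
                     static_cast<double>(drafted);
  }
};
```

Two verify calls that each accept 3 of 4 drafted tokens give an acceptance rate of 0.75.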

## Rollout

1. Add `TokenGenerator`; keep existing `TextTokenGenerator` behavior unchanged.
2. Decide and validate the target KV strategy for Gemma4 31B on CUDA and MLX.
3. Add target hidden-state export, prefill hidden-state bootstrap, and full-logits verify method.
4. Add EAGLE-3 draft export.
5. Add `SpeculativeDraftRunner`, `Eagle3DraftRunner`, `Eagle3Verifier`, `SpecKVCacheManager`, and `SpeculativeTokenGenerator`.
6. Add greedy verification.
7. Add non-greedy rejection sampling.
8. Validate memory and performance for Gemma4 31B on CUDA.
9. Validate memory and performance for Gemma4 31B on MLX.
10. Confirm, for example with a stub `MtpDraftRunner`, that the draft interface can support MTP without changing runner or verifier contracts.
11. Add batch size greater than 1 in P1.

## Open Questions

- Should Gemma4 31B P0 use position-overwrite implicit KV, explicit scratch/commit KV, or support both?
- What hidden-state descriptor should be standardized for EAGLE-3 and MTP?
- What probability data should draft runners return: full logits, post-processor logits, or sampled-token probabilities?
- Should unsupported configs fall back or return `InvalidArgument` by default?
- Where should speculative stats live?

## P0 Acceptance Criteria

- `TextLLMRunner` can be configured with an EAGLE-3 draft `.pte`.
- Gemma4 31B target and matching EAGLE-3 draft run on CUDA and MLX.
- Existing non-speculative generation is behaviorally unchanged.
- Greedy speculative output matches target-only greedy output.
- Non-greedy speculative sampling uses target-compatible rejection sampling.
- Rejected speculative tokens do not corrupt target KV.
- Target prefill bootstraps the first draft with the required hidden state.
- Target verify returns full logits and requested hidden states for all verified positions.
- Memory overhead is reported for the target, draft model, draft KV, and peak runtime.
- CUDA and MLX pass correctness and basic performance validation.
- The draft-runner interface can accommodate MTP without changing runner or verifier contracts.
- Acceptance stats are visible.


