Refactor/tornadovm planning #117
Open
orionpapadakis wants to merge 6 commits into
Reorganize TornadoVM execution planning around forward modes, model families, and quantization-specific components.
…TornadoVM components
…d `AbstractLogitsLayer` to `AbstractLogitsTaskGraph`, updating all references to improve clarity and align with naming conventions.
…ing with updated naming conventions.
…ill-decode and CUDA-graph variants
mikepapadim
reviewed
May 30, 2026
Comment on lines +157 to +162
MemorySegment tokenEmbeddings = weights.getTokenEmbeddingTable().asByteArray().getSegment();
int blocksPerToken = (configuration.dim() + 31) / 32;
long bytesPerToken = (long) blocksPerToken * 34;
MemorySegment.copy(tokenEmbeddings, (long) token * bytesPerToken,
        state.embeddingX.getSegment(), 0, bytesPerToken);
}
Member
Maybe this should be a method of its own. Same for the one above.
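One way to act on this suggestion is to pull the Q8_0 row arithmetic into a small helper. The sketch below is only illustrative: `Q8EmbeddingRows`, `bytesPerToken`, and `copyRow` are hypothetical names, and it copies between plain `byte[]` arrays instead of the `MemorySegment` objects used in the PR so that it stays self-contained. The constants (32 values and 34 bytes per block) are taken from the snippet under review.

```java
/** Hypothetical helper; names are illustrative and not taken from the PR. */
final class Q8EmbeddingRows {
    static final int BLOCK_SIZE = 32;   // values per Q8_0 block, as in the reviewed snippet
    static final int BLOCK_BYTES = 34;  // bytes per Q8_0 block, as in the reviewed snippet

    private Q8EmbeddingRows() { }

    /** Bytes occupied by one token's embedding row of {@code dim} values. */
    static long bytesPerToken(int dim) {
        int blocksPerToken = (dim + BLOCK_SIZE - 1) / BLOCK_SIZE; // round up to whole blocks
        return (long) blocksPerToken * BLOCK_BYTES;
    }

    /** Copies the row for {@code token} from {@code table} into the start of {@code dst}. */
    static void copyRow(byte[] table, int token, int dim, byte[] dst) {
        long rowBytes = bytesPerToken(dim);
        long offset = (long) token * rowBytes;
        // byte[] stand-in for MemorySegment.copy(src, srcOffset, dst, 0, rowBytes)
        System.arraycopy(table, Math.toIntExact(offset), dst, 0, Math.toIntExact(rowBytes));
    }
}
```

With a helper like this, both call sites the reviewer mentions could shrink to a single call, and the block constants would live in one place.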
mikepapadim
reviewed
May 30, 2026
}

// ── Q8_0 Batch Kernels ───────────────────────────────────────────────────

Member
The formatting is odd. Wrap the block in `@formatter:off` / `@formatter:on` and run the autoformatter.
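For reference, the formatter markers are ordinary comments. A minimal sketch, assuming the project's formatter honors `@formatter:off` / `@formatter:on` (IntelliJ IDEA and Eclipse both support these markers once enabled in their formatter settings). The class, the table, and its numbers are hypothetical placeholders; only the marker comments matter here.

```java
/** Hypothetical class; the table values are placeholders, not tuning data. */
final class BatchKernelSizes {

    // @formatter:off
    // ── Q8_0 Batch Kernels ───────────────────────────────────────────────────
    static final int[][] LOCAL_SIZE_BY_BATCH = {
        //  batch,  local size
        {       8,          64 },
        {      16,         128 },
        {      32,         256 },
    };
    // @formatter:on

    private BatchKernelSizes() { }

    /** Returns the local size listed for a batch size, or -1 if none is listed. */
    static int localSizeFor(int batch) {
        for (int[] row : LOCAL_SIZE_BY_BATCH) {
            if (row[0] == batch) {
                return row[1];
            }
        }
        return -1;
    }
}
```

Everything between the two markers keeps its hand-aligned layout; code outside them is reformatted as usual.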
mikepapadim
reviewed
May 30, 2026
}

@Override
protected String predecessorGraphName(int layerIndex) {
Member
Again, formatter: use the annotations, otherwise the first autoformatting pass will flatten this.
mikepapadim
reviewed
May 30, 2026

// ── Embedding preparation ─────────────────────────────────────────────────

@Override public EmbeddingPreparer embeddingPreparer() {
Member
Add Javadoc: this is new functionality and no one else knows what it does.
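A sketch of what such Javadoc might look like. The contract described here is an assumption inferred from the Q8_0 embedding copy earlier in this review, and the `EmbeddingPreparer` shape shown (a single `prepare(int token)` method) is hypothetical, not the PR's actual interface. `ExampleComponents` and its `float[]` buffers are likewise stand-ins.

```java
/** Hypothetical stand-in for the PR's EmbeddingPreparer; the real signature may differ. */
interface EmbeddingPreparer {
    /**
     * Loads the embedding row for {@code token} into the activation input buffer
     * before the forward task graphs run.
     *
     * @param token vocabulary index of the token to embed
     */
    void prepare(int token);
}

/** Hypothetical provider showing where the Javadoc would sit. */
final class ExampleComponents {
    final float[][] embeddingTable; // one row per token (assumed layout)
    final float[] embeddingX;       // activation input buffer (assumed role)

    ExampleComponents(float[][] embeddingTable) {
        this.embeddingTable = embeddingTable;
        this.embeddingX = new float[embeddingTable[0].length];
    }

    /**
     * Returns the strategy that stages a token's embedding for this
     * model/quantization pair. Callers invoke it once per token before
     * executing the forward plan.
     *
     * @return a preparer bound to this provider's weights and state
     */
    EmbeddingPreparer embeddingPreparer() {
        return token -> System.arraycopy(embeddingTable[token], 0, embeddingX, 0, embeddingX.length);
    }
}
```

The useful parts for a reader are the "when is it called" and "what does it write to" sentences; the actual wording should come from whoever knows the real contract.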
mikepapadim
reviewed
May 30, 2026
}

@Override public ActivationTaskGraph standardActivation() {
    return new Activation("activationUpdate", state, weights, config);
Member
Maybe the 'activationUpdate' and 'logits' strings should live in an enum or record that gets reused, instead of having these strings all over the place.
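A minimal sketch of the enum variant of this suggestion. `TaskGraphName` and `id()` are hypothetical names; only the two string values come from the comment above.

```java
/** Hypothetical enum; the type and method names are illustrative. */
enum TaskGraphName {
    ACTIVATION_UPDATE("activationUpdate"),
    LOGITS("logits");

    private final String id;

    TaskGraphName(String id) {
        this.id = id;
    }

    /** The string to pass wherever a task-graph name is expected. */
    String id() {
        return id;
    }
}
```

The call in the reviewed snippet would then read `new Activation(TaskGraphName.ACTIVATION_UPDATE.id(), state, weights, config)`, so a typo in a graph name becomes a compile error instead of a runtime mismatch.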
mikepapadim
reviewed
May 30, 2026
Member
mikepapadim
left a comment
LGTM, some minor changes needed.
This PR reorganizes TornadoVM execution planning around three variant axes: model family, quantization, and execution mode.
The previous structure was mainly shaped around two axes: model family and quantization. With prefill-decode and batch-prefill-decode, execution mode becomes a third axis, which greatly increases the number of combinations each model/quantization pair may need to support.
This refactor introduces forward plans, task-graph layouts, and model/quantization component providers so single-token, prefill-decode, and batch-prefill-decode paths can share one cleaner planning structure instead of growing separate master-plan dispatch logic.
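To make the three-axis idea concrete, here is a rough sketch of a lookup keyed by (model family, quantization, execution mode) that fails fast for unsupported combinations. Every type and method name below is hypothetical and chosen for illustration; the PR's actual forward-plan and provider classes may be organized differently, and the error text is a placeholder.

```java
import java.util.HashMap;
import java.util.Map;

/** All names below are hypothetical; they only illustrate the three-axis lookup. */
enum ModelFamily { LLAMA, MISTRAL, QWEN3 }

enum Quantization { FP16, Q8_0 }

enum ForwardMode { SINGLE_TOKEN, PREFILL_DECODE, BATCH_PREFILL_DECODE }

/** One point in the (family, quantization, mode) space. */
record PlanKey(ModelFamily family, Quantization quantization, ForwardMode mode) { }

/** Stand-in for whatever object builds the task graphs for one combination. */
interface ForwardPlan {
    String describe();
}

final class ForwardPlanRegistry {
    private final Map<PlanKey, ForwardPlan> plans = new HashMap<>();

    void register(PlanKey key, ForwardPlan plan) {
        plans.put(key, plan);
    }

    /** Returns the plan for a combination, or fails fast if none is registered. */
    ForwardPlan resolve(PlanKey key) {
        ForwardPlan plan = plans.get(key);
        if (plan == null) {
            // Placeholder message; the PR's actual error text is not reproduced here.
            throw new UnsupportedOperationException("No forward plan registered for " + key);
        }
        return plan;
    }
}
```

In a layout like this, adding a new execution mode means registering new entries instead of adding another branch to a central dispatch method, and unsupported pairs surface as one explicit error.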
Notes
Verification
use java 21 or 25
setup tornadovm
mvn clean install
llama fp16 (single-token):
./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-F16.gguf --prompt "$LONG_PROMPT" --max-tokens 2048
llama fp16 (prefill-decode):
./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-F16.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode
llama fp16 (batch-prefill-decode):
./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-F16.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode --batch-prefill-size 32
llama fp16 (batch-prefill-decode-CUDA_GRAPHS):
./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-F16.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode --batch-prefill-size 32 --cuda-graphs
llama q8_0 (single-token):
./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-Q8_0.gguf --prompt "$LONG_PROMPT" --max-tokens 2048
llama q8_0 (prefill-decode):
./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-Q8_0.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode
llama q8_0 (batch-prefill-decode):
./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-Q8_0.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode --batch-prefill-size 32
llama q8_0 (batch-prefill-decode-CUDA_GRAPHS):
./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-Q8_0.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode --batch-prefill-size 32 --cuda-graphs
any other model (mistral, qwen3 etc) should also pass with single-token config BUT should fail for any prefill-decode config with the following message: