🚀 The feature, motivation and pitch
ExecuTorch's Android API already has foundational audio support (LlmModule.prefillAudio, AsrModule, C++ MultimodalRunner), but several gaps remain for running decoder-only audio LLMs like Voxtral smoothly from Java/Kotlin.
Current State
- LlmModule has prefillAudio(float[]/byte[], ...) and prefillRawAudio(byte[], ...) -- functional, but array-only, with no file-path input and no typed PCM overloads
- AsrModule is encoder-decoder specific (Whisper-style) -- not applicable to decoder-only audio LLMs
- Model type constants: only MODEL_TYPE_TEXT (1) and MODEL_TYPE_MULTIMODAL (2) -- no audio-specific semantics
Proposed Improvements
1. Add ByteBuffer variants for audio prefill (zero-copy parity with image API)
Image prefill has prefillImages(ByteBuffer, ...) for zero-copy JNI, but audio prefill only accepts arrays. For long audio (Voxtral handles multi-minute clips), the extra JNI copy is wasteful.
```kotlin
// Missing today:
fun prefillAudio(audio: ByteBuffer, batchSize: Int, nBins: Int, nFrames: Int)
fun prefillRawAudio(audio: ByteBuffer, batchSize: Int, nChannels: Int, nSamples: Int)
```
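For illustration, here is a rough sketch of how a caller might stage audio features in a direct buffer so that a ByteBuffer overload could read them without a further copy. The helper name `toDirectFloatBuffer` is hypothetical, and native byte order is an assumption about what the JNI side would expect.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

/**
 * Copies float samples once into a direct, native-order ByteBuffer.
 * Hypothetical helper: not part of the existing ExecuTorch API.
 */
fun toDirectFloatBuffer(samples: FloatArray): ByteBuffer {
    val buffer = ByteBuffer
        .allocateDirect(samples.size * Float.SIZE_BYTES)
        .order(ByteOrder.nativeOrder()) // assumption: native code reads host-order floats
    // Writing through a FloatBuffer view leaves the ByteBuffer's own position at 0.
    buffer.asFloatBuffer().put(samples)
    return buffer
}

// Hypothetical call site, if the proposed overload existed:
// llmModule.prefillAudio(toDirectFloatBuffer(features), batchSize, nBins, nFrames)
```

A direct buffer matters here because native code can address its memory in place, whereas a Java array may have to be copied or pinned when it crosses JNI.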
2. Add WAV file path API to LlmModule
AsrModule can accept WAV paths directly (internally uses load_wav_audio_data()), but LlmModule requires the caller to manually decode audio in Java. For audio LLMs this should be as easy as:
```kotlin
llmModule.prefillAudioFromFile("/path/to/audio.wav")
```
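To make the current burden concrete, this is roughly the decoding a caller has to write by hand today before calling any prefill method. It is a minimal sketch that handles only 16-bit integer PCM; `WavAudio` and `decodePcm16Wav` are hypothetical names, and the scaling to [-1, 1) is an assumption about what the model input expects.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

/** Decoded audio: interleaved samples scaled to [-1, 1). Hypothetical type. */
class WavAudio(val sampleRate: Int, val channels: Int, val samples: FloatArray)

/** Minimal RIFF/WAVE reader for 16-bit integer PCM; other encodings are rejected. */
fun decodePcm16Wav(bytes: ByteArray): WavAudio {
    val buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
    fun tag(offset: Int) = String(bytes, offset, 4, Charsets.US_ASCII)
    require(bytes.size >= 12 && tag(0) == "RIFF" && tag(8) == "WAVE") { "not a RIFF/WAVE file" }

    var sampleRate = 0
    var channels = 0
    var pos = 12
    // Walk the chunk list instead of assuming a fixed 44-byte header.
    while (pos + 8 <= bytes.size) {
        val id = tag(pos)
        val size = buf.getInt(pos + 4)
        val body = pos + 8
        require(size >= 0 && size <= bytes.size - body) { "truncated '$id' chunk" }
        when (id) {
            "fmt " -> {
                require(size >= 16) { "'fmt ' chunk too short" }
                require(buf.getShort(body).toInt() == 1) { "only integer PCM is supported" }
                channels = buf.getShort(body + 2).toInt()
                sampleRate = buf.getInt(body + 4)
                require(buf.getShort(body + 14).toInt() == 16) { "only 16-bit samples are supported" }
            }
            "data" -> {
                require(channels > 0) { "'fmt ' chunk must precede 'data'" }
                val samples = FloatArray(size / 2) { i -> buf.getShort(body + 2 * i) / 32768f }
                return WavAudio(sampleRate, channels, samples)
            }
        }
        pos = body + size + (size and 1) // chunks are padded to an even length
    }
    error("no 'data' chunk found")
}
```

A caller would then pass something like `decodePcm16Wav(File(path).readBytes()).samples` on to the existing array-based prefill, which is the boilerplate a path-based API would remove.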
3. Add MODEL_TYPE_TEXT_AUDIO constant
Currently MODEL_TYPE_MULTIMODAL (aliased as MODEL_TYPE_TEXT_VISION) has vision-centric naming. Adding an explicit audio model type improves discoverability and documentation:
```kotlin
const val MODEL_TYPE_TEXT_AUDIO = 3
```
4. Better raw audio type support
prefillRawAudio(byte[], ...) is awkward -- real PCM audio is typically short[] (16-bit) or float[] (32-bit). Add typed variants:
```kotlin
fun prefillRawAudio(audio: ShortArray, batchSize: Int, nChannels: Int, nSamples: Int) // PCM-16
fun prefillRawAudio(audio: FloatArray, batchSize: Int, nChannels: Int, nSamples: Int) // float32
```
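As a sketch of what the typed variants would save, these are the two conversions callers tend to write themselves when they hold PCM-16 samples. Both helper names are hypothetical, and the little-endian layout is an assumption, since the byte format expected by prefillRawAudio(byte[], ...) is not specified here.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

/** Scales signed 16-bit PCM to float32 in [-1, 1). Hypothetical helper. */
fun pcm16ToFloat(pcm: ShortArray): FloatArray =
    FloatArray(pcm.size) { i -> pcm[i] / 32768f }

/**
 * Packs 16-bit PCM into bytes for a byte[]-based API.
 * Assumption: little-endian sample order; adjust if the native side expects otherwise.
 */
fun pcm16ToLittleEndianBytes(pcm: ShortArray): ByteArray {
    val out = ByteBuffer.allocate(pcm.size * Short.SIZE_BYTES).order(ByteOrder.LITTLE_ENDIAN)
    out.asShortBuffer().put(pcm)
    return out.array()
}
```

With ShortArray and FloatArray overloads, neither helper would be needed at the call site, and the byte-order question would be settled inside the library rather than by each app.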
5. Audio-specific configuration in LlmModuleConfig
Add fields for audio preprocessing parameters:
```kotlin
data class LlmModuleConfig(
    // ... existing fields ...
    val sampleRate: Int = 16000,
    val preprocessorPath: String? = null, // optional .pte for mel spectrogram extraction
)
```
6. Unified multimodal generation entry point
Currently audio prefill and text generation are separate calls with no way to combine them in a single config. Consider a builder pattern:
```kotlin
llmModule.generate {
    audio("/path/to/audio.wav")
    prompt("Transcribe the above audio:")
    maxSeqLen(512)
    onToken { token -> /* stream */ }
}
```
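One possible shape for such a builder, sketched with a Kotlin lambda-with-receiver. Everything here is hypothetical: `GenerateRequest`, `GenerateRequestBuilder`, and `generateRequest` do not exist in ExecuTorch, and the default of 128 for the sequence length is an arbitrary placeholder. The sketch only collects the options; wiring them to prefill and generate calls is left out.

```kotlin
/** Hypothetical immutable request produced by the builder. */
data class GenerateRequest(
    val audioPath: String?,
    val prompt: String,
    val maxSeqLen: Int,
    val onToken: (String) -> Unit,
)

/** Hypothetical DSL receiver; method names mirror the proposal above. */
class GenerateRequestBuilder {
    private var audioFile: String? = null
    private var promptText: String = ""
    private var maxLen: Int = 128 // placeholder default, not an ExecuTorch value
    private var tokenCallback: (String) -> Unit = {}

    fun audio(path: String) { audioFile = path }
    fun prompt(text: String) { promptText = text }
    fun maxSeqLen(value: Int) {
        require(value > 0) { "maxSeqLen must be positive" }
        maxLen = value
    }
    fun onToken(callback: (String) -> Unit) { tokenCallback = callback }

    fun build() = GenerateRequest(audioFile, promptText, maxLen, tokenCallback)
}

/** Runs the configuration block against a fresh builder and returns the result. */
fun generateRequest(configure: GenerateRequestBuilder.() -> Unit): GenerateRequest =
    GenerateRequestBuilder().apply(configure).build()
```

Collecting the options into one immutable request before any native call keeps validation in one place, and the same request type could later carry image inputs without changing the entry point.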
Alternatives
No response
Additional context
No response
RFC (Optional)
No response