[Android] Kotlin API improvements for audio LLM models

### 🚀 The feature, motivation and pitch

ExecuTorch's Android API already has foundational audio support (`LlmModule.prefillAudio`, `AsrModule`, C++ `MultimodalRunner`), but several gaps remain for running decoder-only audio LLMs like Voxtral smoothly from Java/Kotlin.

Current State

- `LlmModule` has `prefillAudio(float[]/byte[], ...)` and `prefillRawAudio(byte[], ...)` -- works but limited
- `AsrModule` is encoder-decoder specific (Whisper-style) -- not applicable to decoder-only audio LLMs
- Model type constants: only `MODEL_TYPE_TEXT` (1) and `MODEL_TYPE_MULTIMODAL` (2) -- no audio-specific semantics

Proposed Improvements

1. Add ByteBuffer variants for audio prefill (zero-copy parity with image API)

Image prefill has prefillImages(ByteBuffer, ...) for zero-copy JNI, but audio prefill only accepts arrays. For long audio (Voxtral handles multi-minute clips), the extra JNI copy is wasteful.
```kotlin
// Missing today:
fun prefillAudio(audio: ByteBuffer, batchSize: Int, nBins: Int, nFrames: Int)
fun prefillRawAudio(audio: ByteBuffer, batchSize: Int, nChannels: Int, nSamples: Int)
```

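For illustration, here is a minimal sketch of how a caller might prepare a direct buffer for such an overload. `toDirectFloatBuffer` is a hypothetical helper, not an ExecuTorch API, and it assumes the native side would read float32 values in native byte order.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Hypothetical helper (not part of ExecuTorch): packs float32 samples into a
// direct ByteBuffer, which native code can address without a further JNI copy.
// Assumption: the native side expects float32 in native byte order.
fun toDirectFloatBuffer(samples: FloatArray): ByteBuffer {
    val buffer = ByteBuffer
        .allocateDirect(samples.size * Float.SIZE_BYTES)
        .order(ByteOrder.nativeOrder())
    // The float view has its own position, so `buffer` still spans every byte.
    buffer.asFloatBuffer().put(samples)
    return buffer
}
```

This helper still copies once from the `FloatArray`. The real saving would come from a capture or decode pipeline that writes into the direct buffer in the first place and then hands it to the proposed `prefillAudio(audio: ByteBuffer, ...)`.
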
2. Add WAV file path API to LlmModule

AsrModule can accept WAV paths directly (internally uses load_wav_audio_data()), but LlmModule requires the caller to manually decode audio in Java. For audio LLMs this should be as easy as:
```kotlin
llmModule.prefillAudioFromFile("/path/to/audio.wav")
```

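To give a sense of the manual decoding callers do today, here is a rough sketch of a WAV reader. `DecodedWav` and `decodePcm16Wav` are hypothetical names, and the sketch only handles uncompressed 16-bit PCM. It does not reflect how `load_wav_audio_data()` is implemented.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Hypothetical container for decoded audio; not an ExecuTorch type.
class DecodedWav(val sampleRate: Int, val channels: Int, val samples: FloatArray)

// Minimal sketch: walks the RIFF chunks and decodes 16-bit PCM to float32.
fun decodePcm16Wav(bytes: ByteArray): DecodedWav {
    val buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
    fun tag(offset: Int) = String(bytes, offset, 4, Charsets.US_ASCII)
    require(bytes.size >= 12 && tag(0) == "RIFF" && tag(8) == "WAVE") { "not a RIFF/WAVE file" }

    var sampleRate = 0
    var channels = 0
    var pos = 12
    while (pos + 8 <= bytes.size) {
        val id = tag(pos)
        val size = buf.getInt(pos + 4)
        val body = pos + 8
        require(size >= 0 && size <= bytes.size - body) { "truncated chunk" }
        when (id) {
            "fmt " -> {
                require(size >= 16) { "fmt chunk too short" }
                require(buf.getShort(body).toInt() == 1) { "only PCM is handled" }
                channels = buf.getShort(body + 2).toInt()
                sampleRate = buf.getInt(body + 4)
                require(buf.getShort(body + 14).toInt() == 16) { "only 16-bit samples are handled" }
            }
            "data" -> {
                require(channels > 0) { "fmt chunk must precede data" }
                // Interleaved samples, normalized to [-1.0, 1.0).
                val samples = FloatArray(size / 2) { i -> buf.getShort(body + 2 * i) / 32768f }
                return DecodedWav(sampleRate, channels, samples)
            }
        }
        pos = body + size + (size and 1) // chunks are padded to an even length
    }
    error("no data chunk found")
}
```

Even with a decoder like this, the caller still has to apply whatever resampling or feature extraction the model expects before calling a prefill method, which is the boilerplate a file-path API would absorb.
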
3. Add MODEL_TYPE_TEXT_AUDIO constant

Currently MODEL_TYPE_MULTIMODAL (aliased as MODEL_TYPE_TEXT_VISION) has vision-centric naming. Adding an explicit audio model type improves discoverability and documentation:
```kotlin
const val MODEL_TYPE_TEXT_AUDIO = 3
```

4. Better raw audio type support

prefillRawAudio(byte[], ...) is awkward -- real PCM audio is typically short[] (16-bit) or float[] (32-bit). Add typed variants:
```kotlin
fun prefillRawAudio(audio: ShortArray, batchSize: Int, nChannels: Int, nSamples: Int)  // PCM-16
fun prefillRawAudio(audio: FloatArray, batchSize: Int, nChannels: Int, nSamples: Int)  // float32
```

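As a sketch of the conversions callers otherwise write by hand, the helpers below show what typed overloads might do internally. Both function names are hypothetical. The little-endian layout and the 1/32768 normalization are assumptions, not confirmed behavior of `prefillRawAudio`.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Hypothetical helpers; neither exists in ExecuTorch.

// PCM-16 -> float32 in [-1.0, 1.0), a common normalization for model input.
fun pcm16ToFloat(pcm: ShortArray): FloatArray =
    FloatArray(pcm.size) { i -> pcm[i] / 32768f }

// PCM-16 -> raw bytes, the packing a caller needs for a byte[]-only API.
// Assumption: the consumer expects little-endian sample layout.
fun pcm16ToBytes(pcm: ShortArray): ByteArray {
    val out = ByteBuffer.allocate(pcm.size * Short.SIZE_BYTES).order(ByteOrder.LITTLE_ENDIAN)
    // The short view inherits the byte order set on `out` above.
    out.asShortBuffer().put(pcm)
    return out.array()
}
```

Moving this into the library would remove a class of endianness and scaling mistakes from application code.
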
5. Audio-specific configuration in LlmModuleConfig

Add fields for audio preprocessing parameters:
```kotlin
data class LlmModuleConfig(
    // ... existing fields ...
    val sampleRate: Int = 16000,
    val preprocessorPath: String? = null,  // optional .pte for mel spectrogram extraction
)
```

6. Unified multimodal generation entry point

Currently audio prefill and text generation are separate calls with no way to combine them in a single config. Consider a builder pattern:
```kotlin
llmModule.generate {
    audio("/path/to/audio.wav")
    prompt("Transcribe the above audio:")
    maxSeqLen(512)
    onToken { token -> /* stream */ }
}
```

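One possible shape for such a builder, sketched with a Kotlin lambda-with-receiver. Every name here (`GenerationRequest`, `GenerationRequestBuilder`, `generationRequest`) is hypothetical, and the default `maxSeqLen` is a placeholder. The sketch only collects the settings and does not run a model.

```kotlin
// All names below are hypothetical; none of this exists in ExecuTorch today.
class GenerationRequest(
    val audioPath: String?,
    val prompt: String,
    val maxSeqLen: Int,
    val onToken: (String) -> Unit
)

class GenerationRequestBuilder {
    private var audioPath: String? = null
    private var promptText: String = ""
    private var maxLen: Int = 128 // placeholder default, not an ExecuTorch value
    private var tokenCallback: (String) -> Unit = {}

    fun audio(path: String) { audioPath = path }
    fun prompt(text: String) { promptText = text }
    fun maxSeqLen(value: Int) {
        require(value > 0) { "maxSeqLen must be positive" }
        maxLen = value
    }
    fun onToken(callback: (String) -> Unit) { tokenCallback = callback }

    fun build() = GenerationRequest(audioPath, promptText, maxLen, tokenCallback)
}

// Entry point: runs the DSL block against a fresh builder and returns the result.
fun generationRequest(block: GenerationRequestBuilder.() -> Unit): GenerationRequest =
    GenerationRequestBuilder().apply(block).build()
```

A `generate { ... }` method on `LlmModule` could build a request this way and then perform the audio prefill and text generation in order, so the caller never sequences the two calls manually.
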

### Alternatives

_No response_

### Additional context

_No response_

### RFC (Optional)

_No response_

[Android] Kotlin API improvements for audio LLM models #19817
