
Optimize omni merge #1255

Open

WANDY666 wants to merge 22 commits into main from optimize_omni_merge

Conversation

@WANDY666 (Contributor) commented Apr 3, 2026

Optimized the audio path

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces significant optimizations and features for audio multimodal processing, including the addition of Triton autotune kernel configurations for the RTX 5090, implementation of audio preloading and warmup mechanisms, and the introduction of a prompt encoding cache. The server components were updated to support audio-specific batch sizes and data parallelism, while the multimodal parameter handling was refactored to support shared memory formats and more efficient resource allocation. Review feedback highlighted a bug where the device was hardcoded to CUDA in audio preprocessing, and suggested offloading CPU-bound tasks like audio loading and MD5 hashing to threads to avoid blocking the asynchronous event loop.
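The prompt encoding cache mentioned in the summary could take many shapes; the sketch below is one plausible form (a bounded LRU memo keyed by a digest of the prompt), written as an assumption for illustration, not the PR's actual implementation — the class and method names are invented.

```python
import hashlib
from collections import OrderedDict


class PromptEncodeCache:
    """Hypothetical sketch of a bounded prompt -> token-id cache
    (class and method names are illustrative, not from the PR)."""

    def __init__(self, max_entries: int = 1024):
        self.max_entries = max_entries
        self._cache: OrderedDict = OrderedDict()

    def get_or_encode(self, prompt: str, encode):
        # Key on a digest so long prompts stay cheap to look up.
        key = hashlib.md5(prompt.encode("utf-8")).hexdigest()
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as most recently used
            return self._cache[key]
        token_ids = encode(prompt)
        self._cache[key] = token_ids
        if len(self._cache) > self.max_entries:
            self._cache.popitem(last=False)  # evict least recently used
        return token_ids
```

Repeated requests with identical prompts then skip tokenization entirely, which matters most when the same system prompt prefixes every request.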

Comment on lines +126 to +128
```python
compact_features = torch.from_numpy(extracted[:, :, :num_frames]).to(device="cuda", dtype=torch.bfloat16)
compact_features = compact_features[0].contiguous()
feature_lens = torch.tensor([num_frames], device="cuda", dtype=torch.long)
```

high

The `device` argument passed to `_preprocess_single_padded` is ignored when creating the `compact_features` and `feature_lens` tensors, as they are hardcoded to `"cuda"`. This will cause failures if the intended device is `"cpu"` (the default value of the `device` parameter) or a specific GPU index.

Suggested change:

```diff
-compact_features = torch.from_numpy(extracted[:, :, :num_frames]).to(device="cuda", dtype=torch.bfloat16)
-compact_features = compact_features[0].contiguous()
-feature_lens = torch.tensor([num_frames], device="cuda", dtype=torch.long)
+compact_features = torch.from_numpy(extracted[:, :, :num_frames]).to(device=device, dtype=torch.bfloat16)
+compact_features = compact_features[0].contiguous()
+feature_lens = torch.tensor([num_frames], device=device, dtype=torch.long)
```

```diff
-audio_values, _ = librosa.load(BytesIO(audio_data), sr=16000)
+from lightllm.models.whisper.defaults import MIN_AUDIO_LEN
+decode_start = time.time()
+audio_values, _ = librosa.load(BytesIO(audio_data), sr=target_sample_rate)
```

high

`librosa.load` is a CPU-bound blocking operation. Calling it directly within an async method like `preload` will block the event loop, preventing the server from handling other concurrent requests; this is particularly impactful for long audio files. Consider offloading it to a thread pool using `asyncio.to_thread` (Python 3.9+) or `loop.run_in_executor`.

Suggested change:

```diff
-audio_values, _ = librosa.load(BytesIO(audio_data), sr=target_sample_rate)
+audio_values, _ = await asyncio.to_thread(librosa.load, BytesIO(audio_data), sr=target_sample_rate)
```
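As a self-contained sketch of the pattern the reviewer suggests (the `_decode_audio` stand-in below replaces `librosa.load` purely for illustration; function names are not from the PR):

```python
import asyncio
import hashlib


def _decode_audio(data: bytes) -> str:
    # Stand-in for a CPU-bound decode such as librosa.load; hashing the
    # bytes just simulates work that would otherwise block the event loop.
    return hashlib.md5(data).hexdigest()


async def preload(audio_data: bytes) -> str:
    # Offload the blocking call to a worker thread so the server can keep
    # accepting other requests while the audio is being decoded.
    return await asyncio.to_thread(_decode_audio, audio_data)


result = asyncio.run(preload(b"\x00" * 16))
```

`asyncio.to_thread` forwards positional and keyword arguments to the wrapped callable, so the real call can pass `sr=target_sample_rate` through unchanged.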

```python
self.tokenizer.init_imageitem_extral_params(img, multimodal_params, sampling_params)
data = img.read()
token_num = self.tokenizer.get_image_token_length(img)
md5sum = hashlib.md5(data).hexdigest() + "_" + str(hash(frozendict(img.extra_params)))
```

medium

Calculating the MD5 hash of image data is a CPU-bound operation that can block the event loop, especially for large images. Since the `preload` method already runs asynchronously, it would be more efficient to compute and store the MD5 hash on the `ImageItem` during preloading (similar to the existing implementation for `AudioItem`) and reuse it here, avoiding blocking the main server loop.
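One way the suggestion could look, as a minimal sketch: the `ImageItem` class below is an illustrative stand-in (its fields and methods are assumptions, not the project's actual API), showing the digest computed once off the event loop during preload and cached on the item.

```python
import asyncio
import hashlib


class ImageItem:
    """Illustrative stand-in for an image item (field and method names
    here are assumptions, not the project's actual API)."""

    def __init__(self, data: bytes, extra_params: dict):
        self._data = data
        self.extra_params = extra_params
        self.md5sum = None  # computed once during preload, reused later

    def read(self) -> bytes:
        return self._data

    async def preload(self) -> None:
        # Hash in a worker thread so the event loop stays responsive;
        # the digest is cached so the hot path never re-hashes the data.
        digest = await asyncio.to_thread(hashlib.md5, self._data)
        params_key = str(hash(tuple(sorted(self.extra_params.items()))))
        self.md5sum = digest.hexdigest() + "_" + params_key


img = ImageItem(b"pixels", {"size": 224})
asyncio.run(img.preload())
```

Request handling would then read `img.md5sum` directly instead of calling `hashlib.md5(data)` inline.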
