[PERF] Optimize LingBot-World-Fast chunk profiling and runtime paths #3
Merged
lzx1413 merged 8 commits on Jun 30, 2026
Collaborator: LGTM
Description
This PR adds finer-grained LingBot-World-Fast profiling controls and improves chunk generation performance by reducing avoidable layout conversions, removing eager Triton LayerNorm wrapper overhead, and avoiding CUDA scalar index synchronizations in the self-attention KV cache. It also exposes local attention window settings through the pipeline config so the runtime can size and use the self-KV cache according to the requested window.
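As a rough illustration of the profiling controls, the sketch below maps the three profiler flag names from this PR to `torch.profiler.profile` keyword arguments. It assumes the flags are read from environment variables and that values such as `1` or `true` enable them; the helper name `profiler_kwargs_from_env` and the parsing rule are hypothetical and not taken from the PR.

```python
import os

# Flag names come from this PR; treating them as environment variables
# and the truthy-value set below are assumptions made for illustration.
_FLAG_TO_KWARG = {
    "TELEFUSER_TORCH_PROFILER_RECORD_SHAPES": "record_shapes",
    "TELEFUSER_TORCH_PROFILER_PROFILE_MEMORY": "profile_memory",
    "TELEFUSER_TORCH_PROFILER_WITH_STACK": "with_stack",
}
_TRUTHY = {"1", "true", "yes", "on"}


def profiler_kwargs_from_env(env=None):
    """Return keyword arguments for torch.profiler.profile, all off by default."""
    env = os.environ if env is None else env
    return {
        kwarg: env.get(flag, "").strip().lower() in _TRUTHY
        for flag, kwarg in _FLAG_TO_KWARG.items()
    }
```

The resulting dict could be passed as `torch.profiler.profile(**profiler_kwargs_from_env())`, so the expensive options stay off unless they are explicitly requested.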
Motivation
Profiling showed several costs that were either hard to attribute or avoidable in the current LingBot-World-Fast runtime:
- Torch profiler options (`record_shapes`, `profile_memory`, `with_stack`) can add high overhead and distort pipeline-level traces.
- `CausalConv3d` and DiT patch embedding receive NCDHW inputs while cuDNN Conv3d kernels prefer `channels_last_3d`; on current PyTorch/cuDNN this does not fall back to `slow_conv_dilated3d`, but it still triggers repeated implicit NCHW/NHWC layout transforms.
- KV cache index handling performs `.item()` reads from CUDA tensors, introducing device-to-host synchronization.
Type of Change
Changes Made
Profiling controls and trace ranges
- Added profiler flags:
  - `TELEFUSER_TORCH_PROFILER_RECORD_SHAPES`
  - `TELEFUSER_TORCH_PROFILER_PROFILE_MEMORY`
  - `TELEFUSER_TORCH_PROFILER_WITH_STACK`
- Added `ProfilingContext4Debug` ranges for `workloop`, `create_runtime`, `generate_next_chunk`, `denoise_chunk`, `kv_cache_update_forward`, and `vae_decode`.
LingBot-World-Fast local attention configuration
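- Added `local_attn_size` and `sink_size` to `LingBotWorldFastPipelineConfig`.
- Passed both settings through to `LingBotWorldFastDiT.from_pretrained`.
- Sized the self-KV cache from the requested window when `local_attn_size > -1`.
Conv3d layout optimization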
- Convert the DiT input to `torch.channels_last_3d` before `self.patch_embedding`.
- Convert `WanVideoVAE.CausalConv3d` input to `torch.channels_last_3d` before `nn.Conv3d.forward`.
- This is not a `slow_conv_dilated3d` avoidance in the current environment. With torch `2.12.1+cu130` and cuDNN `9.20`, baseline already uses cuDNN Conv3d. The current gain comes from trading one explicit DtoD layout copy for many fewer implicit cuDNN NCHW/NHWC transforms.
LayerNorm eager path
- Route `LayerNorm.forward_cuda` to the native PyTorch implementation in eager mode.
- Avoid `torch.library.wrap_triton` / HOP / Triton autotuner wrapper cost on small LayerNorm kernels.
KV cache index synchronization
- Store `global_end_index` and `local_end_index` as host Python `int` values instead of CUDA tensors.
- Remove `.item()`-driven DtoH syncs in `CausalSelfAttention.forward`.
Packaging/import robustness
- Fall back to `__version__ = "0.0.0+unknown"` when `telefuser._version` is absent in a source checkout.
Tests added
- `tests/unit/utils/test_profiler_flags.py`
Testing
Result: passed, 2 tests.
Checklist
- Lint (`ruff`)
- Pre-commit hooks (`pre-commit run --all-files`)
- Tests (`pytest tests/`)
- PR title format: `[TYPE] Brief description`
Related Issues
N/A
Additional Notes
- An earlier environment with torch `2.9.1` + cuDNN `9.10` could route bf16/fp16 5D Conv3d to `aten::slow_conv_dilated3d`. The current benchmark environment is different: torch `2.12.1+cu130`, CUDA `13.0`, cuDNN `9.20`, H100. In this environment, the baseline no longer goes through `slow_conv_dilated3d`; the observed VAE win is from reducing implicit layout transforms.
GPU Architecture Support
No new custom CUDA/Triton kernels are added. Runtime measurements in this draft were collected on NVIDIA H100 GPUs. The code changes use PyTorch memory-format and native operator paths, so there is no new architecture-specific kernel support matrix to validate.
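The memory-format path can be sketched as follows. This is a minimal CPU-runnable example, not the PR's code: the helper name `conv3d_channels_last` and the tensor sizes are assumptions, and the performance effect described in this PR applies to cuDNN on GPU, not to this CPU run.

```python
import torch
import torch.nn as nn


def conv3d_channels_last(conv: nn.Conv3d, x: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: one explicit layout copy, then the regular Conv3d call."""
    # No copy is made if x is already in channels_last_3d layout.
    x = x.contiguous(memory_format=torch.channels_last_3d)
    return conv(x)


torch.manual_seed(0)
conv = nn.Conv3d(in_channels=4, out_channels=8, kernel_size=3, padding=1)
x = torch.randn(2, 4, 3, 8, 8)  # NCDHW input in the default contiguous layout

y_ref = conv(x)                       # baseline call
y_cl = conv3d_channels_last(conv, x)  # explicit layout conversion first

# The logical shape and values of the input are unchanged; only strides differ.
x_cl = x.contiguous(memory_format=torch.channels_last_3d)
print(x.stride(), x_cl.stride())
```

The conversion changes only the physical layout, so the two calls should agree numerically up to floating-point tolerance.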
Performance Impact
Primary no-profiler benchmark:
- Configuration: `0`, `3`, `frame_num=201`, `chunk_size=3`, `local_attn_size=21`, `sink_size=3`, `max_area=399360`, `--no-write-video`
- Environment: torch `2.12.1+cu130`, CUDA `13.0`, cuDNN `9.20`
- Metrics: `generate_next_chunk_seconds.mean`, `denoise_seconds.mean`, `update_cache_seconds.mean`, `decode_seconds.mean`, `total_seconds.mean`
At this resolution/config, decode accounts for roughly 86% of the steady-state `generate_next_chunk` improvement.
Profiler trace attribution:
- Configuration: `0`, `3`, `frame_num=89`, `chunk_size=3`, `local_attn_size=21`, `sink_size=3`, `max_area=99840`
- Ranges: `create_runtime`, `generate_next_chunk`
`total_seconds` is dominated by profiler overhead. Use stage timing and GPU kernel attribution, not end-to-end profiler wall time.
Profiler timing summary:
- Metrics: `generate_next_chunk_seconds.mean`, `denoise_seconds.mean`, `update_cache_seconds.mean`, `decode_seconds.mean`
DiT GPU attribution from `analyze_telefuser_dit_profile.py`:
The DiT GPU kernel time is not the source of the current speedup in this trace.
VAE/GPU0 copy-layout attribution:
- `torch_direct_copy_kernel`
- `cudnn_nchw_to_nhwc`
- `cudnn_nhwc_to_nchw`
- `Memcpy DtoD`
Interpretation: the explicit `x.contiguous(memory_format=torch.channels_last_3d)` increases visible `Memcpy DtoD`, but it removes more implicit cuDNN NCHW/NHWC transforms and direct-copy work. This is why the modified branch can show higher `Memcpy DtoD` while still reducing total VAE decode time.
Supplemental Trace Figures
The following profiler screenshots are included as supplementary evidence. The numeric benchmark tables above remain the source of truth for this PR.
Historical VAE Conv3d trace from the earlier PyTorch/cuDNN environment. This explains why the `channels_last_3d` change was originally investigated. It should not be read as the current torch `2.12.1+cu130` behavior, where baseline no longer falls back to `slow_conv_dilated3d`.
Profiler-side `generate_next_chunk` / VAE timing view, showing the stage-level before/after context that motivated the VAE decode attribution.
LayerNorm profiler hotspot. The linked GPU kernel is only tens of microseconds, while the visible range is dominated by eager `wrap_triton` / HOP / autotuner wrapper cost.
LayerNorm fix context: route eager execution to the native PyTorch implementation instead of the small Triton wrapper path.
KV cache index trace showing DtoH synchronization from reading CUDA scalar index tensors with `.item()`. The PR changes those indices to Python `int` values in the runtime path.
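The difference can be illustrated with a small sketch. The `KVCache` class, its shapes, and the `append` method are hypothetical; only the idea of keeping `global_end_index` and `local_end_index` as Python `int` values comes from the PR. The example runs on CPU, where `.item()` is cheap; the synchronization cost only appears when the index tensor lives on a CUDA device.

```python
import torch


class KVCache:
    """Toy self-attention KV cache whose write positions are host-side ints."""

    def __init__(self, capacity: int, dim: int, device: str = "cpu"):
        self.k = torch.zeros(capacity, dim, device=device)
        self.v = torch.zeros(capacity, dim, device=device)
        # Plain Python ints: reading them never waits on the GPU.
        self.global_end_index = 0  # tokens seen so far
        self.local_end_index = 0   # next write position in the buffer

    def append(self, k_new: torch.Tensor, v_new: torch.Tensor) -> None:
        n = k_new.shape[0]
        start, end = self.local_end_index, self.local_end_index + n
        if end > self.k.shape[0]:
            raise ValueError("cache capacity exceeded")
        # Slicing with ints only enqueues device work. If the indices were
        # CUDA tensors, computing these bounds would need `.item()`, which
        # blocks the host until the device catches up.
        self.k[start:end] = k_new
        self.v[start:end] = v_new
        self.local_end_index = end
        self.global_end_index += n
```

Keeping the counters on the host means the bookkeeping is done in Python arithmetic, and the device only sees the resulting slice writes.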