
[Feature] Support FP4 communication quantization and dense block_wise_fp8 and moe nvfp4#7817

Open
lizexu123 wants to merge 36 commits into PaddlePaddle:develop from lizexu123:kkc

Collaborator

@lizexu123 lizexu123 commented May 14, 2026

Motivation

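1. Fix a bug when running FP4 on eb5: audio_token_num is None, which leads to a `NoneType > 0` comparison. Also fix loading of the eb5 flagship model.
Support FP4 communication quantization, using hidden_size = 7168 as the example.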

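2. FastDeploy currently captures CUDA Graphs at whole-model level, which is coarse-grained and limits flexibility. This PR introduces a block-wise CUDA Graph mechanism that captures and replays CUDA Graphs independently at the level of a single operator or layer (such as Linear or RMSNorm). This allows finer-grained graph optimization and improves inference performance in the prefill stage.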

3. Support a configuration that combines block_wise_fp8 dense online quantization with nvfp4 offline quantization:
--quantization '{"quantization": "mix_quant", "dense_quant_type":"block_wise_fp8", "is_moe_quantized":true,"moe_quant_type":"modelopt_fp4"}' \

4. Support FP4 DeepEP communication.
To enable FP4 communication quantization: export FD_DISPATCH_USE_FP4=1
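To make the hidden_size = 7168 example concrete, here is a back-of-the-envelope sketch of the per-token dispatch payload. It assumes the usual NVFP4 layout (4-bit values plus one 1-byte FP8 scale per 16 elements) and, for comparison, FP8 with one float32 scale per 128 elements. The helper name and the FP8 block size are illustrative assumptions, not taken from this PR, and the real wire format may differ.

```python
def bytes_per_token(hidden_size: int, bits_per_value: int,
                    block_size: int = 0, scale_bytes: int = 0) -> int:
    """Payload for one token: packed values plus optional per-block scales."""
    value_bytes = hidden_size * bits_per_value // 8
    scales = (hidden_size // block_size) * scale_bytes if block_size else 0
    return value_bytes + scales


hidden_size = 7168
bf16 = bytes_per_token(hidden_size, 16)  # plain BF16, no scales
# Assumed layouts (not confirmed by the PR):
fp8 = bytes_per_token(hidden_size, 8, block_size=128, scale_bytes=4)
nvfp4 = bytes_per_token(hidden_size, 4, block_size=16, scale_bytes=1)

print(f"BF16:  {bf16} bytes per token")
print(f"FP8:   {fp8} bytes per token")
print(f"NVFP4: {nvfp4} bytes per token")
print(f"BF16 / NVFP4 = {bf16 / nvfp4:.2f}, FP8 / NVFP4 = {fp8 / nvfp4:.2f}")
```

Under these assumptions the scales add a noticeable overhead to the FP4 payload, so the achievable reduction depends on which baseline format the dispatch path would otherwise send.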

# Enable block-wise CUDA Graph
export FD_USE_BLOCK_WISE_CUDA_GRAPH=1

# Customize the token counts to pre-capture (optional)
export FD_BLOCK_WISE_CUDA_GRAPH_SIZES="1,2,4,8,16,32,64,128,256,512,1024,2048"

# To check which linear layers enter the CUDA graph during prefill
export FD_BLOCK_WISE_DEBUG=1

# Enable FP4 communication quantization
export FD_USE_NVFP4_COMM_QUANT=1
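As an illustration of how a size list like FD_BLOCK_WISE_CUDA_GRAPH_SIZES could be consumed, the sketch below parses the comma-separated value and picks the smallest pre-captured size that fits a given token count. The helper names and the "round up to the next captured size, otherwise fall back" policy are assumptions made for illustration; the PR page does not show how FastDeploy actually uses the variable.

```python
import bisect
import os


def parse_capture_sizes(raw: str) -> list[int]:
    """Parse a comma-separated size list into sorted, unique positive ints."""
    sizes = {int(tok) for tok in raw.split(",") if tok.strip()}
    if any(s <= 0 for s in sizes):
        raise ValueError(f"capture sizes must be positive: {raw!r}")
    return sorted(sizes)


def pick_capture_size(num_tokens: int, sizes: list[int]):
    """Return the smallest captured size >= num_tokens, or None if none fits."""
    idx = bisect.bisect_left(sizes, num_tokens)
    return sizes[idx] if idx < len(sizes) else None


DEFAULT_SIZES = "1,2,4,8,16,32,64,128,256,512,1024,2048"
sizes = parse_capture_sizes(
    os.environ.get("FD_BLOCK_WISE_CUDA_GRAPH_SIZES", DEFAULT_SIZES)
)

for n in (1, 100, 2048, 3000):
    # None means no captured graph fits; a caller would run eagerly instead.
    print(n, "->", pick_capture_size(n, sizes))
```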

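The prefill stage can now run inside a CUDA graph, which reduces the gaps between kernels, as shown in the figures below.
image
The figure above shows the gaps before this change.
image
After the optimization there are essentially no gaps.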

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)


Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.


paddle-bot Bot commented May 14, 2026

Thanks for your contribution!

@lizexu123 lizexu123 changed the title from "Kkc" to "[Feature] Support FP4 communication quantization and block_wise_cuda_graph" May 14, 2026

PaddlePaddle-bot commented May 14, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-20 17:08:12

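This CI report was generated from the following code (updated every 30 minutes):


1 Task Overview

Required tasks are not yet complete: 1 Required task needs manual approval and 1 Required task is still running (Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage). Hold off on merging until the approval is done and the main test task finishes. The 3 Optional failures are for reference only.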

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
|---|---|---|---|---|---|---|
| 41 (0) | 41 | 35 | 4 | 1 | 1 | 0 |

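2 Task Status Summary

Note on the Log column: failed tasks link directly to the logs pre-generated by the tool; running tasks link to the corresponding Workflow.

2.1 Required tasks: 8/10 passed

Required tasks block merging; failures must be handled first.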

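| Status | Task | Duration | Root cause | Suggested fix | Log | Rerun |
|---|---|---|---|---|---|---|
| | Approval | 7s | Approval required: awaiting manual approval | Complete the manual approval | Job | - |
| | Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage | - | Running | Wait for the task to finish | Workflow | - |
| | The other 8 required tasks passed | - | - | - | - | - |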

2.2 Optional tasks: 27/31 passed

Optional tasks do not block merging; failures are for reference only.

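| Status | Task | Duration | Log | Rerun |
|---|---|---|---|---|
| | Run iluvatar Tests / run_iluvatar_cases | 2m44s | Job | - |
| | Check PR Template | 11s | Job | - |
| | Trigger Jenkins for PR | 8m42s | Job | - |
| ⏸️ | CI_HPU | - | - | - |
| | The other 27 optional tasks passed | - | - | - |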

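3 Failure Details (required only)

Approval: manual approval required (confidence: high)

This job requires manual approval; CI continues only after the approval is complete.

  • Root cause summary: the Approval workflow has not been manually approved, which blocks the Required job.
  • Suggested fix: complete the approval under PR Checks / GitHub Actions, then let CI continue.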

PaddlePaddle-bot

This comment was marked as outdated.


codecov-commenter commented May 14, 2026

Codecov Report

❌ Patch coverage is 28.57143% with 35 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@d54e207). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...deploy/model_executor/layers/quantization/nvfp4.py | 0.00% | 16 Missing ⚠️ |
| ...loy/model_executor/layers/quantization/__init__.py | 16.66% | 8 Missing and 2 partials ⚠️ |
| ...oy/model_executor/layers/quantization/mix_quant.py | 33.33% | 5 Missing and 1 partial ⚠️ |
| fastdeploy/model_executor/utils.py | 81.81% | 1 Missing and 1 partial ⚠️ |
| ..._executor/layers/moe/fused_moe_deepgemm_backend.py | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7817   +/-   ##
==========================================
  Coverage           ?   63.26%           
==========================================
  Files              ?      462           
  Lines              ?    64564           
  Branches           ?     9936           
==========================================
  Hits               ?    40847           
  Misses             ?    20943           
  Partials           ?     2774           
| Flag | Coverage Δ |
|---|---|
| GPU | 72.51% <28.57%> (?) |
| XPU | 7.12% <0.00%> (?) |

Flags with carried forward coverage won't be shown. Click here to find out more.


@lizexu123 lizexu123 changed the title from "[Feature] Support FP4 communication quantization and block_wise_cuda_graph" to "[Feature] Support FP4 communication quantization and dense block_wise_fp8 and moe nvfp4" May 15, 2026

zhoutianzi666 previously approved these changes May 20, 2026
@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 Paddle-CI-Agent | pr_review | 2026-05-20 16:02:41

📋 Review Summary

PR overview: adds FP4 communication quantization (DeepEP prefill dispatch) and a mixed block_wise_fp8 + modelopt_fp4 quantization configuration, and fixes a loading bug for the eb5 flagship model.
Change scope: custom_ops/gpu_ops/moe/, fastdeploy/model_executor/layers/quantization/, fastdeploy/model_executor/utils.py, fastdeploy/envs.py, docs/
Impact tags: [Quantization] [OP] [Feature]

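Issues

| Level | File | Summary |
|---|---|---|
| ❓ Question | fastdeploy/model_executor/layers/quantization/nvfp4.py:432 | Three dtype assertions in apply() were silently removed; the reason is unclear |
| 🟡 Suggestion | fastdeploy/model_executor/layers/quantization/nvfp4.py:670 | fc1_latent_proj / fc2_latent_proj were added to three signatures but are never used in the function bodies (dead parameters) |
| 🟡 Suggestion | tests/operators/test_permute_prefill_masked_gemm.py:49 | The new make_scale_interleaved=True path has no operator unit test coverage |
| 📝 PR convention | PR description | ## Modifications is empty; ## Accuracy Tests is empty; FD_USE_NVFP4_COMM_QUANT in the PR description's bash block does not match FD_DISPATCH_USE_FP4 in the code |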

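📝 PR Convention Check

  1. Title: [Feature] is a valid official tag; the format is compliant and needs no change.
  2. Description structure: the ## Modifications section is empty; ## Accuracy Tests contains only the template comment and no actual content; none of the Checklist items are checked.
  3. Env var naming mismatch: the bash block in the PR description uses export FD_USE_NVFP4_COMM_QUANT=1, but envs.py and the code both use FD_DISPATCH_USE_FP4, so users who follow the PR description will see no effect.

Suggested PR description (can be copied directly):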

## Motivation
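1. Fix the bug where `audio_token_num` is `None` when loading the eb5 flagship model, which leads to a `NoneType > 0` comparison, and support the loading flow in which mix_quant overrides an offline NVFP4 checkpoint.
2. Support FP4 communication quantization (DeepEP prefill dispatch): activations are pre-quantized to FP4 before dispatch, reducing communication volume by about 2x vs BF16.
3. Support a mixed configuration (mix_quant) of block_wise_fp8 dense online quantization + modelopt_fp4 MoE offline quantization.
4. Introduce a block-wise CUDA Graph mechanism that captures and replays graphs independently at operator level in the prefill stage, reducing gaps between kernels.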

## Modifications
- `custom_ops/gpu_ops/moe/prefill_permute_to_masked_gemm.cu`: adds a `MAKE_SCALE_INTERLEAVED` template parameter so the FP8 scale can be written directly in the flashinfer cutedsl swizzled layout; adds a UINT8 dtype dispatch branch.
- `custom_ops/gpu_ops/cpp_extensions.cc`: updates the `PrefillPermuteToMaskedGemm` signature to match, adding a `make_scale_interleaved: bool` parameter (defaults to false for backward compatibility).
- `fastdeploy/envs.py`: adds the `FD_DISPATCH_USE_FP4` environment variable, which controls whether prefill dispatch uses FP4 communication quantization.
- `fastdeploy/model_executor/layers/quantization/nvfp4.py`: `apply_ep_prefill` gains an FP4 pre-quantize dispatch path; `call_prefill_permute_to_masked_gemm` gains a pass-through `make_scale_interleaved` parameter.
- `fastdeploy/model_executor/layers/quantization/mix_quant.py`: adds a `moe_quant_config` parameter and a `_build_moe_sub_config` method to support mixing MoE offline FP4 with dense online quantization.
- `fastdeploy/model_executor/layers/quantization/__init__.py`: adds a `mix_quant_overrides_nvfp4` branch for the case where CLI mix_quant overrides the checkpoint's NVFP4 config.
- `fastdeploy/model_executor/utils.py`: `process_weight_transpose` adds a guard for uninitialized weights; in the hybrid mix_quant case, the incremental hook for online-quantized layers is skipped and deferred to `process_final_after_loading`.
- `docs/`: updates the Chinese and English nvfp4 docs with configuration notes for FP4 communication quantization and block-wise CUDA graph.
- `tests/`: adds the `TestFlashInferCuteDSLMoEHelpers` and `TestFlashInferCuteDSLMoEMasked` test classes, covering the pre-quantized FP4 path and the standard BF16 path.

## Usage or Command
```bash
# FP4 communication quantization (DeepEP prefill dispatch)
export FD_DISPATCH_USE_FP4=1

# Block-wise CUDA Graph (prefill stage)
export FD_USE_BLOCK_WISE_CUDA_GRAPH=1
export FD_BLOCK_WISE_CUDA_GRAPH_SIZES="1,2,4,8,16,32,64,128,256,512,1024,2048"

# Launch example for mixed quantization (dense block_wise_fp8 + MoE modelopt_fp4)
python -m fastdeploy.entrypoints.openai.multi_api_server \
       --quantization '{"quantization": "mix_quant", "dense_quant_type":"block_wise_fp8", "is_moe_quantized":true,"moe_quant_type":"modelopt_fp4"}'
```
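Because the `--quantization` value is a JSON string, hand-editing it on a command line is easy to get wrong (quoting, `true` vs `True`). A minimal sketch of building the same value from Python and quoting it for the shell follows; the keys are the ones shown above, while the surrounding script is only an illustration.

```python
import json
import shlex

# Same keys and values as the command-line example above.
quant_config = {
    "quantization": "mix_quant",
    "dense_quant_type": "block_wise_fp8",
    "is_moe_quantized": True,  # serialized as JSON true
    "moe_quant_type": "modelopt_fp4",
}

# Serialize to JSON, then quote so the shell passes it as one argument.
arg = shlex.quote(json.dumps(quant_config))
print(f"--quantization {arg}")

# Round-trip check: what the shell would hand over parses back to the dict.
assert json.loads(shlex.split(arg)[0]) == quant_config
```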

## Accuracy Tests
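N/A (this PR mainly changes communication quantization and the weight loading flow. The accuracy delta of FP4 communication quantization vs. the BF16 baseline is still to be added; accuracy tests for the mixed block_wise_fp8 dense + modelopt_fp4 MoE path are still to be added.)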

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

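Overall Assessment

This PR provides a fairly complete implementation of FP4 communication quantization, the mixed MoE-offline + dense-online quantization configuration, and weight loading compatibility, and the new unit tests cover the main paths. Two points need attention: the silent removal of the 3 dtype assertions (add a conditional guard or an explanation) and cleanup of the dead fc1/fc2_latent_proj parameters. The env var name in the PR description does not match the code and must be corrected before merging.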

assert layer.weight_scale_interleaved.dtype == paddle.float8_e4m3fn
assert layer.alpha.dtype == paddle.float32

if self.backend.startswith("flashinfer-"):

❓ Question: three dtype assertions in apply() were silently removed

Original assertions:

assert layer.weight.dtype == paddle.uint8
assert layer.weight_scale_interleaved.dtype == paddle.float8_e4m3fn
assert layer.alpha.dtype == paddle.float32

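They were removed with no comment explaining why. These assertions are the first line of defense for dtypes between loading and runtime. With them silently removed, a wrong weight dtype on the MixQuant hybrid path will produce NaN directly, with no warning.

Suggestion: if they were removed because the weight dtype may not be uint8 on the hybrid mix_quant path (for example, dense layers going through block_wise_fp8), add a conditional guard: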

# Only assert for pure offline-FP4 path
if not isinstance(self.quant_config, MixQuantConfig):
    assert layer.weight.dtype == paddle.uint8
    assert layer.weight_scale_interleaved.dtype == paddle.float8_e4m3fn
    assert layer.alpha.dtype == paddle.float32

Alternatively, explain in the PR description why removing them is safe.
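One possible variant of the guard is an explicit check that raises a descriptive error instead of a bare `assert` (assertions are stripped when Python runs with `-O`). The sketch below is hypothetical: `check_fp4_dtypes` and the stand-in objects are not part of FastDeploy, and the dtypes are represented as strings so the example runs without Paddle. Real code would put `paddle.uint8` and friends in the expected mapping.

```python
from types import SimpleNamespace

# Expected dtypes for the pure offline-FP4 path, mirroring the removed
# assertions. Strings stand in for Paddle dtype objects in this sketch.
EXPECTED_DTYPES = {
    "weight": "uint8",
    "weight_scale_interleaved": "float8_e4m3fn",
    "alpha": "float32",
}


def check_fp4_dtypes(layer, expected=EXPECTED_DTYPES):
    """Raise TypeError listing every tensor whose dtype differs from expected."""
    mismatches = []
    for name, want in expected.items():
        got = getattr(layer, name).dtype
        if got != want:
            mismatches.append(f"{name}: expected {want}, got {got}")
    if mismatches:
        raise TypeError("NVFP4 dtype check failed: " + "; ".join(mismatches))


def fake_tensor(dtype):
    """Hypothetical stand-in for a tensor; only the .dtype attribute matters."""
    return SimpleNamespace(dtype=dtype)


good_layer = SimpleNamespace(
    weight=fake_tensor("uint8"),
    weight_scale_interleaved=fake_tensor("float8_e4m3fn"),
    alpha=fake_tensor("float32"),
)
check_fp4_dtypes(good_layer)  # returns None when every dtype matches
```

A caller on the hybrid path could skip the check, or pass a different `expected` mapping, rather than dropping validation entirely.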

topk_ids_hookfunc: Callable = None,
shared_experts: nn.Layer = None,
fc1_latent_proj: nn.Layer = None,
fc2_latent_proj: nn.Layer = None,

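🟡 Suggestion: the fc1_latent_proj / fc2_latent_proj parameters were added to the signatures of apply_ep_prefill, apply_ep_decode, and apply_ep_decode_fwd, but none of the three function bodies uses them.

This is dead code (dead parameters): it confuses callers, and its semantics cannot be verified through static analysis.

Suggestions:

  • If this is a reserved interface (for a later MLA or latent proj path), add a # TODO: reserved for future MLA path comment and explain it in the PR description;
  • If the current implementation does not need them, leave them out for now and introduce them when they are actually used, keeping the interface clean.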
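Parameters that are declared but never read can be flagged mechanically. The sketch below uses Python's standard `ast` module to list them; the example function is a simplified stand-in, not the real `apply_ep_prefill` signature, and the check is deliberately naive (it ignores indirect uses such as `locals()`).

```python
import ast
import textwrap


def unused_parameters(source: str) -> dict[str, list[str]]:
    """Map each function name to the parameters never read in its body."""
    tree = ast.parse(textwrap.dedent(source))
    report = {}
    for node in ast.walk(tree):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        args = node.args
        params = [a.arg for a in args.posonlyargs + args.args + args.kwonlyargs]
        if args.vararg:
            params.append(args.vararg.arg)
        if args.kwarg:
            params.append(args.kwarg.arg)
        # Names that are loaded (read) anywhere inside the function body.
        read = {
            n.id
            for stmt in node.body
            for n in ast.walk(stmt)
            if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Load)
        }
        unused = [p for p in params if p not in read and p not in ("self", "cls")]
        if unused:
            report[node.name] = unused
    return report


# Simplified, hypothetical stand-in for the reviewed signature.
example = '''
def apply_ep_prefill(self, layer, x, fc1_latent_proj=None, fc2_latent_proj=None):
    return layer(x)
'''
print(unused_parameters(example))
```

Existing linters offer similar checks; the point here is only that the dead parameters are detectable before review.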

topk_ids = topk_ids.cast(paddle.int64)

results = prefill_permute_to_masked_gemm(x, scale, topk_ids, num_local_experts, max_token_num)
results = prefill_permute_to_masked_gemm(x, scale, topk_ids, num_local_experts, max_token_num, False)

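🟡 Suggestion: the new make_scale_interleaved=True path has no operator unit test

Current tests cover only make_scale_interleaved=False (the existing path). The core addition of this PR is the interleaved scale write path for FP4 comm quant (a large block of new logic in the CUDA kernel). Add a unit test for make_scale_interleaved=True under tests/operators/ that verifies:

  1. The output tensor shape is correct ([E, M, hidden_scale] contiguous layout vs. the old transposed layout)
  2. The values after the interleaved write match a reference implementation (a CPU reference implementation can be used for comparison)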
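As a starting point for such a test, the sketch below shows the general shape of a CPU reference: scatter token rows into a zero-padded per-expert buffer and then compare shape and values against the operator output. Everything here is an assumption made for illustration: the row ordering, the handling of ids outside the local expert range, and the function name are hypothetical, and the swizzled scale layout itself is not modeled.

```python
def reference_permute(x, topk_ids, num_local_experts, max_token_num):
    """Scatter token rows into a zero-padded [E, M, H] buffer, one slab per expert."""
    hidden = len(x[0])
    out = [[[0.0] * hidden for _ in range(max_token_num)]
           for _ in range(num_local_experts)]
    counts = [0] * num_local_experts
    for token, experts in zip(x, topk_ids):
        for e in experts:
            if not 0 <= e < num_local_experts:
                continue  # assumed: ids outside the local range are skipped
            if counts[e] >= max_token_num:
                raise ValueError(f"expert {e} exceeds max_token_num")
            out[e][counts[e]] = list(token)
            counts[e] += 1
    return out, counts


# Three tokens, hidden size 2, top-2 routing over 2 local experts.
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
topk_ids = [[0, 1], [1, -1], [0, 1]]
out, counts = reference_permute(x, topk_ids, num_local_experts=2, max_token_num=4)

# Shape check: [E, M, H]
assert (len(out), len(out[0]), len(out[0][0])) == (2, 4, 2)
print(counts)
```

A real test would convert the operator output to host memory and compare it slab by slab against such a reference, using a tolerance appropriate for the quantized dtype.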


6 participants