[Feature] Support FP4 communication quantization and dense block_wise_fp8 and moe nvfp4 by lizexu123 · Pull Request #7817 · PaddlePaddle/FastDeploy

lizexu123 · 2026-05-14T09:21:01Z

Motivation

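1. Fix a bug when running FP4 on eb5: audio_token_num was None, which led to a NoneType > 0 comparison. Also fix loading of the eb5 flagship model.
Support FP4 communication quantization, using hidden_size = 7168 as the example.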
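The `NoneType > 0` failure in item 1 can be reproduced and guarded in a few lines. This is a minimal sketch, not the PR's actual fix: the helper name `has_audio_tokens` is hypothetical, and only the variable name `audio_token_num` comes from the description.

```python
from typing import Optional


def has_audio_tokens(audio_token_num: Optional[int]) -> bool:
    """Return True only when a positive audio token count is present."""
    # `None > 0` raises TypeError in Python 3, so test for None first.
    return audio_token_num is not None and audio_token_num > 0


# The unguarded comparison fails in the way item 1 describes.
try:
    None > 0  # type: ignore[operator]
except TypeError as exc:
    print(f"unguarded comparison failed: {exc}")

print(has_audio_tokens(None))
print(has_audio_tokens(0))
print(has_audio_tokens(3))
```

The `is not None` check short-circuits, so the numeric comparison is only evaluated when a count is actually present.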

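2. FastDeploy currently captures CUDA Graphs at the whole-model level, which is coarse-grained and limits flexibility. This PR introduces a block-wise CUDA Graph mechanism that captures and replays CUDA Graphs independently at the level of a single operator or layer (such as Linear or RMSNorm), enabling finer-grained graph optimization and improving inference performance in the prefill stage.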

3. Support a configuration that combines block_wise_fp8 online quantization for dense layers with nvfp4 offline quantization:
--quantization '{"quantization": "mix_quant", "dense_quant_type":"block_wise_fp8", "is_moe_quantized":true,"moe_quant_type":"modelopt_fp4"}' \
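The value passed to `--quantization` is a JSON object, so it can be checked before launch. The sketch below parses the exact string from item 3; the `describe` helper is hypothetical, and reading `is_moe_quantized: true` as "MoE weights are already quantized offline" is an assumption based on the wording of item 3, not on FastDeploy's parser.

```python
import json

# The JSON string passed to --quantization in item 3.
raw = (
    '{"quantization": "mix_quant", "dense_quant_type":"block_wise_fp8", '
    '"is_moe_quantized":true,"moe_quant_type":"modelopt_fp4"}'
)

cfg = json.loads(raw)


def describe(cfg: dict) -> str:
    """Hypothetical one-line summary of a mix_quant configuration."""
    dense = cfg.get("dense_quant_type", "none")
    moe = cfg.get("moe_quant_type", "none")
    # Assumption: is_moe_quantized=true means the checkpoint already holds quantized MoE weights.
    moe_mode = "offline" if cfg.get("is_moe_quantized") else "online"
    return f"{cfg['quantization']}: dense={dense} (online), moe={moe} ({moe_mode})"


print(describe(cfg))
```

Parsing the string locally catches quoting mistakes (for example `True` instead of `true`) before the server starts.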

4. Support FP4 DeepEP communication.
To enable FP4 communication quantization: export FD_DISPATCH_USE_FP4=1

# Enable block-wise CUDA Graph
export FD_USE_BLOCK_WISE_CUDA_GRAPH=1

# Custom list of pre-captured token counts (optional)
export FD_BLOCK_WISE_CUDA_GRAPH_SIZES="1,2,4,8,16,32,64,128,256,512,1024,2048"

# To check which prefill Linear layers enter the CUDA Graph
export FD_BLOCK_WISE_DEBUG=1

# Enable FP4 communication quantization
export FD_USE_NVFP4_COMM_QUANT=1
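A list of pre-captured sizes like the one in `FD_BLOCK_WISE_CUDA_GRAPH_SIZES` is commonly used by rounding the actual token count up to the nearest captured size. The sketch below illustrates that general pattern only; whether FastDeploy rounds up, pads, or falls back to eager execution for oversized inputs is not stated on this page, and `parse_sizes` and `pick_captured_size` are hypothetical names.

```python
import bisect
import os
from typing import List, Optional

# Example value taken from the export line above; not necessarily the built-in default.
EXAMPLE_SIZES = "1,2,4,8,16,32,64,128,256,512,1024,2048"


def parse_sizes(raw: str) -> List[int]:
    """Parse a comma-separated size list into sorted, de-duplicated ints."""
    return sorted({int(tok) for tok in raw.split(",") if tok.strip()})


def pick_captured_size(num_tokens: int, sizes: List[int]) -> Optional[int]:
    """Smallest pre-captured size that fits num_tokens, or None if none fits."""
    idx = bisect.bisect_left(sizes, num_tokens)
    return sizes[idx] if idx < len(sizes) else None


sizes = parse_sizes(os.environ.get("FD_BLOCK_WISE_CUDA_GRAPH_SIZES", EXAMPLE_SIZES))
for n in (1, 100, 2048, 5000):
    print(n, "->", pick_captured_size(n, sizes))
```

With the example list, a 100-token input maps to the 128 bucket, and anything above 2048 has no captured graph to replay.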

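Prefill can now run inside a CUDA Graph, which reduces the gaps between kernels, as the figures below show.

The figure above shows the gaps before this change.

After the optimization there are essentially no gaps.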

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)


Modifications

Usage or Command

Accuracy Tests

Checklist

Add at least a tag in the PR title.
- Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- You can add new tags based on the PR content, but the semantics must be clear.
Format your code, run pre-commit before commit.
Add unit tests. Please write the reason in this PR if no unit tests.
Provide accuracy results.
If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.


paddle-bot · 2026-05-14T09:21:08Z

Thanks for your contribution!

PaddlePaddle-bot · 2026-05-14T09:55:37Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-20 17:08:12

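The CI report is generated from the following code (refreshed every 30 minutes):

PR commit: 9654004
Merge base: d54e207 (branch: develop)
View full diff
CI details

1 Job overview

Required checks are not yet complete: 1 Required job needs manual Approval and 1 Required job is still running (Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage). Merging is not recommended yet; complete the approval and wait for the main test job to finish first. The 3 Optional failures are for reference only.

Total runs (reruns)	Total jobs	✅ Passed	❌ Failed	⏳ Running	⏸️ Waiting	Skipped
41(0)	41	35	4	1	1	0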

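2 Job status summary

Log column: failed jobs link directly to the log pre-generated by the tool; running jobs link to the corresponding Workflow.

2.1 Required jobs: 8/10 passed

Required jobs block merging; failures must be handled first.

Status	Job	Duration	Root cause	Suggested fix	Log	Rerun
❌	`Approval`	7s	Approval required: waiting for manual approval	Complete the manual approval	Job	-
⏳	`Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage`	-	Running	Wait for the job to finish	Workflow	-
✅	The other 8 required jobs passed	-	-	-	-	-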

2.2 Optional jobs: 27/31 passed

Optional jobs do not block merging; failures are for reference only.

Status	Job	Duration	Log	Rerun
❌	`Run iluvatar Tests / run_iluvatar_cases`	2m44s	Job	-
❌	`Check PR Template`	11s	Job	-
❌	`Trigger Jenkins for PR`	8m42s	Job	-
⏸️	`CI_HPU`	-	-	-
✅	The other 27 optional jobs passed	-	-	-

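3 Failure details (required only)

Approval: manual approval required (confidence: high)

This job requires manual Approval; CI continues only after the approval is complete.

Root cause summary: the Approval workflow has not been manually approved, so the Required job is blocked.
Suggested fix: complete the approval under PR Checks / GitHub Actions, then let CI continue.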

codecov-commenter · 2026-05-14T10:08:49Z

Codecov Report

❌ Patch coverage is 28.57143% with 35 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@d54e207). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
...deploy/model_executor/layers/quantization/nvfp4.py	0.00%	16 Missing ⚠️
...loy/model_executor/layers/quantization/__init__.py	16.66%	8 Missing and 2 partials ⚠️
...oy/model_executor/layers/quantization/mix_quant.py	33.33%	5 Missing and 1 partial ⚠️
fastdeploy/model_executor/utils.py	81.81%	1 Missing and 1 partial ⚠️
..._executor/layers/moe/fused_moe_deepgemm_backend.py	0.00%	1 Missing ⚠️
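The headline patch coverage can be cross-checked against the per-file rows. The sketch below is plain arithmetic on the numbers in the table; the per-file totals of changed lines are not reported by Codecov here and are inferred from each row's percentage and missing count, so treat them as assumptions. Partial lines are counted as not covered, which is what makes the rows add up to the 35 lines quoted above.

```python
# (file, lines missing coverage incl. partials, inferred total changed lines)
rows = [
    ("quantization/nvfp4.py", 16, 16),           # 0.00%
    ("quantization/__init__.py", 8 + 2, 12),     # 2/12  -> 16.66%
    ("quantization/mix_quant.py", 5 + 1, 9),     # 3/9   -> 33.33%
    ("model_executor/utils.py", 1 + 1, 11),      # 9/11  -> 81.81%
    ("fused_moe_deepgemm_backend.py", 1, 1),     # 0.00%
]

missing = sum(m for _, m, _ in rows)
total = sum(t for _, _, t in rows)
covered = total - missing

print(f"missing={missing} total={total} covered={covered}")
print(f"patch coverage = {100 * covered / total:.5f}%")
```

Under these inferred totals, 14 of 49 changed lines are covered, which matches the reported 28.57143%.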

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #7817   +/-   ##
==========================================
  Coverage           ?   63.26%           
==========================================
  Files              ?      462           
  Lines              ?    64564           
  Branches           ?     9936           
==========================================
  Hits               ?    40847           
  Misses             ?    20943           
  Partials           ?     2774

Flag	Coverage Δ
GPU	`72.51% <28.57%> (?)`
XPU	`7.12% <0.00%> (?)`


PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-05-20 16:02:41

📋 Review summary

PR overview: adds FP4 communication quantization (DeepEP prefill dispatch) and a mixed block_wise_fp8 + modelopt_fp4 quantization configuration, and fixes a loading bug for the eb5 flagship model.
Change scope: custom_ops/gpu_ops/moe/, fastdeploy/model_executor/layers/quantization/, fastdeploy/model_executor/utils.py, fastdeploy/envs.py, docs/
Impact tags: [Quantization] [OP] [Feature]

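Issues

Level	File	Summary
❓ Question	`fastdeploy/model_executor/layers/quantization/nvfp4.py:432`	Three dtype assertions in `apply()` were silently removed, with no stated reason
🟡 Suggestion	`fastdeploy/model_executor/layers/quantization/nvfp4.py:670`	`fc1_latent_proj` / `fc2_latent_proj` were added to three signatures but are never used in the function bodies (dead parameters)
🟡 Suggestion	`tests/operators/test_permute_prefill_masked_gemm.py:49`	The new `make_scale_interleaved=True` path has no operator unit test coverage
📝 PR conventions	PR description	`## Modifications` is empty; `## Accuracy Tests` is empty; the name `FD_USE_NVFP4_COMM_QUANT` in the PR description's bash block does not match `FD_DISPATCH_USE_FP4` used in the code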

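📝 PR convention check

Title: [Feature] is a valid official tag and the format is compliant; no change needed.
Description structure: the ## Modifications section is empty; ## Accuracy Tests contains only the template comment with no actual content; no Checklist item is ticked.
Env var naming mismatch: the bash block in the PR description uses export FD_USE_NVFP4_COMM_QUANT=1, but envs.py and the implementation both use FD_DISPATCH_USE_FP4, so users who follow the PR description will see no effect.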
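The effect of the name mismatch is easy to demonstrate: an unknown variable is simply never read. The sketch below assumes the flag is treated as enabled when it equals `"1"`; how `fastdeploy/envs.py` actually parses the value is not shown on this page, and `fp4_dispatch_enabled` is a hypothetical helper.

```python
import os
from typing import Mapping


def fp4_dispatch_enabled(environ: Mapping[str, str] = os.environ) -> bool:
    """Hypothetical reader for the flag named in the review."""
    # Assumption: "1" means enabled; anything else (or unset) means disabled.
    return environ.get("FD_DISPATCH_USE_FP4", "0") == "1"


# The name from the PR description's bash block leaves the feature off.
print(fp4_dispatch_enabled({"FD_USE_NVFP4_COMM_QUANT": "1"}))
# The name the review says the code reads turns it on.
print(fp4_dispatch_enabled({"FD_DISPATCH_USE_FP4": "1"}))
```

No error is raised in the first case, which is why a mismatch like this tends to go unnoticed until throughput numbers are compared.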

Suggested PR description (can be copied directly):

## Motivation
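1. Fix the `NoneType > 0` bug caused by `audio_token_num` being `None` when loading the eb5 flagship model, and support the loading flow in which mix_quant overrides an offline NVFP4 checkpoint.
2. Support FP4 communication quantization (DeepEP prefill dispatch): activations are pre-quantized to FP4 before dispatch, reducing communication volume by roughly 2x vs BF16.
3. Support the mixed configuration (mix_quant) of block_wise_fp8 online quantization for dense layers + modelopt_fp4 offline quantization for MoE.
4. Introduce a block-wise CUDA Graph mechanism that supports independent operator-level capture and replay in the prefill stage, reducing gaps between kernels.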

## Modifications
- `custom_ops/gpu_ops/moe/prefill_permute_to_masked_gemm.cu`: adds a `MAKE_SCALE_INTERLEAVED` template parameter that writes the FP8 scale directly into the flashinfer cutedsl swizzled layout; adds a UINT8 dtype dispatch branch.
- `custom_ops/gpu_ops/cpp_extensions.cc`: updates the `PrefillPermuteToMaskedGemm` signature to match, adding a `make_scale_interleaved: bool` parameter (default false, backward compatible).
- `fastdeploy/envs.py`: adds the `FD_DISPATCH_USE_FP4` environment variable, which controls whether prefill dispatch uses FP4 communication quantization.
- `fastdeploy/model_executor/layers/quantization/nvfp4.py`: `apply_ep_prefill` gains an FP4 pre-quantize dispatch path; `call_prefill_permute_to_masked_gemm` gains a pass-through `make_scale_interleaved` parameter.
- `fastdeploy/model_executor/layers/quantization/mix_quant.py`: adds a `moe_quant_config` parameter and a `_build_moe_sub_config` method to support the mixed MoE offline FP4 + dense online quant configuration.
- `fastdeploy/model_executor/layers/quantization/__init__.py`: adds a `mix_quant_overrides_nvfp4` branch to handle the case where CLI mix_quant overrides the checkpoint's NVFP4 config.
- `fastdeploy/model_executor/utils.py`: `process_weight_transpose` adds a guard for uninitialized weights; in the hybrid mix_quant case, the incremental hook for online layers is skipped and deferred to `process_final_after_loading`.
- `docs/`: updates the Chinese and English nvfp4 docs with configuration notes for FP4 communication quantization and block-wise CUDA graph.
- `tests/`: adds the `TestFlashInferCuteDSLMoEHelpers` and `TestFlashInferCuteDSLMoEMasked` test classes, covering the pre-quantized FP4 path and the standard BF16 path.

## Usage or Command
```bash
# FP4 communication quantization (DeepEP prefill dispatch)
export FD_DISPATCH_USE_FP4=1

# Block-wise CUDA Graph (prefill stage)
export FD_USE_BLOCK_WISE_CUDA_GRAPH=1
export FD_BLOCK_WISE_CUDA_GRAPH_SIZES="1,2,4,8,16,32,64,128,256,512,1024,2048"

# Mixed quantization (dense block_wise_fp8 + MoE modelopt_fp4) launch example
python -m fastdeploy.entrypoints.openai.multi_api_server \
       --quantization '{"quantization": "mix_quant", "dense_quant_type":"block_wise_fp8", "is_moe_quantized":true,"moe_quant_type":"modelopt_fp4"}'
```

## Accuracy Tests
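N/A (this PR mainly changes communication quantization and the weight loading flow. The accuracy delta of FP4 communication quantization vs the BF16 baseline is still to be added; accuracy tests for the mixed block_wise_fp8 dense + modelopt_fp4 MoE path are still to be added.)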

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

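Overall assessment

This PR provides a fairly complete implementation of FP4 communication quantization, the mixed MoE offline + dense online quantization configuration, and weight loading compatibility, and the new unit tests cover the main paths. Two points need attention: the silent removal of the three dtype assertions (add a conditional guard or an explanation), and cleanup of the dead fc1/fc2_latent_proj parameters. The env var name in the PR description does not match the implementation and must be corrected before merging.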

PaddlePaddle-bot · 2026-05-20T08:09:45Z

-        assert layer.weight_scale_interleaved.dtype == paddle.float8_e4m3fn
-        assert layer.alpha.dtype == paddle.float32
-
        if self.backend.startswith("flashinfer-"):


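❓ Question: three dtype assertions in apply() were silently removed

Original assertions:

    assert layer.weight.dtype == paddle.uint8
    assert layer.weight_scale_interleaved.dtype == paddle.float8_e4m3fn
    assert layer.alpha.dtype == paddle.float32

They were removed without a comment explaining why. These assertions are the first line of defense for dtypes between loading and runtime; with them silently removed, a wrong weight dtype on the MixQuant hybrid path will produce NaN directly, with no warning.

Suggestion: if they were removed because the weight dtype on the hybrid mix_quant path may not be uint8 (for example, dense layers going through block_wise_fp8), add a conditional guard:

    # Only assert for pure offline-FP4 path
    if not isinstance(self.quant_config, MixQuantConfig):
        assert layer.weight.dtype == paddle.uint8
        assert layer.weight_scale_interleaved.dtype == paddle.float8_e4m3fn
        assert layer.alpha.dtype == paddle.float32

Or explain in the PR description why the removal is safe.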
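One alternative to bare `assert` statements is an explicit check that names the offending attribute and is not stripped when Python runs with `-O`. This is a sketch of that swapped-in technique, not what the PR or the reviewer proposes: `check_dtypes` is a hypothetical helper, and the stand-in objects below use strings in place of paddle tensors and paddle dtypes so the example runs without Paddle installed.

```python
from types import SimpleNamespace
from typing import Any, Dict


def check_dtypes(layer: Any, expected: Dict[str, Any]) -> None:
    """Raise TypeError naming the first attribute whose dtype differs from expected."""
    for name, want in expected.items():
        got = getattr(layer, name).dtype
        if got != want:
            raise TypeError(f"{name}: expected dtype {want}, got {got}")


# Stand-ins only; in FastDeploy these would be paddle tensors and paddle dtypes.
EXPECTED = {
    "weight": "uint8",
    "weight_scale_interleaved": "float8_e4m3fn",
    "alpha": "float32",
}
good = SimpleNamespace(
    weight=SimpleNamespace(dtype="uint8"),
    weight_scale_interleaved=SimpleNamespace(dtype="float8_e4m3fn"),
    alpha=SimpleNamespace(dtype="float32"),
)
check_dtypes(good, EXPECTED)  # passes silently

# Same layer, but with the weight left in an unexpected dtype.
bad = SimpleNamespace(**{**vars(good), "weight": SimpleNamespace(dtype="bfloat16")})
try:
    check_dtypes(bad, EXPECTED)
except TypeError as exc:
    print(exc)
```

The same helper could be called only on the pure offline-FP4 path, mirroring the conditional guard suggested above.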

PaddlePaddle-bot · 2026-05-20T08:09:45Z

        topk_ids_hookfunc: Callable = None,
        shared_experts: nn.Layer = None,
+        fc1_latent_proj: nn.Layer = None,
+        fc2_latent_proj: nn.Layer = None,


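🟡 Suggestion: the fc1_latent_proj / fc2_latent_proj parameters were added to the signatures of apply_ep_prefill, apply_ep_decode, and apply_ep_decode_fwd, but none of the three function bodies uses them.

These are dead parameters: they confuse callers, and their semantics cannot be verified by static analysis.

Suggestions:

If they are reserved for a future interface (for a later MLA or latent proj path), add a # TODO: reserved for future MLA path comment and say so in the PR description;

If the current implementation does not need them, leave them out for now and introduce them when they are actually used, to keep the interface clean.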

PaddlePaddle-bot · 2026-05-20T08:09:45Z

        topk_ids = topk_ids.cast(paddle.int64)

-    results = prefill_permute_to_masked_gemm(x, scale, topk_ids, num_local_experts, max_token_num)
+    results = prefill_permute_to_masked_gemm(x, scale, topk_ids, num_local_experts, max_token_num, False)


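🟡 Suggestion: the new make_scale_interleaved=True path has no operator unit test

The current test covers only make_scale_interleaved=False (the existing path). The core addition in this PR is the interleaved scale write path for FP4 comm quant (a large block of new logic in the CUDA kernel). Consider adding a make_scale_interleaved=True unit test under tests/operators/ that verifies:

The output tensor shape is correct ([E, M, hidden_scale] contiguous layout vs the old transposed layout)

The values after the interleaved write match a reference implementation (a CPU reference implementation can be used for comparison)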

lonelygsh added 13 commits April 15, 2026 15:15

support eb5 fp4 cuda_graph

9f6c3c0

update

55d1a05

merge develop

3509714

Merge branch 'develop' of https://github.com/PaddlePaddle/FastDeploy …

deebd2a

…into kc_d

Support FP4 communication quantization

dd4118d

fix

3fdbc08

Merge branch 'develop' of https://github.com/PaddlePaddle/FastDeploy …

1226b27

…into kc_d

update

6c3cc4b

fix

e89dff7

support mix_quant and nvfp4

24d07c6

support prefill cuda_graph

19a7019

support fp4 communication quantization

842feba

support

141ac55

lizexu123 had a problem deploying to Metax_ci May 14, 2026 09:21 — with GitHub Actions Failure

lizexu123 changed the title ~~Kkc~~ [Feature] Support FP4 communication quantization and block_wise_cuda_graph May 14, 2026


fix

4c076ce

lizexu123 had a problem deploying to Metax_ci May 15, 2026 07:16 — with GitHub Actions Error

fix

b643683

lizexu123 temporarily deployed to Metax_ci May 15, 2026 07:17 — with GitHub Actions Inactive

lizexu123 changed the title ~~[Feature] Support FP4 communication quantization and block_wise_cuda_graph~~ [Feature] Support FP4 communication quantization and dense block_wise_fp8 and moe nvfp4 May 15, 2026


add test

2f4151c

lizexu123 had a problem deploying to Metax_ci May 15, 2026 08:30 — with GitHub Actions Failure

merge develop

22ac5a0

lizexu123 had a problem deploying to Metax_ci May 15, 2026 08:32 — with GitHub Actions Failure

update develop

8443d62

fix

7a58d12

lizexu123 had a problem deploying to Metax_ci May 19, 2026 03:46 — with GitHub Actions Error

revert helper.h to develop

aad041d

lizexu123 had a problem deploying to Metax_ci May 19, 2026 03:50 — with GitHub Actions Failure

Merge branch 'develop' of https://github.com/PaddlePaddle/FastDeploy …

fb82351

…into kkc

lizexu123 had a problem deploying to Metax_ci May 19, 2026 05:47 — with GitHub Actions Failure

Merge branch 'develop' of https://github.com/PaddlePaddle/FastDeploy …

e282850

…into kkc

lizexu123 temporarily deployed to Metax_ci May 19, 2026 07:33 — with GitHub Actions Inactive

make_scale_interleaved

a96f2a5

lizexu123 had a problem deploying to Metax_ci May 19, 2026 09:09 — with GitHub Actions Error

fix

e20676d

lizexu123 temporarily deployed to Metax_ci May 19, 2026 09:12 — with GitHub Actions Inactive


zhoutianzi666 previously approved these changes May 20, 2026


fix

5284879

lizexu123 dismissed zhoutianzi666’s stale review via 5284879 May 20, 2026 03:50

lizexu123 had a problem deploying to Metax_ci May 20, 2026 03:50 — with GitHub Actions Failure


zhoutianzi666 previously approved these changes May 20, 2026


Merge remote-tracking branch 'origin/develop' into kkc

3a80481

lizexu123 had a problem deploying to Metax_ci May 20, 2026 07:42 — with GitHub Actions Error

fix

c22d3ea

lizexu123 dismissed zhoutianzi666’s stale review via c22d3ea May 20, 2026 07:44

lizexu123 had a problem deploying to Metax_ci May 20, 2026 07:44 — with GitHub Actions Error

lizexu123 added 2 commits May 20, 2026 15:46

fix

7eb643b

fix

9654004

lizexu123 had a problem deploying to Metax_ci May 20, 2026 07:47 — with GitHub Actions Error

lizexu123 had a problem deploying to Metax_ci May 20, 2026 07:48 — with GitHub Actions Failure

PaddlePaddle-bot reviewed May 20, 2026


zhoutianzi666 approved these changes May 20, 2026


lizexu123 wants to merge 36 commits into PaddlePaddle:develop from lizexu123:kkc

6 participants
