[PD] PD send cache via storage & Refine swap_cache_layout op by juncaipeng · Pull Request #7839 · PaddlePaddle/FastDeploy

juncaipeng · 2026-05-17T01:07:41Z

Motivation

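In PD disaggregation, the original design has the P instance send the KV cache directly to the D instance's GPU over RDMA/IPC, which couples the two instances tightly. This PR adds a Storage Pool mode (FD_PD_TRANSFER_VIA_STORAGE=1): P writes the KV cache to a global storage pool (Mooncake) and D reads it back from storage, which removes the direct P/D connection dependency and makes deployment more flexible. The PR also refactors the swap_cache_layout CUDA operator: when the CPU block IDs are contiguous, the transfer is merged into a single DMA plus a scatter kernel, reducing PCIe transfer overhead.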

Modifications

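fastdeploy/envs.py: adds FD_PD_TRANSFER_VIA_STORAGE (0 = direct transfer, 1 = storage pool)
fastdeploy/cache_manager/prefix_cache_manager.py: adds write_all_cache_to_storage() (supports a trailing incomplete block, whose key gets a :partial:N suffix) and read_cache_from_storage_for_pd()
fastdeploy/cache_manager/cache_messager.py: in storage pool mode, skips the RDMA transfer and marks the task finished directly
fastdeploy/output/token_processor.py: on the P side, writes the cache to storage right after prefill finishes, then sends first_token
fastdeploy/engine/common_engine.py: on the D side, reads the cache from storage after receiving first_token; returns 502 on failure
fastdeploy/engine/common_engine_prepare_mixin.py: storage pool mode skips send_cache_info_to_messager
fastdeploy/engine/sched/resource_manager_v1.py: in storage pool mode, only the D side writes the cache
fastdeploy/model_executor/layers/attention/*.py: storage pool mode skips kv_signal initialization
custom_ops/gpu_ops/swap_cache_layout.cu: refactors the operator; contiguous blocks take the optimized path (D2H kernel / H2D staging + scatter)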
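
The contiguous-block fast path described for swap_cache_layout.cu depends on detecting that the CPU block IDs form a consecutive run. The sketch below illustrates only that dispatch decision in Python; the function names and the returned path labels are hypothetical and are not taken from the PR, whose actual check lives in the CUDA operator.

```python
from typing import Sequence


def is_contiguous(block_ids: Sequence[int]) -> bool:
    """Return True when the IDs form an ascending run with step 1."""
    return all(b - a == 1 for a, b in zip(block_ids, block_ids[1:]))


def plan_swap(cpu_block_ids: Sequence[int]) -> str:
    # Hypothetical dispatcher: contiguous CPU blocks can be moved with one
    # large copy and then scattered on the GPU; anything else falls back
    # to copying block by block.
    if len(cpu_block_ids) > 0 and is_contiguous(cpu_block_ids):
        return "single_dma_plus_scatter"
    return "per_block_copy"


print(plan_swap([4, 5, 6, 7]))  # contiguous run
print(plan_swap([4, 6, 7]))     # gap between 4 and 6
```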

Usage or Command

export FD_PD_TRANSFER_VIA_STORAGE=1
export MOONCAKE_MASTER_SERVER_ADDR=<master_ip>:<port>
bash examples/cache_storage/run_03b_pd_storage.sh
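
For readers who want to see how such a switch is typically consumed, here is a minimal sketch of reading the flag from the environment. It assumes the variable is parsed as an integer with a default of 0; the helper name `pd_transfer_via_storage` is hypothetical, and the real parsing in fastdeploy/envs.py may differ.

```python
import os
from typing import Mapping


def pd_transfer_via_storage(env: Mapping[str, str] = os.environ) -> bool:
    """Return True when storage pool mode is requested (assumed: 1 = on, 0 = off)."""
    return int(env.get("FD_PD_TRANSFER_VIA_STORAGE", "0")) == 1


# Passing a plain dict keeps the example independent of the real environment.
mode = "storage pool" if pd_transfer_via_storage({"FD_PD_TRANSFER_VIA_STORAGE": "1"}) else "direct"
print(mode)
```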

Accuracy Tests

N/A

Checklist

Add at least a tag in the PR title.
- Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- You can add new tags based on the PR content, but the semantics must be clear.
Format your code, run pre-commit before commit.
Add unit tests. Please write the reason in this PR if no unit tests.
Provide accuracy results.
If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot · 2026-05-17T01:07:58Z

Thanks for your contribution!

Copilot

Pull request overview

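This PR adds a "Storage Pool mode" for PD disaggregation: instead of pushing the KVCache directly to the D-side GPU over RDMA/IPC, the P side writes it to a global storage pool and the D side reads it back from the pool. The mode is toggled by the environment variable FD_PD_TRANSFER_VIA_STORAGE. The PR also refactors swap_cache_layout.cu, introducing D2H/H2D kernels that aggregate all layers per swap block, plus global staging and pointer buffers, to improve swap performance.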

Changes:

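Adds the FD_PD_TRANSFER_VIA_STORAGE environment variable and the corresponding branches in token_processor and cache_messager on the P side and in common_engine and resource_manager_v1 on the D side, together with prefix_cache_manager.write_all_cache_to_storage / read_cache_from_storage_for_pd.
Refactors swap_cache_layout.cu, introducing swap_d2h_kernel / scatter_blocks_kernel and file-level staging/pointer caches, with an optimized branch for the common case where cpu_block_ids are contiguous.
Adjusts several scheduling debug logs (forces update_metrics(True) inside schedule(), moves the scheduled_reqs log out of its guard) and updates MOONCAKE_GLOBAL_SEGMENT_SIZE in examples/cache_storage/run_03b_pd_storage.sh.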

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 8 comments.

File	Description
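fastdeploy/envs.py	Adds the `FD_PD_TRANSFER_VIA_STORAGE` switch
fastdeploy/output/token_processor.py	P side writes the full cache to storage before `send_first_token`
fastdeploy/cache_manager/cache_messager.py	In storage mode, skips the RDMA transfer and marks the task finished directly
fastdeploy/cache_manager/prefix_cache_manager.py	Adds the write/read flow dedicated to PD storage pool mode and the `:partial:N` key
fastdeploy/engine/common_engine.py	D side synchronously reads the cache from the storage pool before adding the request to the running queue
fastdeploy/engine/sched/resource_manager_v1.py	The finish path uses the new write method; adds verbose metrics/debug logs inside scheduling
custom_ops/gpu_ops/swap_cache_layout.cu	Refactored into a batched swap implementation based on global buffers and kernels
examples/cache_storage/run_03b_pd_storage.sh	Increases the Mooncake segment size in the example script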

Comments suppressed due to low confidence (2)

fastdeploy/engine/common_engine.py:1854

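Here the request object is fetched directly via self.resource_manager.requests[request_id], with no defensive check. In the D flow, the branches before add_prefilled_request (first token is EOS, error_code != 200) already call pre_recycle_resource and then continue, and when the storage read fails this function itself also calls pre_recycle_resource(request_id) and then continue. If the outer loop later receives the same request_id again (for example on a retry or an error resend), or if upstream has already cleared the requests dict, a KeyError is raised and swallowed along with the rest of _process_prefilled_requests, which blocks the processing of subsequent requests. Consider using self.resource_manager.requests.get(request_id) and, when it returns None, logging an error and skipping/recycling, rather than relying on the outer try/except.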

                        request = self.resource_manager.requests[request_id]
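
The defensive lookup suggested above can be sketched as follows. This is an illustration only: `lookup_request` and the logger name are hypothetical, and a plain dict stands in for `self.resource_manager.requests`.

```python
import logging
from typing import Any, Dict, Optional

logger = logging.getLogger("pd_storage_sketch")


def lookup_request(requests: Dict[str, Any], request_id: str) -> Optional[Any]:
    """Fetch a request without raising KeyError when it was already recycled."""
    request = requests.get(request_id)
    if request is None:
        # The caller is expected to skip (or recycle) this request_id.
        logger.error("request %s not found in resource manager, skipping", request_id)
    return request


requests = {"req-1": {"prompt_len": 128}}
for rid in ("req-1", "req-2"):
    req = lookup_request(requests, rid)
    if req is None:
        continue  # keep processing the remaining requests
    print(rid, req["prompt_len"])
```

With this shape, a missing entry affects only that one request rather than aborting the whole batch inside the outer try/except.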

fastdeploy/cache_manager/cache_messager.py:695

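The log string "[PD Storage] Skip RDMA transfer, mark as finished, " f"req_id: {task['request_id']}" here is actually two adjacent literals concatenated, and only the second one is an f-string. The code works, but the style can mislead readers into thinking the first half is also an f-string; consider merging it into a single f-string. The same pattern appears at lines 1859, 1866, and 1873 of common_engine.py in this PR and should be unified as well.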

                            logger.info(
                                f"[PD Storage] Skip RDMA transfer, mark as finished, " f"req_id: {task['request_id']}"
                            )
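
As a quick check that merging the literals does not change the message, the sketch below compares the split form from the snippet as quoted with a single f-string. The `task` dict is a made-up stand-in for the real task object.

```python
task = {"request_id": "req-0"}  # hypothetical task payload

# Adjacent string literals are concatenated at compile time.
split_msg = f"[PD Storage] Skip RDMA transfer, mark as finished, " f"req_id: {task['request_id']}"

# Equivalent single f-string, as the review suggests.
merged_msg = f"[PD Storage] Skip RDMA transfer, mark as finished, req_id: {task['request_id']}"

assert split_msg == merged_msg
print(merged_msg)
```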

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.

Comments suppressed due to low confidence (1)

fastdeploy/cache_manager/prefix_cache_manager.py:1410

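In write_all_cache_to_storage and read_cache_from_storage_for_pd, the logic for building the partial-block key, restoring token_ids, and splitting by block_size is almost identical (lines 1304–1331 and 1385–1410). Consider extracting a private method (for example _compute_storage_keys_with_partial(input_token_ids, request)) that returns the list of keys, so that a later change to the hash algorithm cannot be applied in one place and missed in the other, which would leave the P and D sides with inconsistent keys and unable to read the cache.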

        # 2. Calculate cache keys using same algorithm as write_all_cache_to_storage
        keys = []
        prefix_block_key = []
        block_size = self.config.cache_config.block_size
        mm_idx = 0

        for i in range(0, len(input_token_ids), block_size):
            block_token_ids = input_token_ids[i : i + block_size]
            actual_token_num = len(block_token_ids)

            if actual_token_num < block_size:
                key = get_hash_str(block_token_ids, prefix_block_key)
                key = f"{key}:partial:{actual_token_num}"
                keys.append(key)
            else:
                mm_idx, extra_keys = self.get_block_hash_extra_keys(
                    request=request,
                    start_idx=i,
                    end_idx=i + block_size,
                    mm_idx=mm_idx,
                )
                prefix_block_key.extend(extra_keys)
                key = get_hash_str(block_token_ids, prefix_block_key)
                keys.append(key)

            prefix_block_key = [key]
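
A single shared helper along the lines the review proposes might look like the sketch below. It is deliberately simplified: `toy_hash` is a stand-in for the project's `get_hash_str` (whose exact behavior is not shown in this PR page), and the multimodal extra keys from `get_block_hash_extra_keys` are omitted. Only the chaining and the `:partial:N` suffix mirror the snippet above.

```python
import hashlib
from typing import List, Sequence


def toy_hash(token_ids: Sequence[int], prefix_keys: Sequence[str]) -> str:
    # Stand-in hash (assumption): any deterministic function of the block
    # tokens and the previous key would do for this illustration.
    payload = repr((list(prefix_keys), list(token_ids))).encode()
    return hashlib.sha256(payload).hexdigest()


def compute_storage_keys_with_partial(input_token_ids: Sequence[int], block_size: int) -> List[str]:
    """Return one chained key per block; a short last block gets ':partial:N'."""
    keys: List[str] = []
    prefix: List[str] = []
    for i in range(0, len(input_token_ids), block_size):
        block = input_token_ids[i : i + block_size]
        key = toy_hash(block, prefix)
        if len(block) < block_size:
            key = f"{key}:partial:{len(block)}"
        keys.append(key)
        prefix = [key]  # each key depends on the previous block's key
    return keys


keys = compute_storage_keys_with_partial(list(range(10)), block_size=4)
print(len(keys), keys[-1].rsplit(":", 2)[1:])
```

If both the write and the read path call the same helper, the P and D sides produce identical keys by construction.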

codecov-commenter · 2026-05-17T03:10:47Z

Codecov Report

❌ Patch coverage is 31.89655% with 158 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (release/2.6@d71bdda). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
fastdeploy/cache_manager/prefix_cache_manager.py	20.00%	76 Missing and 4 partials ⚠️
fastdeploy/output/token_processor.py	41.17%	25 Missing and 5 partials ⚠️
fastdeploy/engine/common_engine.py	0.00%	15 Missing ⚠️
fastdeploy/cache_manager/cache_transfer_manager.py	72.97%	7 Missing and 3 partials ⚠️
...executor/layers/attention/mla_attention_backend.py	16.66%	5 Missing ⚠️
fastdeploy/engine/args_utils.py	0.00%	2 Missing and 1 partial ⚠️
fastdeploy/engine/sched/resource_manager_v1.py	25.00%	0 Missing and 3 partials ⚠️
...l_executor/layers/attention/append_attn_backend.py	0.00%	1 Missing and 2 partials ⚠️
...executor/layers/attention/dsa_attention_backend.py	25.00%	3 Missing ⚠️
...el_executor/layers/attention/flash_attn_backend.py	0.00%	1 Missing and 2 partials ⚠️
... and 1 more

Additional details and impacted files

@@              Coverage Diff               @@
##             release/2.6    #7839   +/-   ##
==============================================
  Coverage               ?   72.30%           
==============================================
  Files                  ?      381           
  Lines                  ?    54317           
  Branches               ?     8494           
==============================================
  Hits                   ?    39274           
  Misses                 ?    12274           
  Partials               ?     2769

Flag	Coverage Δ
GPU	`72.30% <31.89%> (?)`


PaddlePaddle-bot · 2026-05-17T03:35:16Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-19 12:22:39

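This CI report is generated from the following code (updated every 30 minutes):

PR commit: 2eedbec
Merge base: d71bdda (branch: release/2.6)
View full diff
CI details

1 Task overview

1 failed Required task needs priority handling (the Approval check did not pass).

Total runs (reruns)	Total tasks	✅ Passed	❌ Failed	⏳ Running	⏸️ Pending	Skipped
18(0)	18	14	3	0	1	0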

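2 Task status summary

2.1 Required tasks: 3/4 passed

Required tasks block merging; failures must be handled first.

Status	Task	Duration	Root cause	Suggested fix	Log	Rerun
❌	`Approval`	9s	PR issue: missing RD approval; changes to envs.py require approval and the Cherry-Pick requirements are not met	Ask a FastDeploy RD to approve the PR	Job	-
✅	The other 3 required tasks passed	-	-	-	-	-

2.2 Optional tasks: 11/14 passed

Optional tasks do not block merging; failures are for reference only.

Status	Task	Duration	Log	Rerun
❌	`Run iluvatar Tests / run_iluvatar_cases`	21m51s	Job	-
❌	`Trigger Jenkins for PR`	5m39s	Job	-
⏸️	`CI_HPU`	-	-	-
✅	The other 11 optional tasks passed	-	-	-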

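3 Failure details (required only)

Approval: code convention / approval check failed (confidence: high)

Approval

Status: ❌ Failed
Error type: code convention (PR approval not granted)
Confidence: high
Root cause summary: the script detected 2 approval errors; changes to envs.py require RD approval and the Cherry-Pick requirements are not met
Analyzer: generic analysis (fallback)

Root cause details:
The script scripts/check_approval.sh detected 2 approval errors (exit code=6). Error 1: the PR modifies fastdeploy/envs.py, which requires an Approve from at least one FastDeploy RD (jiangjiajun, liuyuanle, chenjian26, or wanglongzhi). Error 2: a Cherry-Pick PR must have a title containing [Cherry-Pick] and the original develop PR number, and needs approval from qingqing01, jiangjiajun, or dengkaipeng.

Key log lines:

==> PR title: [PD] PD send cache via storage & Refine swap_cache_layout op
0. You must have one FastDeploy RD approval for modifying [fastdeploy/envs.py].
1. Cherry-Pick PR must come from develop and title must contain [Cherry-Pick]...
There are 2 approved errors.
##[error]Process completed with exit code 6.

Suggested fixes:

Ask at least one of jiangjiajun, liuyuanle, chenjian26, or wanglongzhi to approve the PR
If this PR is a Cherry-Pick, add [Cherry-Pick] and the original PR number to the title, and ask qingqing01, jiangjiajun, or dengkaipeng to approve

Fix summary: ask a FastDeploy RD to approve (2 approval requirements must be met in total)

Link: view log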

Copilot

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 7 comments.

PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-05-18 17:46:07

📋 Review summary

PR overview: adds a Storage Pool transfer mode for PD disaggregation (decoupling the direct P/D connection via Mooncake storage) and refactors the swap_cache_layout CUDA operator (contiguous blocks are merged into one DMA plus a scatter kernel)

Scope of changes: custom_ops/gpu_ops/, fastdeploy/cache_manager/, fastdeploy/engine/, fastdeploy/envs.py, fastdeploy/model_executor/layers/attention/

Impact tags: [OP] [KVCache] [Engine] [PD Disaggregation]

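Issues

Level	File	Summary
🔴 Bug	`custom_ops/gpu_ops/swap_cache_layout.cu:76`	The three global CUDA buffers (`g_staging_buffer` and others) only grow and never shrink, with no `cudaFree` before process exit; this matches the §C GPU memory leak signal
🟡 Suggestion	`custom_ops/gpu_ops/swap_cache_layout.cu:32`	`num_vec_per_layer` uses integer division; when `block_stride * sizeof(T)` is not divisible by 16, the tail data is silently dropped
❓ Question	`fastdeploy/cache_manager/cache_transfer_manager.py:1302`	`swap_to_storage_barrier.reset()` was removed; please confirm whether the Barrier implementation resets automatically, otherwise concurrent requests may deadlock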
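
The integer-division concern in the second row can be made concrete with a little arithmetic. The sketch below assumes, based only on the review's wording, that the kernel copies data in 16-byte vectors and computes the vector count as `block_stride * sizeof(T) // 16`; the function name and the example sizes are hypothetical.

```python
VEC_BYTES = 16  # assumed vector width, taken from the "divisible by 16" remark


def vectorized_copy_coverage(block_stride: int, elem_size: int) -> tuple:
    """Return (number of 16-byte vectors, bytes left uncopied at the tail)."""
    total_bytes = block_stride * elem_size
    num_vec = total_bytes // VEC_BYTES  # integer division drops the remainder
    tail_bytes = total_bytes - num_vec * VEC_BYTES
    return num_vec, tail_bytes


# 1024 two-byte elements: 2048 bytes, an exact multiple of 16.
print(vectorized_copy_coverage(1024, 2))
# 1001 two-byte elements: 2002 bytes, so 2 bytes would be skipped.
print(vectorized_copy_coverage(1001, 2))
```

A non-zero tail means either a scalar tail loop or an explicit divisibility check is needed before taking the vectorized path.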

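📝 PR convention check

The title is non-compliant in two respects:

It uses [PD] (not an official tag); the corresponding official tag is [PD Disaggregation]
The target branch is release/2.6 (not develop), so the [Cherry-Pick] format is required; Checklist item 5 (cherry-pick) is unchecked, and the description does not give the develop PR number

Suggested title (can be copied directly):

[Cherry-Pick][PD Disaggregation] PD send cache via storage & Refine swap_cache_layout op (#original-PR-number)

Note: replace #original-PR-number with the number of the corresponding PR already merged into the develop branch.

Overall assessment

The overall design of Storage Pool mode is sound, the TP multi-rank result aggregation (min(saved_results, key=len)) is clear, and the error-handling path is noticeably improved. However, the CUDA side introduces three global buffers that are never freed, which causes GPU memory fragmentation in production environments where GPU memory is tight. This is a §C must-report item and needs to be fixed before merging.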
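
To illustrate the aggregation expression mentioned above: `min(saved_results, key=len)` picks the shortest list among the per-rank results. The sketch assumes each TP rank reports the list of keys it managed to save, in prefix order, so the shortest list is the conservative answer all ranks can honor; the sample data is made up.

```python
# Hypothetical per-rank results: rank 1 saved one block fewer than the others.
saved_results = [
    ["k0", "k1", "k2"],  # rank 0
    ["k0", "k1"],        # rank 1
    ["k0", "k1", "k2"],  # rank 2
]

# Keep only what the least successful rank saved.
agreed = min(saved_results, key=len)
print(agreed)
```

Note that `min` raises ValueError on an empty sequence, so a caller with zero ranks reporting would need a guard.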

Copilot

Pull request overview

Copilot reviewed 19 out of 19 changed files in this pull request and generated 4 comments.

Copilot AI review requested due to automatic review settings May 17, 2026 01:07

juncaipeng had a problem deploying to Metax_ci May 17, 2026 01:07 — with GitHub Actions Failure

Copilot started reviewing on behalf of juncaipeng May 17, 2026 01:08

Copilot AI reviewed May 17, 2026

juncaipeng force-pushed the release/2.6-0514 branch from 8aa93f4 to 2c1334e Compare May 17, 2026 01:12

juncaipeng had a problem deploying to Metax_ci May 17, 2026 01:12 — with GitHub Actions Failure