[bugfix] free blocks even if AS write failed by zccjjj · Pull Request #7807 · PaddlePaddle/FastDeploy

zccjjj · 2026-05-13T12:59:20Z

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

💡 如若此PR是Cherry Pick，PR标题需遵循格式，在最开始加上[Cherry-Pick]标签，以及最后面加上原PR ID，例如[Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

Accuracy Tests

Checklist

Add at least a tag in the PR title.
- Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- You can add new tags based on the PR content, but the semantics must be clear.
Format your code, run pre-commit before commit.
Add unit tests. Please write the reason in this PR if no unit tests.
Provide accuracy results.
If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot · 2026-05-13T12:59:26Z

Thanks for your contribution!

PaddlePaddle-bot · 2026-05-13T13:24:03Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-20 08:23:07

CI报告基于以下代码生成（30分钟更新一次）:

PR commit: 8a35db7
Merge base: 12c6ae0 (branch: develop)
查看完整 Diff
CI 详情

1 任务总览

❌ 存在 1 个 Required 失败任务，阻塞合并，需优先处理。

总执行（rerun次数）	总任务	✅ 通过	❌ 失败	⏳ 运行中	⏸️ 等待中	跳过
70(25)	45	39	3	0	0	3

2 任务状态汇总

2.1 Required任务 : 9/12 通过

必选任务阻塞合并，失败需优先处理。

状态	任务	耗时	根因	修复建议	日志	重跑
❌	`Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage`	1h22m	PR问题：PR新增代码无测试覆盖，行覆盖率未达80%阈值	为新增分支逻辑添加单元测试或申请豁免	Job	🔄×1
✅	其余 9 个必选任务通过	-	-	-	-	-

2.2 可选任务 — 30/33 通过

可选任务不阻塞合并，失败仅供参考。

状态	任务	耗时	日志	重跑
❌	`Check PR Template`	13s	Job	-
❌	`CI_HPU`	1h4m	Job	-
✅	其余 30 个可选任务通过	-	-	-

3 失败详情（仅 required）

Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage — 代码覆盖率（置信度: 高）

Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage

状态: ❌ 失败
错误类型: 代码覆盖率
置信度: 高
根因摘要: PR新增代码无测试覆盖，行覆盖率未达80%阈值
分析器: ci_analyze_unittest_fastdeploy

失败步骤: Verify Code Coverage Threshold (80%) (exit code 9)

根因详情:
本次 PR 在 prefix_cache_manager.py 和 resource_manager_v1.py 两个文件中共新增约 103 行代码（净增 57 行），包含大量新增分支逻辑（skipped_nodes 回收、can_recycle_gpu_block_ids 复用子节点路径、多处 try/except 包裹），但未同步添加对应的单元测试。CI 单元测试执行步骤本身通过，仅覆盖率检查步骤（阈值 80%）失败，退出码 9。

修复建议:

为 fastdeploy/cache_manager/prefix_cache_manager.py 中新增的 skipped_nodes 回收逻辑（约 L1511-L1516）和 mm_build_path/build_path 中的 if hash_value in node.children 复用分支添加单元测试
为 fastdeploy/engine/sched/resource_manager_v1.py 中 _trigger_preempt（L440-L450）及 finish_requests（L1722-L1729）新增的异常处理分支添加单元测试（可通过 mock write_cache_to_storage 抛异常来覆盖）
如新增代码确实难以在单测中覆盖，可在 PR 中说明原因，申请覆盖率豁免

修复建议摘要: 为新增分支逻辑添加单元测试或申请豁免

codecov-commenter · 2026-05-13T14:45:43Z

Codecov Report

❌ Patch coverage is 46.15385% with 35 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@12c6ae0). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
fastdeploy/cache_manager/prefix_cache_manager.py	33.33%	29 Missing and 5 partials ⚠️
fastdeploy/engine/sched/resource_manager_v1.py	92.85%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #7807   +/-   ##
==========================================
  Coverage           ?   63.59%           
==========================================
  Files              ?      462           
  Lines              ?    64350           
  Branches           ?     9865           
==========================================
  Hits               ?    40926           
  Misses             ?    20630           
  Partials           ?     2794

Flag	Coverage Δ
GPU	`72.75% <46.15%> (?)`
XPU	`7.12% <0.00%> (?)`

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

PaddlePaddle-bot · 2026-05-17T15:20:29Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-17 23:18:01

CI报告基于以下代码生成（30分钟更新一次）:

PR commit: b56a7c8
Merge base: 12c6ae0 (branch: develop)
查看完整 Diff
CI 详情

1 任务总览

有 2 个 Required 任务失败，需优先处理后方可合并。

总执行（rerun次数）	总任务	✅ 通过	❌ 失败	⏳ 运行中	⏸️ 等待中	跳过
42(0)	42	36	6	0	0	0

2 任务状态汇总

2.1 Required任务 : 8/10 通过

必选任务阻塞合并，失败需优先处理。

状态	任务	耗时	根因	修复建议	日志	重跑
❌	`Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage`	1h20m	PR问题：prefix_cache_manager.py 新增代码 diff 覆盖率 76%，低于 80% 阈值	为 L1485/1491/1516-1518 的 skipped_nodes 逻辑添加单元测试	Job	-
❌	`Run Base Tests / base_tests`	12m31s	PR问题：test_waiting_time count_200=1005 < 1024，block 管理变更影响并发成功率	检查 skipped_nodes 改动对 block 分配的影响或调整测试阈值	Job	-
✅	其余 8 个必选任务通过	-	-	-	-	-

2.2 可选任务 — 28/32 通过

可选任务不阻塞合并，失败仅供参考。

状态	任务	耗时	日志	重跑
❌	`Run iluvatar Tests / run_iluvatar_cases`	10m29s	Job	-
❌	`Check PR Template`	10s	Job	-
❌	`CI_HPU`	1h7m	Job	-
❌	`Trigger Jenkins for PR`	15m11s	Job	-
✅	其余 28 个可选任务通过	-	-	-

3 失败详情（仅 required）

Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage — 覆盖率不达标（置信度: 高）

Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage

状态: ❌ 失败
错误类型: 覆盖率不达标
置信度: 高
根因摘要: prefix_cache_manager.py 新增 7 行代码，5 行未覆盖，diff 覆盖率 76% 低于 80% 阈值
分析器: ci_analyze_unittest_fastdeploy

根因详情:
PR 在 fastdeploy/cache_manager/prefix_cache_manager.py 新增了 skipped_nodes 相关逻辑（共 7 行新增代码），但其中仅 2 行被现有测试覆盖，第 1485、1491、1516、1517、1518 行未被任何测试执行。diff 覆盖率报告显示 prefix_cache_manager.py 仅有 28.57% 覆盖率，综合 diff 覆盖率为 76%，低于 CI 要求的 80% 阈值，触发 exit code 9。

关键日志:

COVERAGE_EXIT_CODE: 9
{"fastdeploy/cache_manager/prefix_cache_manager.py": {
  "percent_covered": 28.57,
  "violation_lines": [1485, 1491, 1516, 1517, 1518]},
 "total_percent_covered": 76,
 "total_num_violations": 5}
##[error]Process completed with exit code 9.

修复建议:

在对应测试文件中为 free_block_ids_async 的 skipped_nodes 逻辑添加单元测试，覆盖 shared_count > 0 时节点被回填堆的场景（覆盖 prefix_cache_manager.py 第 1485、1491、1516-1518 行）
若无法立即补充测试，可在 CI 配置中为该文件申请覆盖率豁免

修复建议摘要: 为 prefix_cache_manager.py L1485-1518 的 skipped_nodes 逻辑补充单元测试

关联变更: fastdeploy/cache_manager/prefix_cache_manager.py 新增 skipped_nodes 列表及回填逻辑（+13 行）
链接: 查看日志

Run Base Tests / base_tests — 测试失败（置信度: 中）

Run Base Tests / base_tests

状态: ❌ 失败
错误类型: 测试失败
置信度: 中
根因摘要: test_waiting_time count_200=1005 < 1024，疑为 block 管理变更影响并发请求调度
分析器: ci_analyze_unittest_fastdeploy

失败用例:

测试	错误	根因
`test_max_waiting_time.py::test_waiting_time`	AssertionError: count_200=1005 < 1024	skipped_nodes 回填逻辑改变 block 分配时序

根因详情:
test_waiting_time 并发发送请求并统计 HTTP 200 响应数，期望 >= 1024，实际仅 1005（差 19 个，98.1% 成功率）。PR 修改了 prefix_cache_manager.py::free_block_ids_async 中 skipped_nodes 逻辑——原先 shared_count > 0 的节点会从堆中"永久丢失"，现在被回填到堆中。此行为变化影响重并发场景下的 block 分配时序，可能导致更多请求因资源竞争未能在等待超时内完成。置信度为"中"，因 1005 vs 1024 非常接近，不排除环境波动所致。

关键日志:

assert count_200 >= 1024, f"200 数量错误，应大于等于1024，实际={count_200}"
AssertionError: 200 数量错误，应大于等于1024，实际=1005
assert 1005 >= 1024
test_max_waiting_time.py:62: AssertionError
FAILED test_max_waiting_time.py::test_waiting_time - 1 failed in 37.09s

修复建议:

检查 prefix_cache_manager.py::free_block_ids_async 的 skipped_nodes 回填逻辑，确认新行为下并发 block 调度是否符合预期；若 block 分配行为发生了实质性变化，需评估对 test_max_waiting_time.py 中阈值 1024 的影响
若确认为 block 管理的合理变更，更新 test_max_waiting_time.py:62 的断言阈值；若为环境波动，尝试 rerun

修复建议摘要: 确认 skipped_nodes 回填对 block 分配的影响，或尝试 rerun

关联变更: fastdeploy/cache_manager/prefix_cache_manager.py L1511-1516 新增 skipped_nodes 回填逻辑
链接: 查看日志

PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-05-18 16:35:11

📋 Review 摘要

PR 概述：修复 AS 写入失败时 block 未被释放的 bug，同时修复 prefix cache 路径构建中子节点被覆盖及 LRU heap 节点永久泄漏问题
变更范围：fastdeploy/cache_manager/prefix_cache_manager.py、fastdeploy/engine/sched/resource_manager_v1.py
影响面 Tag：[KVCache] [Scheduler]

问题

级别	文件	概述
🟡 建议	`prefix_cache_manager.py:2036`	`heapq.heapify` 在 `mm_build_path` 循环内调用，O(n) 重复，建议改为标志位延迟到循环后一次性调用
🟡 建议	`prefix_cache_manager.py:2134`	`build_path` 中同样问题，`heapq.heapify` 在循环内重复调用
📝 PR 规范	—	标题 Tag 大小写不符规范；描述各 section 均未填写

📝 PR 规范检查

标题使用 [bugfix]，应为官方规范 Tag [BugFix]（首字母大写）；PR 描述中 Motivation、Modifications、Usage or Command、Accuracy Tests 各 section 均为空，未填写实际内容，Checklist 均未勾选。

标题建议（可直接复制）：

[BugFix] free blocks even if AS write failed

PR 描述建议（可直接复制，必须复刻 checklist §D2 模板的完整结构）：

## Motivation
修复三类 Bug：
1. 当 `write_cache_to_storage` / `write_cache_to_storage_decode` 抛出异常时，`_free_blocks` 未被调用，导致 block 永久泄漏；
2. 在 `build_path` / `mm_build_path` 构建 radix tree 路径时，若 hash 值相同的子节点已存在，原代码直接覆盖创建新节点，导致已有节点的 shared_count / req_id_set 等状态丢失；
3. `free_block_ids_async` 中 `shared_count > 0` 的节点被从 LRU heap 弹出后 `continue` 跳过，节点永久丢失无法在 shared_count 归零后被驱逐。

## Modifications
- `fastdeploy/engine/sched/resource_manager_v1.py`：在 `_trigger_preempt` 和 `finish_requests` 中对 `write_cache_to_storage*` 调用加 try/except，确保写入失败时仍调用 `_free_blocks`
- `fastdeploy/cache_manager/prefix_cache_manager.py`：
  - `free_block_ids_async`：用 `skipped_nodes` 列表暂存 `shared_count > 0` 的跳过节点，循环结束后将其重新放回 heap/set
  - `mm_build_path` / `build_path`：对 hash 命中的已有子节点执行 `increment_shared_count` 并复用，同时将多余已分配 GPU block 加入 `can_recycle_gpu_block_ids` 回收

## Usage or Command
N/A

## Accuracy Tests
N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

总体评价

三处 Bug 修复逻辑正确：AS 写入异常隔离、skipped_nodes 回归 heap、已有子节点复用而非覆盖，整体方向合理。建议优化 build_path / mm_build_path 中 heapq.heapify 在循环内的低效调用，并补全 PR 描述。

PaddlePaddle-bot · 2026-05-18T08:38:33Z

+                    if child in self.gpu_lru_leaf_set:
+                        self.gpu_lru_leaf_set.remove(child)
+                        self.gpu_lru_leaf_heap.remove(child)
+                        heapq.heapify(self.gpu_lru_leaf_heap)


🟡 建议 heapq.heapify 在 for 循环内部每次匹配到已有子节点时都被调用，时间复杂度为 O(n) × 循环次数。

cache_output_blocks 中采用了更高效的标志位延迟方案（见 L1923-1952），建议此处对齐：

has_modified_heap = False # ... 循环内部: if child in self.gpu_lru_leaf_set: self.gpu_lru_leaf_set.remove(child) self.gpu_lru_leaf_heap.remove(child) has_modified_heap = True # 循环结束后一次性 heapify: if has_modified_heap: heapq.heapify(self.gpu_lru_leaf_heap)

PaddlePaddle-bot · 2026-05-18T08:38:33Z

+                    if child in self.gpu_lru_leaf_set:
+                        self.gpu_lru_leaf_set.remove(child)
+                        self.gpu_lru_leaf_heap.remove(child)
+                        heapq.heapify(self.gpu_lru_leaf_heap)


🟡 建议 同 mm_build_path：heapq.heapify 在循环内被重复调用。建议改为标志位延迟到循环后一次性调用，降低时间复杂度。

zccjjj had a problem deploying to Metax_ci May 13, 2026 12:59 — with GitHub Actions Error

This comment was marked as outdated.

Sign in to view

zccjjj had a problem deploying to Metax_ci May 13, 2026 13:31 — with GitHub Actions Failure

This comment was marked as outdated.

Sign in to view

zccjjj had a problem deploying to Metax_ci May 14, 2026 06:53 — with GitHub Actions Failure

This comment was marked as outdated.

Sign in to view

zccjjj added 4 commits May 17, 2026 01:17

[bugfix] free blocks even if AS write failed

60a5eeb

update

50db4bf

update preemption bugfix

ace130f

[bugfix] block leaks

b56a7c8

zccjjj force-pushed the ASbugfix branch from 0303371 to b56a7c8 Compare May 16, 2026 17:19

zccjjj had a problem deploying to Metax_ci May 16, 2026 17:20 — with GitHub Actions Failure

This comment was marked as outdated.

Sign in to view

bugfix block leaks because of storage prefetch blocks rebuild path

8a35db7

zccjjj temporarily deployed to Metax_ci May 18, 2026 08:26 — with GitHub Actions Inactive

PaddlePaddle-bot reviewed May 18, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[bugfix] free blocks even if AS write failed#7807

[bugfix] free blocks even if AS write failed#7807
zccjjj wants to merge 5 commits into
PaddlePaddle:developfrom
zccjjj:ASbugfix

zccjjj commented May 13, 2026

Uh oh!

paddle-bot Bot commented May 13, 2026

Uh oh!

This comment was marked as outdated.

Uh oh!

PaddlePaddle-bot commented May 13, 2026 •

edited

Loading

Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage

Uh oh!

This comment was marked as outdated.

Uh oh!

codecov-commenter commented May 13, 2026 •

edited

Loading

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as outdated.

Uh oh!

PaddlePaddle-bot commented May 17, 2026

Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage

Run Base Tests / base_tests

Uh oh!

PaddlePaddle-bot left a comment

Uh oh!

PaddlePaddle-bot May 18, 2026

Uh oh!

PaddlePaddle-bot May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

zccjjj commented May 13, 2026

Motivation

Modifications

Usage or Command

Accuracy Tests

Checklist

Uh oh!

paddle-bot Bot commented May 13, 2026

Uh oh!

This comment was marked as outdated.

Uh oh!

PaddlePaddle-bot commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

1 任务总览

2 任务状态汇总

2.1 Required任务 : 9/12 通过

2.2 可选任务 — 30/33 通过

3 失败详情（仅 required）

Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage

Uh oh!

This comment was marked as outdated.

Uh oh!

codecov-commenter commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as outdated.

Uh oh!

PaddlePaddle-bot commented May 17, 2026

1 任务总览

2 任务状态汇总

2.1 Required任务 : 8/10 通过

2.2 可选任务 — 28/32 通过

3 失败详情（仅 required）

Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage

Run Base Tests / base_tests

Uh oh!

PaddlePaddle-bot left a comment

Choose a reason for hiding this comment

📋 Review 摘要

问题

📝 PR 规范检查

总体评价

Uh oh!

PaddlePaddle-bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

PaddlePaddle-bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

PaddlePaddle-bot commented May 13, 2026 •

edited

Loading

codecov-commenter commented May 13, 2026 •

edited

Loading