[bugfix] free blocks even if AS write failed#7807
Conversation
|
Thanks for your contribution! |
CI报告基于以下代码生成(30分钟更新一次): 1 任务总览❌ 存在 1 个 Required 失败任务,阻塞合并,需优先处理。
2 任务状态汇总2.1 Required任务 : 9/12 通过
2.2 可选任务 — 30/33 通过
3 失败详情(仅 required)Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage — 代码覆盖率(置信度: 高)Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage
失败步骤: 根因详情: 修复建议:
修复建议摘要: 为新增分支逻辑添加单元测试或申请豁免 |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## develop #7807 +/- ##
==========================================
Coverage ? 63.59%
==========================================
Files ? 462
Lines ? 64350
Branches ? 9865
==========================================
Hits ? 40926
Misses ? 20630
Partials ? 2794
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
CI报告基于以下代码生成(30分钟更新一次): 1 任务总览有 2 个 Required 任务失败,需优先处理后方可合并。
2 任务状态汇总2.1 Required任务 : 8/10 通过
2.2 可选任务 — 28/32 通过
3 失败详情(仅 required)Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage — 覆盖率不达标(置信度: 高)Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage
根因详情: 关键日志: 修复建议:
修复建议摘要: 为 prefix_cache_manager.py L1485-1518 的 skipped_nodes 逻辑补充单元测试 关联变更: Run Base Tests / base_tests — 测试失败(置信度: 中)Run Base Tests / base_tests
失败用例:
根因详情: 关键日志: 修复建议:
修复建议摘要: 确认 skipped_nodes 回填对 block 分配的影响,或尝试 rerun 关联变更: |
PaddlePaddle-bot
left a comment
There was a problem hiding this comment.
🤖 Paddle-CI-Agent | pr_review |
2026-05-18 16:35:11
📋 Review 摘要
PR 概述:修复 AS 写入失败时 block 未被释放的 bug,同时修复 prefix cache 路径构建中子节点被覆盖及 LRU heap 节点永久泄漏问题
变更范围:fastdeploy/cache_manager/prefix_cache_manager.py、fastdeploy/engine/sched/resource_manager_v1.py
影响面 Tag:[KVCache] [Scheduler]
问题
| 级别 | 文件 | 概述 |
|---|---|---|
| 🟡 建议 | prefix_cache_manager.py:2036 |
heapq.heapify 在 mm_build_path 循环内调用,O(n) 重复,建议改为标志位延迟到循环后一次性调用 |
| 🟡 建议 | prefix_cache_manager.py:2134 |
build_path 中同样问题,heapq.heapify 在循环内重复调用 |
| 📝 PR 规范 | — | 标题 Tag 大小写不符规范;描述各 section 均未填写 |
📝 PR 规范检查
标题使用 [bugfix],应为官方规范 Tag [BugFix](首字母大写);PR 描述中 Motivation、Modifications、Usage or Command、Accuracy Tests 各 section 均为空,未填写实际内容,Checklist 均未勾选。
标题建议(可直接复制):
[BugFix] free blocks even if AS write failed
PR 描述建议(可直接复制,必须复刻 checklist §D2 模板的完整结构):
## Motivation
修复三类 Bug:
1. 当 `write_cache_to_storage` / `write_cache_to_storage_decode` 抛出异常时,`_free_blocks` 未被调用,导致 block 永久泄漏;
2. 在 `build_path` / `mm_build_path` 构建 radix tree 路径时,若 hash 值相同的子节点已存在,原代码直接覆盖创建新节点,导致已有节点的 shared_count / req_id_set 等状态丢失;
3. `free_block_ids_async` 中 `shared_count > 0` 的节点被从 LRU heap 弹出后 `continue` 跳过,节点永久丢失无法在 shared_count 归零后被驱逐。
## Modifications
- `fastdeploy/engine/sched/resource_manager_v1.py`:在 `_trigger_preempt` 和 `finish_requests` 中对 `write_cache_to_storage*` 调用加 try/except,确保写入失败时仍调用 `_free_blocks`
- `fastdeploy/cache_manager/prefix_cache_manager.py`:
- `free_block_ids_async`:用 `skipped_nodes` 列表暂存 `shared_count > 0` 的跳过节点,循环结束后将其重新放回 heap/set
- `mm_build_path` / `build_path`:对 hash 命中的已有子节点执行 `increment_shared_count` 并复用,同时将多余已分配 GPU block 加入 `can_recycle_gpu_block_ids` 回收
## Usage or Command
N/A
## Accuracy Tests
N/A
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.总体评价
三处 Bug 修复逻辑正确:AS 写入异常隔离、skipped_nodes 回归 heap、已有子节点复用而非覆盖,整体方向合理。建议优化 build_path / mm_build_path 中 heapq.heapify 在循环内的低效调用,并补全 PR 描述。
| if child in self.gpu_lru_leaf_set: | ||
| self.gpu_lru_leaf_set.remove(child) | ||
| self.gpu_lru_leaf_heap.remove(child) | ||
| heapq.heapify(self.gpu_lru_leaf_heap) |
There was a problem hiding this comment.
🟡 建议 heapq.heapify 在 for 循环内部每次匹配到已有子节点时都被调用,时间复杂度为 O(n) × 循环次数。
cache_output_blocks 中采用了更高效的标志位延迟方案(见 L1923-1952),建议此处对齐:
has_modified_heap = False
# ... 循环内部:
if child in self.gpu_lru_leaf_set:
self.gpu_lru_leaf_set.remove(child)
self.gpu_lru_leaf_heap.remove(child)
has_modified_heap = True
# 循环结束后一次性 heapify:
if has_modified_heap:
heapq.heapify(self.gpu_lru_leaf_heap)| if child in self.gpu_lru_leaf_set: | ||
| self.gpu_lru_leaf_set.remove(child) | ||
| self.gpu_lru_leaf_heap.remove(child) | ||
| heapq.heapify(self.gpu_lru_leaf_heap) |
There was a problem hiding this comment.
🟡 建议 同 mm_build_path:heapq.heapify 在循环内被重复调用。建议改为标志位延迟到循环后一次性调用,降低时间复杂度。
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
[FDConfig],[APIServer],[Engine],[Scheduler],[PD Disaggregation],[Executor],[Graph Optimization],[Speculative Decoding],[RL],[Models],[Quantization],[Loader],[OP],[KVCache],[DataProcessor],[BugFix],[Docs],[CI],[Optimization],[Feature],[Benchmark],[Others],[XPU],[HPU],[GCU],[DCU],[Iluvatar],[Metax]]pre-commitbefore commit.releasebranch, make sure the PR has been submitted to thedevelopbranch, then cherry-pick it to thereleasebranch with the[Cherry-Pick]PR tag.