
[Cherry-Pick][Feature][Log]console metrics log for pd disaggregation #7843 #7845

Open
CSWYF3634076 wants to merge 1 commit into PaddlePaddle:release/2.6 from CSWYF3634076:release/2.6-console-log

Conversation

@CSWYF3634076
Collaborator

@CSWYF3634076 CSWYF3634076 commented May 18, 2026

Motivation

Cherry-pick from #7843
Fix the issue where the decode (D) node prints prefill logs in PD disaggregation.

Modifications

  • scheduler_metrics_logger.py: Refactor log_prefill_batch into the private method _log_prefill_like_batch; add the public methods log_prefill_batch (a wrapper) and log_decode_bootstrap_batch (bootstrap logging for the Decode node); add a splitwise_role field to the log message; add a splitwise_role parameter to SchedulerMetricsLogger.__init__, defaulting to "mixed".
  • resource_manager_v1.py: In _log_console_scheduler_metrics, call log_decode_bootstrap_batch when splitwise_role == "decode" and the existing log_prefill_batch otherwise, so that Decode-node logs are no longer mislabeled as "Prefill batch".
  • common_engine.py: Pass splitwise_role when constructing SchedulerMetricsLogger.
  • Unit tests: add test_log_decode_bootstrap_batch_logs_expected_message, test_decode_role_prefill_task_logs_decode_bootstrap_batch, and test_default_splitwise_role_is_mixed.
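The refactoring described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the FastDeploy implementation: the method and parameter names come from the PR description, but the method arguments (num_reqs, num_tokens), the message format, the "Decode bootstrap batch" label, and the free function log_console_scheduler_metrics are assumptions made for the example.

```python
import logging

logger = logging.getLogger("scheduler_metrics_sketch")


class SchedulerMetricsLogger:
    """Illustrative stand-in for the class named in the PR (not the real code)."""

    def __init__(self, splitwise_role: str = "mixed"):
        # Default role "mixed" is taken from the PR description.
        self.splitwise_role = splitwise_role

    def _log_prefill_like_batch(self, label: str, num_reqs: int, num_tokens: int) -> str:
        # Shared formatting for both public wrappers; the field layout is assumed.
        msg = (
            f"{label}, splitwise_role: {self.splitwise_role}, "
            f"reqs: {num_reqs}, tokens: {num_tokens}"
        )
        logger.info(msg)
        return msg

    def log_prefill_batch(self, num_reqs: int, num_tokens: int) -> str:
        return self._log_prefill_like_batch("Prefill batch", num_reqs, num_tokens)

    def log_decode_bootstrap_batch(self, num_reqs: int, num_tokens: int) -> str:
        # The label text is hypothetical; the PR only states that Decode nodes
        # must no longer be labeled "Prefill batch".
        return self._log_prefill_like_batch("Decode bootstrap batch", num_reqs, num_tokens)


def log_console_scheduler_metrics(metrics_logger, splitwise_role, num_reqs, num_tokens):
    # Hypothetical free-function version of the dispatch in resource_manager_v1.py.
    if splitwise_role == "decode":
        return metrics_logger.log_decode_bootstrap_batch(num_reqs, num_tokens)
    return metrics_logger.log_prefill_batch(num_reqs, num_tokens)
```

Keeping the formatting in one private helper means the two public methods differ only in their label, so the Prefill and Decode messages cannot drift apart in structure.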

Usage or Command

N/A

Accuracy Tests

N/A

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.


@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 Paddle-CI-Agent | pr_review | 2026-05-18 20:42:55

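📋 Review Summary

PR overview: Adds a dedicated console metrics log for Decode nodes in the PD disaggregation (Splitwise) scenario, fixing the problem of Decode nodes using the Prefill log format.
Scope of changes: fastdeploy/engine/ (common_engine.py, sched/resource_manager_v1.py, sched/scheduler_metrics_logger.py) and the corresponding tests
Impact tags: [Engine] [Scheduler] [PD Disaggregation]

Issues

No blocking issues found.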

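📝 PR Convention Check

The title contains the unofficial tag [Log] and uses two tags ([Feature][Log]) after [Cherry-Pick], which does not follow the Cherry-Pick format convention (only one official tag is allowed).

Suggested title (can be copied directly):

  • [Cherry-Pick][PD Disaggregation] add console metrics log for pd disaggregation (#7843)

Suggested PR description (can be copied directly; it must reproduce the full structure of the checklist §D2 template):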

## Motivation
Cherry-pick from https://github.com/PaddlePaddle/FastDeploy/pull/7843
Fix the issue where the decode (D) node prints prefill logs in PD disaggregation.

## Modifications
- `scheduler_metrics_logger.py`: Refactor `log_prefill_batch` into the private method `_log_prefill_like_batch`; add the public methods `log_prefill_batch` (a wrapper) and `log_decode_bootstrap_batch` (bootstrap logging for the Decode node); add a `splitwise_role` field to the log message; add a `splitwise_role` parameter to `SchedulerMetricsLogger.__init__`, defaulting to `"mixed"`.
- `resource_manager_v1.py`: In `_log_console_scheduler_metrics`, call `log_decode_bootstrap_batch` when `splitwise_role == "decode"` and the existing `log_prefill_batch` otherwise, so that Decode-node logs are no longer mislabeled as "Prefill batch".
- `common_engine.py`: Pass `splitwise_role` when constructing `SchedulerMetricsLogger`.
- Unit tests: add `test_log_decode_bootstrap_batch_logs_expected_message`, `test_decode_role_prefill_task_logs_decode_bootstrap_batch`, and `test_default_splitwise_role_is_mixed`.

## Usage or Command
N/A

## Accuracy Tests
N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [x] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [x] Provide accuracy results.
- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

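Overall Assessment

The code logic is clear, the refactoring is reasonable, test coverage is comprehensive, and the cherry-pick source is clearly stated. The only issue is the unofficial tag [Log] in the title; fixing it before merging is recommended.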

@PaddlePaddle-bot

PaddlePaddle-bot commented May 18, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-19 12:22:38

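The CI report is generated from the following code (updated every 30 minutes):


1 Task Overview

❌ 1 Required task has failed and is blocking the merge; it must be handled first.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
|---|---|---|---|---|---|---|
| 35(0) | 35 | 31 | 4 | 0 | 0 | 0 |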

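2 Task Status Summary

2.1 Required tasks: 8/9 passed

Required tasks block merging; failures must be handled first.

| Status | Task | Duration | Root cause | Suggested fix | Log | Rerun |
|---|---|---|---|---|---|---|
| ❌ | Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage | 1h12m | PR issue: the test uses BatchRequest without importing it, raising NameError | Add an import for BatchRequest at the top of the test file | Job | - |
| ✅ | The other 8 required tasks passed | - | - | - | - | - |

2.2 Optional tasks: 23/26 passed

Optional tasks do not block merging; failures are for reference only.

| Status | Task | Duration | Log | Rerun |
|---|---|---|---|---|
| ❌ | Run iluvatar Tests / run_iluvatar_cases | 17m17s | Job | - |
| ❌ | CI_HPU | 1h7m | Job | - |
| ❌ | Trigger Jenkins for PR | 59s | Job | - |
| ✅ | The other 23 optional tasks passed | - | - | - |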

3 Failure Details (required only)

Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: test failure (confidence: high)

Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage

  • Status: ❌ Failed
  • Error type: Test failure
  • Confidence: High
  • Root cause summary: A test added by the PR uses the BatchRequest class without importing it, causing a NameError
  • Analyzer: ci_analyze_unittest_fastdeploy

Failed cases:

| Test | Error | Root cause |
|---|---|---|
| tests/v1/test_resource_manager_v1.py::TestResourceManagerV1Additional::test_decode_role_prefill_task_logs_decode_bootstrap_batch | NameError: name 'BatchRequest' is not defined | Line 586 of the test uses BatchRequest, which is not imported |

Root cause details:
This PR adds the test method test_decode_role_prefill_task_logs_decode_bootstrap_batch (lines 571~596) to tests/v1/test_resource_manager_v1.py. At line 586 the test calls BatchRequest() and builds the batch object via batch_request.add_request(request), but the PR diff only adds imports for RequestType and envs and no import for BatchRequest, so a NameError is raised at runtime.

Key log:

>       batch_request = BatchRequest()
E       NameError: name 'BatchRequest' is not defined

tests/v1/test_resource_manager_v1.py:586: NameError

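Suggested fix:

  1. In the import section of tests/v1/test_resource_manager_v1.py (around lines 39~44, at from fastdeploy.engine.sched.resource_manager_v1 import (...)), add an import for BatchRequest. Confirm which module defines it before adding the import (for example, from fastdeploy.engine.sched.resource_manager_v1 import BatchRequest, or whichever module it actually comes from).

Fix summary: Add a BatchRequest import statement to the import section of the test file

Related changes: tests/v1/test_resource_manager_v1.py L571-L596 (new test method; L586 uses BatchRequest)

Link: View log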

@paddle-bot

paddle-bot Bot commented May 18, 2026

Thanks for your contribution!

@codecov-commenter

Codecov Report

❌ Patch coverage is 90.00000% with 1 line in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (release/2.6@9894b32). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| fastdeploy/engine/sched/resource_manager_v1.py | 66.66% | 0 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@              Coverage Diff               @@
##             release/2.6    #7845   +/-   ##
==============================================
  Coverage               ?   72.46%           
==============================================
  Files                  ?      381           
  Lines                  ?    54162           
  Branches               ?     8461           
==============================================
  Hits                   ?    39246           
  Misses                 ?    12157           
  Partials               ?     2759           
| Flag | Coverage Δ |
|---|---|
| GPU | 72.46% <90.00%> (?) |
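The totals in the coverage table are internally consistent, which a few lines of Python can confirm. This is only a sketch: it assumes the percentage is computed as hits divided by the sum of hits, misses, and partials, and the variable names are illustrative.

```python
# Figures copied from the coverage table in the report.
hits, misses, partials = 39246, 12157, 2759

# Assumed formula: every tracked line is a hit, a miss, or a partial.
lines = hits + misses + partials
coverage = round(100 * hits / lines, 2)

print(lines, coverage)
```

Under that assumption the line total matches the reported 54162 lines and the ratio rounds to the reported 72.46%.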


@PaddlePaddle-bot

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-19 00:12:07

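The CI report is generated from the following code (updated every 30 minutes):


1 Task Overview

There are currently 2 failed Required tasks; please fix them first.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
|---|---|---|---|---|---|---|
| 36(0) | 36 | 31 | 5 | 0 | 0 | 0 |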

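2 Task Status Summary

2.1 Required tasks: 8/10 passed

Required tasks block merging; failures must be handled first.

| Status | Task | Duration | Root cause | Suggested fix | Log | Rerun |
|---|---|---|---|---|---|---|
| ❌ | Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage | 1h12m | PR issue: the test uses the undefined BatchRequest, causing NameError | Replace BatchRequest() with a [request] list passed at L586 | Job | - |
| ❌ | Pre Commit | 38s | PR issue: flake8/ruff F821, BatchRequest is undefined | Fix test_resource_manager_v1.py:586 | Job | - |
| ✅ | The other 8 required tasks passed | - | - | - | - | - |

2.2 Optional tasks: 23/26 passed

Optional tasks do not block merging; failures are for reference only.

| Status | Task | Duration | Log | Rerun |
|---|---|---|---|---|
| ❌ | Run iluvatar Tests / run_iluvatar_cases | 17m17s | Job | - |
| ❌ | CI_HPU | 1h7m | Job | - |
| ❌ | Trigger Jenkins for PR | 59s | Job | - |
| ✅ | The other 23 optional tasks passed | - | - | - |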

3 Failure Details (required only)

Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: test failure (confidence: high)

Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage

  • Status: ❌ Failed
  • Error type: Test failure
  • Confidence: High
  • Root cause summary: PR issue: the test uses BatchRequest, which is neither defined nor imported, causing a NameError
  • Analyzer: ci_analyze_unittest_fastdeploy

Failed cases:

| Test | Error | Root cause |
|---|---|---|
| v1/test_resource_manager_v1.py::TestResourceManagerV1Additional::test_decode_role_prefill_task_logs_decode_bootstrap_batch | NameError: BatchRequest is not defined | The test added by the PR uses the undefined BatchRequest |
| pooling/test_Ernie4_5_reward_serving | ConnectionResetError: [Errno 104] | The engine worker queue connection was reset; possibly flaky |

Root cause details:
The PR adds the test method test_decode_role_prefill_task_logs_decode_bootstrap_batch to tests/v1/test_resource_manager_v1.py. At line 586 the method uses the BatchRequest() class, but that class is neither imported in the test file nor present in the codebase. The signature of _log_console_scheduler_metrics is (self, scheduled_reqs: list[Request | ScheduledDecodeTask]), so the test should pass a [request] list directly rather than a BatchRequest object.

Key log:

>       batch_request = BatchRequest()
E       NameError: name 'BatchRequest' is not defined

tests/v1/test_resource_manager_v1.py:586: NameError

Suggested fix:

  1. tests/v1/test_resource_manager_v1.py L585-590: delete the two lines batch_request = BatchRequest() and batch_request.add_request(request), and change manager._log_console_scheduler_metrics(batch_request) to manager._log_console_scheduler_metrics([request])
  2. The pooling/test_Ernie4_5_reward_serving failure is a ConnectionReset; a rerun is recommended to see whether it is flaky
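The first suggestion could look roughly like the sketch below. It is self-contained and does not use FastDeploy: StubManager is a hypothetical stand-in that only mimics the role-based dispatch, the logger is a MagicMock, and the arguments passed to the logger methods are assumptions. The real test would use the actual resource manager and request objects.

```python
from unittest.mock import MagicMock


class StubManager:
    """Hypothetical stand-in for the resource manager; only the dispatch matters."""

    def __init__(self, splitwise_role, metrics_logger):
        self.splitwise_role = splitwise_role
        self.metrics_logger = metrics_logger

    def _log_console_scheduler_metrics(self, scheduled_reqs: list) -> None:
        # Takes a plain list of scheduled requests, as the report describes.
        if not scheduled_reqs:
            return
        if self.splitwise_role == "decode":
            self.metrics_logger.log_decode_bootstrap_batch(scheduled_reqs)
        else:
            self.metrics_logger.log_prefill_batch(scheduled_reqs)


def test_decode_role_prefill_task_logs_decode_bootstrap_batch():
    metrics_logger = MagicMock()
    manager = StubManager("decode", metrics_logger)
    request = MagicMock()  # placeholder for a scheduled prefill request

    # Pass a list directly instead of building a BatchRequest object.
    manager._log_console_scheduler_metrics([request])

    metrics_logger.log_decode_bootstrap_batch.assert_called_once()
    metrics_logger.log_prefill_batch.assert_not_called()


test_decode_role_prefill_task_logs_decode_bootstrap_batch()
```

Because the list is an ordinary Python list, the test needs no extra import, which also removes the F821 undefined-name finding.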

Related changes: tests/v1/test_resource_manager_v1.py (test added by the PR)

Link: View log

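Pre Commit: code style (confidence: high)

Pre Commit

  • Status: ❌ Failed
  • Error type: Code style
  • Confidence: High
  • Root cause summary: PR issue: flake8/ruff F821 detected that BatchRequest is undefined
  • Analyzer: ci_analyze_infra (fallback)

Root cause details:
flake8 and ruff both report tests/v1/test_resource_manager_v1.py:586:25: F821 undefined name 'BatchRequest'. This has the same root cause as the unit test failure: BatchRequest is not defined in the test file. Once the unit test is fixed, the code style check will pass as well.

Key log: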

flake8: tests/v1/test_resource_manager_v1.py:586:25: F821 undefined name 'BatchRequest'
ruff:   tests/v1/test_resource_manager_v1.py:586:25: F821 Undefined name `BatchRequest`

Suggested fix:

  1. Fix the undefined BatchRequest at tests/v1/test_resource_manager_v1.py L586 (the same fix as for run_tests_with_coverage); the code style check will then pass automatically

Link: View log
