[Cherry-Pick][CI] Sync dev optimizations to release/online/20260415 (#7602) #7857
/skip-ci ci_iluvatar
Thanks for your contribution!
Codecov Report
✅ All modified and coverable lines are covered by tests.
Coverage for release/online/20260415 (#7857):
| Metric | Value |
|---|---|
| Coverage | 73.40% |
| Files | 383 |
| Lines | 53782 |
| Branches | 8437 |
| Hits | 39478 |
| Misses | 11615 |
| Partials | 2689 |
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review | 2026-05-19 20:19:57
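📋 Review summary
PR overview: Cherry-picks CI optimizations to the release/online/20260415 branch, including Paddle version pinning, an OOM retry mechanism, pytest failure log collection, a git diff strategy fix, and PR template check improvements.
Scope of changes: .github/workflows/ (14 workflow files), scripts/ (5 scripts), tests/conftest.py
Impact tag: [CI]
Issues
| Severity | File | Summary |
|---|---|---|
| 🟡 Suggestion | scripts/coverage_run.sh:79 | The OOM retry interval `sleep 5` may be too short to release GPU memory in a heavily loaded CI environment |
| ❓ Question | .github/workflows/_build_linux_cu129.yml:186 | Does 3.5.0.dev20260508 have a matching wheel in both the cu129 and cu130 nightly indexes? |
📝 PR convention check
The title format [Cherry-Pick][CI] follows the Cherry-Pick convention, includes the official tag [CI], and references the original PR number. The PR description contains all required sections (Motivation / Modifications / Usage or Command / Accuracy Tests / Checklist) and is substantive. ✓
Overall assessment
This PR is a compliant cherry-pick of CI maintenance changes, covering Paddle version stabilization, the OOM retry mechanism, enhanced log collection, and the git diff strategy fix. The direction of the changes is correct and the code logic is clear. Before merging, we recommend confirming that the wheel for version 3.5.0.dev20260508 is actually available in the respective nightly indexes for cu129, cu130, and xpu-p800, so that the corresponding pipelines do not fail immediately because the package cannot be found.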
echo "==================== Retrying (${retry_count}/${max_retries}) ===================="
echo "Previous attempt was Killed, retrying..."
# Clean up before retry
sleep 5 # Wait a bit to let resources be released
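The hunk shows only the retry banner and a fixed wait; the loop around it is not visible here. A minimal sketch of how such a retry could be structured, with the fixed wait swapped for a bounded poll of GPU memory, might look like the following. Everything in it is an assumption for illustration: the function names `run_with_oom_retry`, `wait_for_gpu_memory`, and `gpu_mem_used_mib`, the variables `GPU_MEM_QUERY_CMD`, `GPU_FREE_THRESHOLD_MIB`, and `GPU_WAIT_TIMEOUT_S`, the 1024 MiB threshold, and the use of exit status 137 (128 + SIGKILL) to recognize a killed run. The actual script may detect "Killed" differently.

```shell
#!/usr/bin/env bash
# Sketch only: all names, variables, and thresholds below are hypothetical.

# Print the highest per-GPU used memory in MiB.
# GPU_MEM_QUERY_CMD is a hypothetical hook so the logic can run without a GPU.
gpu_mem_used_mib() {
    if [ -n "${GPU_MEM_QUERY_CMD:-}" ]; then
        ${GPU_MEM_QUERY_CMD}
    else
        nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits
    fi | sort -n | tail -n 1
}

# Poll until used memory is at or below the threshold, or the timeout expires.
# Returns 0 if memory dropped in time, 1 otherwise.
wait_for_gpu_memory() {
    local threshold_mib="$1" timeout_s="$2" interval_s="${3:-1}"
    local waited=0 used
    while :; do
        used="$(gpu_mem_used_mib)"
        if [ -n "${used}" ] && [ "${used}" -le "${threshold_mib}" ]; then
            return 0
        fi
        if [ "${waited}" -ge "${timeout_s}" ]; then
            return 1
        fi
        sleep "${interval_s}"
        waited=$((waited + interval_s))
    done
}

# Run a command, retrying only when it exits with 137 (killed by SIGKILL).
# Usage: run_with_oom_retry MAX_RETRIES COMMAND [ARGS...]
run_with_oom_retry() {
    local max_retries="$1"
    shift
    local retry_count=0 status
    while :; do
        status=0
        "$@" || status=$?
        # Any other status (success or an ordinary failure) is returned as-is.
        if [ "${status}" -ne 137 ] || [ "${retry_count}" -ge "${max_retries}" ]; then
            return "${status}"
        fi
        retry_count=$((retry_count + 1))
        echo "==================== Retrying (${retry_count}/${max_retries}) ===================="
        echo "Previous attempt was Killed, retrying..."
        wait_for_gpu_memory "${GPU_FREE_THRESHOLD_MIB:-1024}" "${GPU_WAIT_TIMEOUT_S:-30}" \
            || echo "GPU memory still in use after waiting; retrying anyway"
    done
}
```

A call such as `run_with_oom_retry 2 python -m pytest tests/` would then retry at most twice. The poll is capped by a timeout so that memory held by unrelated jobs on a shared runner cannot stall the pipeline indefinitely.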
🟡 Suggestion: The OOM retry interval may be too short to release GPU memory
After an OOM/SIGKILL, the CUDA driver reclaims GPU memory asynchronously. When the CI machine runs several jobs concurrently, 5 seconds is often not enough, so the next retry may hit OOM again.
Consider lengthening the wait (for example, to 30 seconds), or checking GPU memory usage before retrying:
sleep 30 # Give the CUDA driver more time to release GPU memory

python -m pip install paddlepaddle-gpu==${PADDLEVERSION} -i https://www.paddlepaddle.org.cn/packages/stable/cu129/
else
python -m pip install --pre paddlepaddle-gpu -i https://www.paddlepaddle.org.cn/packages/nightly/cu129/
python -m pip install paddlepaddle-gpu==3.5.0.dev20260508 -i https://www.paddlepaddle.org.cn/packages/nightly/cu129/
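❓ Question: Does the package for this date exist in the cu129/cu130 nightly indexes?
All CUDA variants (cu126/cu129/cu130) and XPU are pinned to the same version, 3.5.0.dev20260508, but nightly packages are not guaranteed to have a build with the same date for every CUDA variant. If the cu129 or cu130 index lacks the package for that date, the corresponding pipeline will fail immediately.
We recommend verifying this before merging:
pip index versions --pre paddlepaddle-gpu -i https://www.paddlepaddle.org.cn/packages/nightly/cu129/ 2>&1 | grep -F 3.5.0.dev20260508
pip index versions --pre paddlepaddle-gpu -i https://www.paddlepaddle.org.cn/packages/nightly/cu130/ 2>&1 | grep -F 3.5.0.dev20260508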
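The same check can be wrapped so that it covers several indexes in one pass and fails if any of them lacks the pinned build. This is a sketch under stated assumptions rather than part of the PR: the function names `list_versions`, `index_has_version`, and `check_indexes` and the `LIST_VERSIONS_CMD` override are hypothetical, and it relies on `pip index versions` printing a comma-separated version list. `--pre` is passed because pip treats `.dev` builds as pre-releases.

```shell
#!/usr/bin/env bash
# Sketch only: function names and LIST_VERSIONS_CMD are hypothetical.

# Print the versions an index advertises for a package.
# LIST_VERSIONS_CMD lets a stub stand in for pip (for example, offline).
list_versions() {
    local package="$1" index_url="$2"
    if [ -n "${LIST_VERSIONS_CMD:-}" ]; then
        ${LIST_VERSIONS_CMD} "${package}" "${index_url}"
    else
        python -m pip index versions --pre "${package}" -i "${index_url}" 2>&1
    fi
}

# Succeed only if the exact version appears as a whole token, so that
# 3.5.0.dev20260508 does not match 3.5.0.dev202605081.
index_has_version() {
    local package="$1" version="$2" index_url="$3"
    list_versions "${package}" "${index_url}" \
        | tr -s ', ()' '\n\n\n\n' \
        | grep -Fxq -- "${version}"
}

# Check every index given; return 1 if the version is missing from any of them.
# Usage: check_indexes PACKAGE VERSION INDEX_URL [INDEX_URL...]
check_indexes() {
    local package="$1" version="$2"
    shift 2
    local missing=0 url
    for url in "$@"; do
        if index_has_version "${package}" "${version}" "${url}"; then
            echo "OK      ${url}"
        else
            echo "MISSING ${url}"
            missing=1
        fi
    done
    return "${missing}"
}
```

For the two indexes named in this comment, the call would be `check_indexes paddlepaddle-gpu 3.5.0.dev20260508 https://www.paddlepaddle.org.cn/packages/nightly/cu129/ https://www.paddlepaddle.org.cn/packages/nightly/cu130/`. A non-zero exit status makes it usable as a pre-merge gate.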
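The CI report is generated from the following code (updated every 30 minutes):
1 Task overview
2 Task status summary
2.1 Required tasks: 5/8 passed
2.2 Optional tasks: 21/23 passed
3 Failure details (required only)
Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: test failure (confidence: high)
Failed cases:
Root cause details: Key logs: Fix suggestions:
Fix suggestion summary: Update the test mock function signatures to match the newly added parameters, and check dtype compatibility
Related change: This PR pins the Paddle version to
Link: View logs
xpu_4cards_case_test / run_xpu_4cards_cases: service startup failure (confidence: high)
Failed cases:
Root cause details: Key logs: Fix suggestions:
Fix suggestion summary: Verify that paddlepaddle-xpu==3.5.0.dev20260508 is compatible with the CI XPU machines
Related change:
Link: View logs
xpu_8cards_case_test / run_xpu_8cards_cases: service runtime error (confidence: high)
Failed cases:
Root cause details: Key logs: Fix suggestions:
Fix suggestion summary: Check the XPU case_logs for the stack trace of the 500 error from the PD disaggregation server
Related change:
Link: View logs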
55eb3a6 into PaddlePaddle:release/online/20260415
Motivation
Modifications
Cherry-pick of #7760, #7602, #7601, and #7405 to release/online/20260415.
Usage or Command
N/A
Accuracy Tests
N/A
Checklist
[FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
pre-commit before commit.
release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.