[Cherry-Pick][CI] Sync dev optimizations to release/online/20260415 (#7602) #7857

Merged: freeliuzc merged 1 commit into PaddlePaddle:release/online/20260415 from EmmonsCurse:ci_optimize_online_0415 on May 19, 2026
@EmmonsCurse
Collaborator

Motivation

  1. Some tests may intermittently fail due to OOM or process kill issues, especially under constrained CI resources. Previously, a high-risk OOM test list was maintained to mitigate this, but it increases maintenance overhead. Introducing a retry mechanism provides a more robust and scalable solution to handle transient failures without excluding tests.
  2. Enhance CI debugging capability by collecting detailed pytest failure logs, reducing troubleshooting time for flaky or failed cases.
  3. Using git diff upstream/$BRANCH directly may include unrelated changes when the local branch is not strictly rebased, leading to incorrect detection results in CI checks.
  4. The checklist validation in scripts/CheckPRTemplate.py introduces unnecessary restrictions during PR submission and may block workflow efficiency for contributors.
  5. A recent upstream Paddle change ([Compat] Use public compat APIs for proxy controls, Paddle#78923) introduces instability and unexpected CI failures. To avoid blocking ongoing development and to keep CI stable, the Paddle version is temporarily pinned to a known working build.
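
The retry idea in item 1 can be sketched roughly as follows. This is an illustration only, not the code in scripts/coverage_run.sh: the names `was_killed` and `run_with_retry`, the default retry count, and the default wait time are all assumptions made for the example. It relies on the convention that a process ended by SIGKILL is reported as return code -9 by Python's subprocess module, or as exit status 137 (128 + 9) by a shell.

```python
import signal
import subprocess
import time


def was_killed(returncode):
    """Return True if the exit status indicates the process was SIGKILLed.

    subprocess reports death by signal as -signum; a shell reports 128 + signum.
    """
    return returncode in (-signal.SIGKILL, 128 + signal.SIGKILL)


def run_with_retry(cmd, max_retries=2, wait_seconds=5.0,
                   run=subprocess.call, sleep=time.sleep):
    """Run cmd, retrying only when the previous attempt was killed.

    `run` and `sleep` are injectable so the logic can be exercised
    without launching real processes (hypothetical design choice).
    """
    attempt = 0
    while True:
        returncode = run(cmd)
        # Ordinary failures and successes are returned immediately;
        # only a kill (typically the OOM killer) triggers a retry.
        if not was_killed(returncode) or attempt >= max_retries:
            return returncode
        attempt += 1
        print(f"Retrying ({attempt}/{max_retries}) after the process was killed")
        sleep(wait_seconds)
```

Retrying only on a kill signal keeps genuine test failures visible: an assertion failure still fails the job on the first attempt.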

Modifications

Cherry-pick of #7760, #7602, #7601, and #7405 to release/online/20260415.

Usage or Command

N/A

Accuracy Tests

N/A

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@EmmonsCurse
Collaborator Author

/skip-ci ci_iluvatar
/skip-ci ci_hpu

@paddle-bot

paddle-bot Bot commented May 19, 2026

Thanks for your contribution!

@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (release/online/20260415@18665da). Learn more about missing BASE report.

Additional details and impacted files
@@                    Coverage Diff                     @@
##             release/online/20260415    #7857   +/-   ##
==========================================================
  Coverage                           ?   73.40%           
==========================================================
  Files                              ?      383           
  Lines                              ?    53782           
  Branches                           ?     8437           
==========================================================
  Hits                               ?    39478           
  Misses                             ?    11615           
  Partials                           ?     2689           
Flag Coverage Δ
GPU 73.40% <ø> (?)

Flags with carried forward coverage won't be shown.


@PaddlePaddle-bot left a comment
🤖 Paddle-CI-Agent | pr_review | 2026-05-19 20:19:57

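📋 Review summary

PR overview: Cherry-picks CI optimizations to the release/online/20260415 branch, including Paddle version pinning, an OOM retry mechanism, pytest failure-log collection, a fix to the git diff strategy, and relaxed PR template checks.

Scope of changes: .github/workflows/ (14 workflow files), scripts/ (5 scripts), tests/conftest.py

Impact tag: [CI]

Issues

| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | scripts/coverage_run.sh:79 | The sleep 5 OOM retry interval may be too short to release GPU memory in a heavily loaded CI environment |
| ❓ Question | .github/workflows/_build_linux_cu129.yml:186 | Does 3.5.0.dev20260508 have a matching wheel in both the cu129 and cu130 nightly indexes? |

📝 PR convention check

The title format [Cherry-Pick][CI] follows the Cherry-Pick convention, includes the official tag [CI], and references the original PR number. The PR description contains all required sections (Motivation / Modifications / Usage or Command / Accuracy Tests / Checklist) with substantive content. ✓

Overall assessment

This PR is a compliant cherry-pick of CI maintenance changes, covering Paddle version stabilization, the OOM retry mechanism, improved log collection, and the git diff strategy fix. The direction is right and the code logic is clear. Before merging, confirm that the wheel for version 3.5.0.dev20260508 is actually available in the respective nightly indexes for cu129, cu130, and xpu-p800, so that the corresponding pipelines do not fail immediately because the package cannot be found.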

Comment thread scripts/coverage_run.sh
echo "==================== Retrying (${retry_count}/${max_retries}) ===================="
echo "Previous attempt was Killed, retrying..."
# Clean up before retry
sleep 5 # Wait a bit to let resources be released
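🟡 Suggestion: The OOM retry interval may be too short to release GPU memory

After an OOM/SIGKILL, the CUDA driver reclaims GPU memory asynchronously. When the CI machine runs multiple jobs concurrently, 5 seconds is often not enough, so the next retry may hit OOM again.

Consider lengthening the wait (e.g. to 30 seconds), or checking GPU memory usage before retrying:

sleep 30  # give the CUDA driver more time to release GPU memory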
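
The second option in the suggestion above (checking GPU memory before retrying) could look roughly like the sketch below. It is not part of the PR. The function names, the 1024 MiB threshold, and the timeout values are assumptions chosen for illustration, and it presumes `nvidia-smi` is available on the CI machine.

```python
import subprocess
import time


def query_used_mib():
    """Return the used memory of every visible GPU in MiB, via nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        text=True,
    )
    return [int(value) for value in out.split()]


def wait_for_gpu_memory(threshold_mib=1024, timeout_s=60.0, poll_s=5.0,
                        query=query_used_mib, sleep=time.sleep,
                        clock=time.monotonic):
    """Poll until every GPU uses at most threshold_mib, or the timeout expires.

    Returns True if memory dropped below the threshold, False on timeout.
    `query`, `sleep`, and `clock` are injectable for testing (hypothetical).
    """
    deadline = clock() + timeout_s
    while True:
        if max(query(), default=0) <= threshold_mib:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

Compared with a fixed `sleep`, polling returns as soon as memory is free and gives up after a bounded time. On a shared CI machine other jobs may legitimately hold GPU memory, so the threshold would need tuning for the actual runners.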

python -m pip install paddlepaddle-gpu==${PADDLEVERSION} -i https://www.paddlepaddle.org.cn/packages/stable/cu129/
else
python -m pip install --pre paddlepaddle-gpu -i https://www.paddlepaddle.org.cn/packages/nightly/cu129/
python -m pip install paddlepaddle-gpu==3.5.0.dev20260508 -i https://www.paddlepaddle.org.cn/packages/nightly/cu129/
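❓ Question: Does a package for this date exist in the cu129/cu130 nightly indexes?

All CUDA variants (cu126/cu129/cu130) and XPU are pinned to the same version, 3.5.0.dev20260508, but nightly packages are not guaranteed to have a build with the same date for every CUDA variant. If the cu129 or cu130 index lacks a package for that date, the corresponding pipeline will fail immediately.

Suggested verification before merging (--pre is required because pip index versions hides pre-release builds such as .dev versions by default):

pip index versions --pre paddlepaddle-gpu -i https://www.paddlepaddle.org.cn/packages/nightly/cu129/ 2>&1 | grep -F 3.5.0.dev20260508
pip index versions --pre paddlepaddle-gpu -i https://www.paddlepaddle.org.cn/packages/nightly/cu130/ 2>&1 | grep -F 3.5.0.dev20260508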
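
For scripted use, the same check can be done with an exact match instead of a substring search. The sketch below parses text in the shape that `pip index versions` prints, with a line beginning "Available versions:" followed by a comma-separated list. The helper names are hypothetical, and the parsing assumes that output layout stays as it is in current pip releases.

```python
def available_versions(pip_index_output):
    """Extract the version list from the text printed by `pip index versions`."""
    for line in pip_index_output.splitlines():
        line = line.strip()
        if line.startswith("Available versions:"):
            listed = line.split(":", 1)[1]
            return [v.strip() for v in listed.split(",") if v.strip()]
    # No such line, e.g. pip reported that no matching distribution was found.
    return []


def has_version(pip_index_output, wanted):
    """Return True only if `wanted` appears as an exact entry in the list."""
    return wanted in available_versions(pip_index_output)
```

An exact comparison avoids false positives from a substring search, where a longer version string containing the wanted one would also match.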

@PaddlePaddle-bot

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-19 20:48:47

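CI report generated from the following code (refreshed every 30 minutes):


1 Task overview

⚠️ 3 Required tasks failed and must be fixed before merging.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Pending | Skipped |
|---|---|---|---|---|---|---|
| 31 (0) | 31 | 26 | 4 | 0 | 0 | 1 |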

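2 Task status summary

2.1 Required tasks: 5/8 passed

Required tasks block merging; failures must be handled first.

| Status | Task | Duration | Root cause | Suggested fix | Log | Rerun |
|---|---|---|---|---|---|---|
| ❌ | Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage | 1h17m | PR issue: 4 cases failed (incompatible mock signature plus dtype error) | Update the test mock signature; check int32/int64 dtypes | Job | - |
| ❌ | xpu_4cards_case_test / run_xpu_4cards_cases | 30m15s | PR issue: XPU service did not start; all 15 cases were refused connections | Verify paddlepaddle-xpu==3.5.0.dev20260508 compatibility | Job | - |
| ❌ | xpu_8cards_case_test / run_xpu_8cards_cases | 19m45s | PR issue: XPU PD disaggregated inference returns a 500 error | Check case_logs to investigate the XPU PD disaggregation 500 error | Job | - |
| ✅ | The other 5 Required tasks passed | - | - | - | - | - |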

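2.2 Optional tasks: 21/23 passed

Optional tasks do not block merging; failures are for reference only.

| Status | Task | Duration | Log | Rerun |
|---|---|---|---|---|
| ❌ | Trigger Jenkins for PR | 16m51s | Job | - |
| ✅ | The other 21 optional tasks passed (including 1 skipped) | - | - | - |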

3 Failure details (Required only)

Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: test failure (confidence: high)

  • Status: ❌ Failed
  • Error type: Test failure
  • Confidence: High
  • Root cause summary: 4 unit tests failed: a mock lacks the use_fused_cast parameter, a test object lacks the fd_config attribute, and there is a dtype mismatch
  • Analyzer: ci_analyze_unittest_fastdeploy

Failed cases:

| Test | Error | Root cause |
|---|---|---|
| layers/test_fused_moe_cutlass_backend.py::test_apply_tp_with_dispatch_and_reduce | TypeError: unexpected keyword argument 'use_fused_cast' | Mock not updated for the new parameter |
| model_executor/test_ep.py::test_eprunner_moe_select_noaux_tc_without_redundant | AttributeError: no attribute 'fd_config' | The test's SimpleNamespace lacks fd_config |
| layers/test_speculative_sampler.py::test_speculative_sampler_logprobs | ValueError: int64 vs int32 dtype mismatch | Paddle version pinning introduces a dtype compatibility issue |
| input/test_process_stop_token_ids.py | No explicit error message | Possibly a timeout or process crash |

Root cause details:
The production code at fused_moe_cutlass_backend.py:400 now passes a use_fused_cast keyword argument when calling get_moe_scores(), but the test mock fake_get_moe_scores has not had its signature updated to match. ep.py:534 now accesses layer.fd_config.scheduler_config.enable_moe_scores_elementwise_fuse, while the test's layer = SimpleNamespace(...) does not include an fd_config attribute. The dtype mismatch (int64 vs int32) in test_speculative_sampler.py appears at sampler.py:817 and import_ops.py:79, and may be related to low-level API differences introduced by this PR pinning Paddle to 3.5.0.dev20260508.

Key logs:

TypeError: fake_get_moe_scores() got an unexpected keyword argument 'use_fused_cast'
fastdeploy/model_executor/layers/moe/fused_moe_cutlass_backend.py:400

AttributeError: 'types.SimpleNamespace' object has no attribute 'fd_config'
fastdeploy/model_executor/layers/moe/ep.py:534

ValueError: The type of data we are trying to retrieve (int64) does not match
the type of data (int32) currently contained in the container.
fastdeploy/import_ops.py:79

Suggested fixes:

  1. tests/layers/test_fused_moe_cutlass_backend.py: append a use_fused_cast=False parameter to the signature of fake_get_moe_scores so that it accepts the new call made at fused_moe_cutlass_backend.py:400
  2. tests/model_executor/test_ep.py: add an fd_config mock to layer = SimpleNamespace(...) (containing scheduler_config.enable_moe_scores_elementwise_fuse=False)
  3. fastdeploy/model_executor/layers/sample/sampler.py: check whether the tensors passed to the op in build_sampling_params need an explicit .cast(paddle.int32), or verify the dtype compatibility of 3.5.0.dev20260508 with that op
  4. tests/input/test_process_stop_token_ids.py: inspect the full log to find the cause of the crash
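
Fixes 1 and 2 can be sketched as below. This is a hedged illustration rather than the repository's actual test code: the real positional parameters of get_moe_scores are not shown in this report, so the stand-in accepts arbitrary arguments, and its return value is a placeholder.

```python
from types import SimpleNamespace


def fake_get_moe_scores(*args, use_fused_cast=False, **kwargs):
    """Hypothetical stand-in for the test mock.

    Accepting use_fused_cast (defaulting to False) keeps the mock
    compatible with callers that pass the new keyword and with
    callers that do not. The return value is a placeholder.
    """
    return {"args": args, "use_fused_cast": use_fused_cast}


# A test layer object that exposes the attribute chain accessed at ep.py:534:
# layer.fd_config.scheduler_config.enable_moe_scores_elementwise_fuse
layer = SimpleNamespace(
    fd_config=SimpleNamespace(
        scheduler_config=SimpleNamespace(
            enable_moe_scores_elementwise_fuse=False,
        ),
    ),
)

# Both call styles now succeed instead of raising TypeError / AttributeError.
fake_get_moe_scores("gate_out")
fake_get_moe_scores("gate_out", use_fused_cast=True)
fuse_enabled = layer.fd_config.scheduler_config.enable_moe_scores_elementwise_fuse
```

Nesting SimpleNamespace objects is enough here because the production code only reads one attribute path. A fuller mock would be needed if more of fd_config were accessed.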

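Fix summary: Update the test mock signatures to match the new parameter, and check dtype compatibility

Related changes: This PR pins Paddle to paddlepaddle-gpu==3.5.0.dev20260508 (.github/workflows/_unit_test_coverage.yml), which may be related to the dtype issue in test_speculative_sampler.py. The incompatible test mocks stem from interface changes already present in the codebase that were never propagated to the tests.

Link: View logs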

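xpu_4cards_case_test / run_xpu_4cards_cases: service startup failure (confidence: high)

  • Status: ❌ Failed
  • Error type: Infrastructure / service startup failure
  • Confidence: High
  • Root cause summary: The XPU inference service failed to start; all 15 cases were refused connections
  • Analyzer: ci_analyze_unittest_fastdeploy

Failed cases:

| Test | Error | Root cause |
|---|---|---|
| test_ep4tp4_all2all / test_w4a8 and others (15 cases in total) | openai.APIConnectionError: Connection error / ConnectError: [Errno 111] Connection refused | FastDeploy service port not ready |

Root cause details:
All 15 XPU 4-card test cases failed with connection refused, which is typical of a service process that never started successfully. This PR changes paddlepaddle-xpu from --pre (latest nightly) to the pinned version 3.5.0.dev20260508 (.github/workflows/_xpu_4cards_case_test.yml). If that specific build has problems on the CI XPU machines, such as a missing operator (e.g. moe_permute) or an ABI incompatibility, FastDeploy would crash on startup and every test connection would be refused.

Key logs: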

openai.APIConnectionError: Connection error.
httpcore.ConnectError: [Errno 111] Connection refused
FAILED tests/xpu_ci/4cards_cases/test_ep4tp4_all2all.py::test_ep4tp4_all2all
FAILED tests/xpu_ci/4cards_cases/test_w4a8.py::test_w4a8 - Failed: W4A8测试失败
... (all 15 cases failed)

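Suggested fixes:

  1. Check the XPU 4-card artifact logs (xpu-4cards-case-logs) for the service startup log and crash stack trace
  2. Verify that paddlepaddle-xpu==3.5.0.dev20260508 installs and runs correctly on the CI XPU (P800) machines
  3. If the version is incompatible, consider pinning to a version already validated on the XPU machines, or reverting to --pre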
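
Because "Connection refused" only says that nothing was listening on the port, a readiness probe run before the test cases can help separate a service that never started from a genuine test failure. The sketch below is an assumption-laden illustration, not part of the PR or of the XPU test scripts. The function names and default timings are hypothetical, and the host and port must be supplied by the caller.

```python
import socket
import time


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers ConnectionRefusedError (errno 111 on Linux) and timeouts.
        return False


def wait_for_port(host, port, timeout_s=60.0, poll_s=1.0):
    """Poll until host:port accepts connections or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if port_is_open(host, port):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

If `wait_for_port` returns False, the job can fail once with a clear "service did not start" message and surface the server log, rather than reporting 15 separate connection errors.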

Fix summary: Verify the compatibility of paddlepaddle-xpu==3.5.0.dev20260508 with the CI XPU machines

Related changes: .github/workflows/_xpu_4cards_case_test.yml: paddlepaddle-xpu changed from --pre to ==3.5.0.dev20260508

Link: View logs

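xpu_8cards_case_test / run_xpu_8cards_cases: service runtime error (confidence: high)

  • Status: ❌ Failed
  • Error type: Service runtime error
  • Confidence: High
  • Root cause summary: In the XPU 8-card PD disaggregation scenario, inference requests return an HTTP 500 internal error
  • Analyzer: ci_analyze_unittest_fastdeploy

Failed cases:

| Test | Error | Root cause |
|---|---|---|
| test_pd_21b_ep4tp1::test_pd_separation | openai.InternalServerError: Internal Server Error | PD disaggregation service returns 500 |
| test_pd_21b_ep4tp4::test_pd_separation | openai.InternalServerError: Internal Server Error | PD disaggregation service returns 500 |
| test_pd_21b_ep4tp4_cudagraph::test_pd_separation | openai.InternalServerError: Internal Server Error | PD disaggregation service returns 500 |
| test_pd_p_tp4ep4_d_tp1ep4::test_pd_separation | openai.InternalServerError: Internal Server Error | PD disaggregation service returns 500 |

Root cause details:
In the XPU 8-card tests the service started normally (no connection refused), but all 4 PD (Prefill-Decode) disaggregation cases received HTTP 500 errors when sending inference requests. The logs show that the service completed initialization stages such as KV Cache setup and BKCL/RDMA communication, so startup itself succeeded. The 500 errors occur during inference and may be related to changed operator behavior in version 3.5.0.dev20260508 under XPU PD disaggregation mode.

Key logs: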

openai.InternalServerError: Internal Server Error
tests/xpu_ci/8cards_cases/test_pd_21b_ep4tp1.py:292: in test_pd_separation
E   Failed: PD分离测试失败: Internal Server Error
======================== 4 failed in 985.68s (0:16:25) =========================

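Suggested fixes:

  1. Check the XPU 8-card artifact logs (xpu-8cards-case-logs) to get the detailed server-side Python stack trace for the PD disaggregation 500 error
  2. Confirm whether paddlepaddle-xpu==3.5.0.dev20260508 has known compatibility issues in the XPU PD disaggregation (mixed EP+TP parallelism) scenario
  3. If it is a Paddle version issue, consider using a separately validated version for the XPU jobs

Fix summary: Check the XPU case_logs for the server-side stack trace of the PD disaggregation 500 error

Related changes: .github/workflows/_xpu_8cards_case_test.yml: paddlepaddle-xpu changed from --pre to ==3.5.0.dev20260508

Link: View logs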

@freeliuzc freeliuzc merged commit 55eb3a6 into PaddlePaddle:release/online/20260415 May 19, 2026
28 of 32 checks passed