[Feature] Add server-level token limits and prompt truncation control by luukunn · Pull Request #7842 · PaddlePaddle/FastDeploy

luukunn · 2026-05-18T06:33:40Z

Motivation

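This PR adds unified server-level defaults for the length parameters, so that generation-length behavior can be controlled through server configuration even when a request does not pass the request-level parameters explicitly. It also adds an input token length limit that rejects over-long requests early.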

Modifications

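Add a server-level length control config, ServingLimitsConfig, attached to FDConfig for unified management.
Add the following server-level parameters to the CLI / config options: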
- max_completion_tokens
- reasoning_max_tokens
- response_max_tokens
- min_completion_tokens
- input_max_tokens
- truncate_prompt_tokens
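During initialization of async_llm, common_engine, and engine_client, inject the server-level default length config into data_processor.
Update the text and multimodal request handling logic:
- When a request does not set max_tokens explicitly, the server-level max_completion_tokens is used by default, bounded by the remaining context length;
- When a request sets max_tokens explicitly, it is limited by both the server-level cap and the remaining context length;
- reasoning_max_tokens / response_max_tokens are constrained not to exceed the effective max_tokens;
- min_tokens follows the max(server_value, request_value) rule, and an error is raised if it exceeds max_tokens.
Add input_max_tokens validation:
- The input length is checked before the prompt is truncated;
- When the number of input tokens exceeds input_max_tokens, the request is rejected.
Add the truncate_prompt_tokens policy:
- Enabled by default: prompts exceeding max_model_len are truncated automatically;
- When disabled, prompts exceeding max_model_len raise an error.
Adjust the default max_tokens handling in engine / engine_client:
- If max_completion_tokens is configured, it is used as the default generation length;
- Otherwise the original default behavior based on max_model_len is kept.
Add parameter documentation in English and Chinese: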
- docs/parameters.md
- docs/zh/parameters.md
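
The input check and truncation policy described above can be sketched roughly as follows. This is a minimal illustration, not the FastDeploy implementation: the helper name `check_and_truncate_prompt`, the exact boundary condition, and the choice to keep the head of the prompt are all assumptions made for the example.

```python
from typing import List, Optional


def check_and_truncate_prompt(
    prompt_token_ids: List[int],
    max_model_len: int,
    input_max_tokens: Optional[int] = None,
    truncate_prompt_tokens: bool = True,
) -> List[int]:
    """Hypothetical helper: reject over-long input, then truncate or raise."""
    n = len(prompt_token_ids)

    # The input_max_tokens check runs first, on the untruncated prompt.
    if input_max_tokens is not None and n > input_max_tokens:
        raise ValueError(f"input length {n} exceeds input_max_tokens {input_max_tokens}")

    # Assumption: "exceeds" means strictly longer than max_model_len.
    if n > max_model_len:
        if not truncate_prompt_tokens:
            raise ValueError(f"input length {n} exceeds max_model_len {max_model_len}")
        # Assumption: keep the first max_model_len tokens; the real
        # truncation strategy may keep a different part of the prompt.
        return prompt_token_ids[:max_model_len]

    return prompt_token_ids
```

Because the input_max_tokens check happens before truncation, a prompt that is longer than input_max_tokens is rejected even when truncation is enabled and would have shortened it.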

Usage or Command

Example launch arguments:

--max-completion-tokens 1024 \
--reasoning-max-tokens 512 \
--response-max-tokens 512 \
--min-completion-tokens 1 \
--input-max-tokens 4096 \
--truncate-prompt-tokens

To disable automatic truncation of over-long prompts, use:

--no-truncate-prompt-tokens

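Behavior notes:

When a request does not specify max_tokens, the server-level max_completion_tokens is used by default, bounded by the remaining context length;
When a request specifies max_tokens, the final value is limited to min(request value, server-level cap, remaining context length);
When a request does not specify reasoning_max_tokens / response_max_tokens, the server-level defaults can be used;
The final values of reasoning_max_tokens / response_max_tokens never exceed max_tokens;
The final min_tokens is the larger of the server-side setting and the request value; if it exceeds max_tokens, an error is raised;
When the number of input prompt tokens exceeds input_max_tokens, the request is rejected;
When the input exceeds max_model_len:
- if truncate_prompt_tokens is enabled, the prompt is truncated automatically;
- if it is disabled, an error is raised.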
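
As a worked example of the rules above, the sketch below resolves the limits for one request using the server values from the example launch arguments. The function `resolve_limits`, the dict-based inputs, and the max_model_len of 8192 are assumptions for illustration only; the actual processor code may be organized differently.

```python
from typing import Optional


def min_non_none(*values: Optional[int]) -> int:
    """Smallest of the values that are not None."""
    return min(v for v in values if v is not None)


def resolve_limits(request: dict, server: dict, prompt_len: int, max_model_len: int) -> dict:
    """Illustrative resolution of the length limits for a single request."""
    context_remaining = max(1, max_model_len - prompt_len)
    max_tokens = min_non_none(
        request.get("max_tokens"), server.get("max_completion_tokens"), context_remaining
    )
    resolved = {"max_tokens": max_tokens}

    # Sub-limits are only set when the request or the server provides one.
    for key in ("reasoning_max_tokens", "response_max_tokens"):
        if request.get(key) is not None or server.get(key) is not None:
            resolved[key] = min_non_none(request.get(key), server.get(key), max_tokens)

    # min_tokens is the larger of the request and server values.
    mins = [v for v in (request.get("min_tokens"), server.get("min_completion_tokens")) if v is not None]
    if mins:
        min_tokens = max(mins)
        if min_tokens > max_tokens:
            raise ValueError(f"min_tokens ({min_tokens}) must not exceed max_tokens ({max_tokens})")
        resolved["min_tokens"] = min_tokens
    return resolved


server = {
    "max_completion_tokens": 1024,
    "reasoning_max_tokens": 512,
    "response_max_tokens": 512,
    "min_completion_tokens": 1,
}

print(resolve_limits({}, server, prompt_len=3000, max_model_len=8192))
# {'max_tokens': 1024, 'reasoning_max_tokens': 512, 'response_max_tokens': 512, 'min_tokens': 1}
print(resolve_limits({"max_tokens": 256}, server, prompt_len=3000, max_model_len=8192))
# {'max_tokens': 256, 'reasoning_max_tokens': 256, 'response_max_tokens': 256, 'min_tokens': 1}
```

In the second call the request value of 256 is the smallest candidate, so the reasoning and response sub-limits are pulled down to 256 as well.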

Accuracy Tests

This PR does not modify model forward computation or kernel behavior, so accuracy tests are unaffected.

Checklist

Add at least a tag in the PR title.
- Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- You can add new tags based on the PR content, but the semantics must be clear.
Format your code, run pre-commit before commit.
Add unit tests. Please write the reason in this PR if no unit tests.
Provide accuracy results.
If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot · 2026-05-18T06:33:53Z

Thanks for your contribution!

Copilot

Pull request overview

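This PR introduces several "default token length limit" settings on the server side (max_completion_tokens / reasoning_max_tokens / response_max_tokens / min_completion_tokens / input_max_tokens) that can be set as server-level defaults through the CLI. When a request does not carry the corresponding field, these defaults are used, and requests exceeding input_max_tokens are rejected.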

Changes:

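Add 5 length-related parameters to EngineArgs/ModelConfig and expose them in the CLI and the docs
Add set_server_defaults to BaseDataProcessor and call it at each entry point (engine_client / async_llm / engine / common_engine) to sync the server defaults
Add "reject over-long input" logic and "take the minimum of user value / server default / context limit" merge logic to base_processor.py and multimodal_processor.py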

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

File	Description
fastdeploy/engine/args_utils.py	Adds 5 server-level token length parameters and the corresponding CLI options
fastdeploy/config.py	`ModelConfig` initializes the new fields (default None / 1) to accept the new parameters
fastdeploy/input/base_processor.py	Adds `set_server_defaults` and the length merge/reject logic in `process_request_dict`
fastdeploy/input/multimodal_processor.py	Adds the same length merge/reject logic to the multimodal processing flow
fastdeploy/entrypoints/engine_client.py	Calls `set_server_defaults` and falls back to `max_completion_tokens` when `max_tokens` is missing
fastdeploy/engine/engine.py	Same as above: injects server defaults and prefers `max_completion_tokens` as the default
fastdeploy/engine/common_engine.py	Injects server defaults after creating data_processor
fastdeploy/engine/async_llm.py	Injects server defaults after creating data_processor
docs/parameters.md / docs/zh/parameters.md	Docs updated with descriptions of the 5 new parameters

PaddlePaddle-bot · 2026-05-18T07:00:45Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-19 21:45:19

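This CI report is generated from the following code (updated every 30 minutes):

PR commit: 6e1fa26
Merge base: bda1756 (branch: develop)
View full diff
CI details

1 Task overview

All required tasks passed ✅; merging is recommended.

Total runs (reruns)	Total tasks	✅ Passed	❌ Failed	⏳ Running	⏸️ Pending	Skipped
46(0)	46	45	0	0	1	0

2 Task status summary

2.1 Required tasks: 10/10 passed

Required tasks block merging; failures must be handled first.

Status	Task	Duration	Root cause	Suggested fix	Log	Rerun
✅	All 10 required tasks passed	-	-	-	-	-

2.2 Optional tasks: 35/36 passed

Optional tasks do not block merging; failures are for reference only.

Status	Task	Duration	Log	Rerun
⏸️	`CI_HPU`	-	-	-
✅	The remaining 35 optional tasks passed	-	-	-

3 Failure details (required only)

No required tasks failed.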

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.

+        model_group.add_argument(
+            "--truncate-prompt-tokens",
+            type=lambda x: x.lower() in ("true", "1", "yes"),
+            default=EngineArgs.truncate_prompt_tokens,
+            help="Whether to truncate prompts that exceed max_model_len. "
+            "If True (default), prompts are silently truncated. "
+            "If False, a ValueError is raised.",
+        )
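
The Usage section lists both --truncate-prompt-tokens and --no-truncate-prompt-tokens, whereas an option declared with a `type=` callable, as in the diff above, expects a value after the flag (for example `--truncate-prompt-tokens false`). One way to get the paired on/off flags is `argparse.BooleanOptionalAction` (Python 3.9+). The sketch below is a standalone parser for illustration; `build_parser` is a hypothetical name, and whether the merged code uses this action is not confirmed here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Standalone demo parser, not the FastDeploy argument parser."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--truncate-prompt-tokens",
        # Generates both --truncate-prompt-tokens and --no-truncate-prompt-tokens.
        action=argparse.BooleanOptionalAction,
        default=True,
        help="Whether to truncate prompts that exceed max_model_len.",
    )
    return parser


parser = build_parser()
print(parser.parse_args([]).truncate_prompt_tokens)                                # True
print(parser.parse_args(["--truncate-prompt-tokens"]).truncate_prompt_tokens)      # True
print(parser.parse_args(["--no-truncate-prompt-tokens"]).truncate_prompt_tokens)   # False
```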


Copilot

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated no new comments.

Copilot

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated no new comments.

PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-05-19 19:53:45

📋 Review Summary

PR overview: adds server-level token length limits and prompt truncation control, supporting six server-level parameters: max_completion_tokens, reasoning_max_tokens, response_max_tokens, min_completion_tokens, input_max_tokens, truncate_prompt_tokens.

Scope of changes: fastdeploy/config.py, engine/args_utils.py, engine/{async_llm,common_engine,engine}.py, entrypoints/engine_client.py, input/{base_processor,multimodal_processor}.py, docs/

Impact tags: [FDConfig] [Engine] [APIServer] [DataProcessor] [Feature] [Docs]

Issues

Severity	File	Summary
🟡 Suggestion	`fastdeploy/input/base_processor.py` & `multimodal_processor.py`	About 50 lines of token limit logic are duplicated verbatim and should be extracted into the base class
🟡 Suggestion	`fastdeploy/entrypoints/engine_client.py`	`format_and_add_data` presets `max_tokens=max_completion_tokens`, so the processor clamps a second time; responsibilities are mixed
❓ Question	`fastdeploy/config.py` `ServingLimitsConfig.__init__`	`value != "None"` compares against a string instead of using `value is not None`
📝 PR conventions	Checklist	`[ ] Add unit tests` is unchecked although the PR adds a large number of unit tests

🟡 Suggestion 1: about 50 lines of identical token limit logic in `base_processor.py` and `multimodal_processor.py`

In process_request_dict, the whole block from the _min_non_none definition through the min_tokens validation is copied line for line in both files:

_min_non_none definition
context_remaining computation
max_tokens default/clamp logic
reasoning_max_tokens clamp logic
response_max_tokens clamp logic
min_tokens merge and validation

Consider extracting a protected method in BaseProcessor (or a shared mixin):

def _apply_token_limits(self, request: dict, max_model_len: int) -> None:
    """Apply server-level and context-remaining token limits in-place."""
    def _min_non_none(*values):
        return min(v for v in values if v is not None)

    context_remaining = max(1, max_model_len - len(request["prompt_token_ids"]))
    if request.get("max_tokens") is None:
        if self.max_completion_tokens is not None:
            request["max_tokens"] = max(1, min(context_remaining, self.max_completion_tokens))
        else:
            request["max_tokens"] = context_remaining
    else:
        request["max_tokens"] = _min_non_none(context_remaining, self.max_completion_tokens, request["max_tokens"])

    max_tokens = request["max_tokens"]
    if self.reasoning_max_tokens is not None or request.get("reasoning_max_tokens") is not None:
        request["reasoning_max_tokens"] = _min_non_none(max_tokens, self.reasoning_max_tokens, request.get("reasoning_max_tokens"))
    if self.response_max_tokens is not None or request.get("response_max_tokens") is not None:
        request["response_max_tokens"] = _min_non_none(max_tokens, self.response_max_tokens, request.get("response_max_tokens"))

    server_min, user_min = self.min_completion_tokens, request.get("min_tokens")
    effective_min = max(server_min, user_min) if (server_min is not None and user_min is not None) else (server_min or user_min)
    if effective_min is not None:
        if effective_min > max_tokens:
            raise ValueError(f"min_tokens ({effective_min}) must not exceed max_tokens ({max_tokens})")
        request["min_tokens"] = effective_min

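Both process_request_dict implementations can then call self._apply_token_limits(request, max_model_len) directly, which removes the duplication and avoids fixing only one copy in the future.

🟡 Suggestion 2: `max_tokens` is handled twice in `engine_client.py::format_and_add_data`

Current flow:

format_and_add_data sets max_tokens to max_completion_tokens (when the user did not specify it)
Downstream, data_processor.process_request_dict treats it as user-specified, takes the else branch, and clamps again: _min_non_none(context_remaining, max_completion_tokens, request["max_tokens"])

The result is still correct (equivalent to min(context_remaining, max_completion_tokens)), but the responsibilities are muddled: engine_client does not need to know the semantics of max_completion_tokens. It only needs to keep its original role of falling back to max_model_len - 1 when max_tokens is absent, and the processor can apply the server-level constraints in one place. Consider keeping engine_client's original fallback logic and leaving the application of max_completion_tokens entirely to the processor: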

# engine_client.py  format_and_add_data
if "max_tokens" not in request:
    request["max_tokens"] = self.max_model_len - 1   # still the fallback; the processor does the clamping
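
The equivalence claimed above can be checked with a small sketch. It compares the current flow (engine_client presets max_tokens to max_completion_tokens) with the suggested flow (engine_client falls back to max_model_len - 1), under the assumption that the processor then takes the minimum of the non-None values. The helper names `processor_clamp` and `compare_paths` are hypothetical.

```python
from typing import Optional, Tuple


def processor_clamp(max_tokens: Optional[int], max_completion_tokens: Optional[int], context_remaining: int) -> int:
    """Minimum of the non-None values, as in the explicit-max_tokens branch."""
    return min(v for v in (max_tokens, max_completion_tokens, context_remaining) if v is not None)


def compare_paths(max_model_len: int, prompt_len: int, max_completion_tokens: int) -> Tuple[int, int]:
    """Return (current flow result, suggested flow result) for a request without max_tokens."""
    context_remaining = max(1, max_model_len - prompt_len)
    # Current flow: engine_client presets max_tokens to the server value.
    preset = processor_clamp(max_completion_tokens, max_completion_tokens, context_remaining)
    # Suggested flow: engine_client only falls back to max_model_len - 1.
    fallback = processor_clamp(max_model_len - 1, max_completion_tokens, context_remaining)
    return preset, fallback


print(compare_paths(8192, 3000, 1024))  # (1024, 1024)
print(compare_paths(8192, 8000, 1024))  # (192, 192)
```

Both flows give the same number for these inputs, so the suggestion changes where the server-level cap is applied, not the resulting limit.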

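❓ Question 3: the `value != "None"` string comparison in `ServingLimitsConfig.__init__`

for key, value in args.items():
    if hasattr(self, key) and value != "None":   # ← the string "None", not Python None
        setattr(self, key, value)

For a Python None passed in by CLI argparse: None != "None" is True, so setattr(self, key, None) is called. That matches the default value and causes no real harm, but the semantics are confusing. For a boolean False: False != "None" is True, so the behavior is correct.

If this is meant to handle the string "None" coming from YAML/JSON config files, please add a code comment stating that intent.
If it only serves the CLI case, consider changing it to: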

if hasattr(self, key) and value is not None:
    setattr(self, key, value)
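
To make the difference between the two predicates concrete, here is a small standalone sketch. `apply_args` and the default values are hypothetical and only mimic the loop shown above; they are not taken from ServingLimitsConfig.

```python
from typing import Any, Callable, Dict


def apply_args(defaults: Dict[str, Any], args: Dict[str, Any], keep: Callable[[Any], bool]) -> Dict[str, Any]:
    """Copy args over defaults for known keys whenever keep(value) is true."""
    out = dict(defaults)
    for key, value in args.items():
        if key in out and keep(value):
            out[key] = value
    return out


# Hypothetical defaults: one field defaults to None, one to True.
defaults = {"max_completion_tokens": None, "truncate_prompt_tokens": True}
args = {"max_completion_tokens": "None", "truncate_prompt_tokens": None}

string_check = apply_args(defaults, args, lambda v: v != "None")
identity_check = apply_args(defaults, args, lambda v: v is not None)

print(string_check)    # {'max_completion_tokens': None, 'truncate_prompt_tokens': None}
print(identity_check)  # {'max_completion_tokens': 'None', 'truncate_prompt_tokens': True}
```

With the string comparison, the literal "None" is skipped but a Python None overwrites a non-None default. With the identity check it is the other way around, so the right choice depends on which inputs the config is expected to receive.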

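📝 PR convention check

The title format [Feature] Add server-level token limits and prompt truncation control follows the convention, and the tag is correct.

The description is structurally complete; Motivation/Modifications/Usage or Command/Accuracy Tests all have substantive content.

The only issue: [ ] Add unit tests in the Checklist is unchecked, but the PR adds several unit test suites such as test_text_processor.py and test_multimodal_processor.py, so it should be checked [x].

Overall assessment

The feature design is sound, the configuration layering is clear, the CLI/Config/Processor layers are kept in sync, and test coverage is thorough. The main improvements are removing the roughly 50 lines of token limit logic duplicated between base_processor and multimodal_processor, and clarifying the responsibility boundary between engine_client and data_processor.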

luukunn added 4 commits May 15, 2026 17:11

Add length control parameters

57e7f26

Rename parameters

97b6cb4

Update parameter validation

219b640

add docs

49549ca

Copilot AI review requested due to automatic review settings May 18, 2026 06:33

luukunn had a problem deploying to Metax_ci May 18, 2026 06:33 — with GitHub Actions Error

Copilot started reviewing on behalf of luukunn May 18, 2026 06:34 View session

Copilot AI reviewed May 18, 2026

View reviewed changes

Comment thread fastdeploy/input/base_processor.py Outdated

Comment thread fastdeploy/input/multimodal_processor.py Outdated

luukunn changed the title ~~Length~~ [Feature] Add server-level token length defaults and input token limit May 18, 2026

fix default value

a9076a3

luukunn had a problem deploying to Metax_ci May 18, 2026 06:51 — with GitHub Actions Error

fix review

260b109

Copilot AI review requested due to automatic review settings May 18, 2026 07:17

luukunn had a problem deploying to Metax_ci May 18, 2026 07:18 — with GitHub Actions Error

Copilot started reviewing on behalf of luukunn May 18, 2026 07:18 View session

fix error messages

bd93e46

luukunn had a problem deploying to Metax_ci May 18, 2026 07:32 — with GitHub Actions Error

add truncate_prompt_tokens

3b6d5a3

Copilot AI review requested due to automatic review settings May 18, 2026 07:54

luukunn had a problem deploying to Metax_ci May 18, 2026 07:54 — with GitHub Actions Failure

Copilot started reviewing on behalf of luukunn May 18, 2026 07:54 View session

Copilot AI reviewed May 18, 2026

View reviewed changes

add unit test & fix review

5840abb

luukunn had a problem deploying to Metax_ci May 19, 2026 03:56 — with GitHub Actions Failure

Merge branch 'develop' into length

12230a5

Copilot AI review requested due to automatic review settings May 19, 2026 06:05

luukunn had a problem deploying to Metax_ci May 19, 2026 06:05 — with GitHub Actions Failure

Copilot started reviewing on behalf of luukunn May 19, 2026 06:05 View session

update processor

797e6ae

luukunn had a problem deploying to Metax_ci May 19, 2026 06:37 — with GitHub Actions Failure

luukunn added 2 commits May 19, 2026 15:30

Merge branch 'upstream/develop': remove processor.py as per upstream

77d5168

remove test_processor.py

b74b9e6

Copilot AI review requested due to automatic review settings May 19, 2026 07:33

luukunn temporarily deployed to Metax_ci May 19, 2026 07:33 — with GitHub Actions Inactive

Copilot started reviewing on behalf of luukunn May 19, 2026 07:34 View session

Copilot AI reviewed May 19, 2026

View reviewed changes

luukunn added 3 commits May 19, 2026 15:52

fix doc

2e5c28a

fix

ba212a6

fix unit test

a402104

luukunn temporarily deployed to Metax_ci May 19, 2026 09:17 — with GitHub Actions Inactive

luukunn changed the title ~~[Feature] Add server-level token length defaults and input token limit~~ [Feature] Add server-level token limits and prompt truncation control May 19, 2026

fix unit test

6e1fa26

Copilot AI review requested due to automatic review settings May 19, 2026 11:43

luukunn temporarily deployed to Metax_ci May 19, 2026 11:43 — with GitHub Actions Inactive

Copilot started reviewing on behalf of luukunn May 19, 2026 11:44 View session

Copilot AI reviewed May 19, 2026

View reviewed changes

PaddlePaddle-bot reviewed May 19, 2026

View reviewed changes

luukunn requested review from Jiang-Jia-Jun and LiqinruiG May 19, 2026 12:17

4 participants
