[Feature] Add server-level token limits and prompt truncation control#7842

Open
luukunn wants to merge 19 commits into PaddlePaddle:develop from luukunn:length

luukunn (Collaborator) commented May 18, 2026

Motivation

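This PR adds unified server-level defaults for the length parameters, so that generation-length behavior can be controlled through server configuration even when a request does not pass the request-level parameters explicitly. It also adds a limit on input token length so that overlong requests are rejected early.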

Modifications

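  • Add a server-level length-control config, ServingLimitsConfig, attached to FDConfig so the limits are managed in one place.
  • Add the following server-level parameters to the CLI / config options:
    • max_completion_tokens
    • reasoning_max_tokens
    • response_max_tokens
    • min_completion_tokens
    • input_max_tokens
    • truncate_prompt_tokens
  • During initialization of async_llm, common_engine, and engine_client, inject the server-level default length configuration into data_processor.
  • Update the text and multimodal request-processing logic:
    • When a request does not explicitly specify max_tokens, the server-level max_completion_tokens is used by default, bounded by the remaining context length;
    • When a request explicitly specifies max_tokens, the value is bounded by both the server-level cap and the remaining context length;
    • reasoning_max_tokens / response_max_tokens are constrained to not exceed the final effective max_tokens;
    • min_tokens follows the max(server_value, request_value) rule, and an error is raised immediately if it exceeds max_tokens.
  • Add input_max_tokens validation:
    • The input length is checked before the prompt is truncated;
    • When the number of input tokens exceeds input_max_tokens, the request is rejected outright.
  • Add the truncate_prompt_tokens policy:
    • Enabled by default: a prompt that exceeds max_model_len is truncated automatically;
    • When disabled: an error is raised if the prompt exceeds max_model_len.
  • Adjust how engine / engine_client handle the default max_tokens:
    • If max_completion_tokens is configured, it is used as the default generation length;
    • Otherwise, the original default behavior based on max_model_len is kept.
  • Update the English and Chinese parameter docs:
    • docs/parameters.md
    • docs/zh/parameters.md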
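
The merge rules listed above can be sketched as three small pure functions. This is only an illustration of the described semantics, not the code in this PR: the names `resolve_max_tokens`, `clamp_sub_limit`, and `resolve_min_tokens` are hypothetical, and the floor of one token on the remaining context is an assumption.

```python
from typing import Optional


def resolve_max_tokens(request_value: Optional[int], server_cap: Optional[int],
                       prompt_len: int, max_model_len: int) -> int:
    """Effective max_tokens: the smallest of the limits that are set."""
    # Assumption: at least one token of generation budget is always left.
    context_remaining = max(1, max_model_len - prompt_len)
    candidates = (context_remaining, server_cap, request_value)
    return min(v for v in candidates if v is not None)


def clamp_sub_limit(request_value: Optional[int], server_value: Optional[int],
                    max_tokens: int) -> Optional[int]:
    """reasoning_max_tokens / response_max_tokens never exceed max_tokens."""
    values = [v for v in (request_value, server_value) if v is not None]
    if not values:
        return None  # neither side set a limit
    return min(max_tokens, *values)


def resolve_min_tokens(request_value: Optional[int], server_value: Optional[int],
                       max_tokens: int) -> Optional[int]:
    """min_tokens is the larger of the two values and must fit in max_tokens."""
    values = [v for v in (request_value, server_value) if v is not None]
    if not values:
        return None
    effective = max(values)
    if effective > max_tokens:
        raise ValueError(
            f"min_tokens ({effective}) must not exceed max_tokens ({max_tokens})"
        )
    return effective
```

For example, with max_model_len = 8192, a 100-token prompt, and a server cap of 1024, a request asking for max_tokens = 2048 resolves to 1024; with an 8000-token prompt and no request value, the remaining context (192) wins.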

Usage or Command

Example launch arguments:

--max-completion-tokens 1024 \
--reasoning-max-tokens 512 \
--response-max-tokens 512 \
--min-completion-tokens 1 \
--input-max-tokens 4096 \
--truncate-prompt-tokens

To disable automatic truncation of overlong prompts, use:

--no-truncate-prompt-tokens
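
A paired `--truncate-prompt-tokens` / `--no-truncate-prompt-tokens` switch like the one above can be declared with the standard library's `argparse.BooleanOptionalAction` (Python 3.9+). The sketch below is a standalone illustration; whether FastDeploy registers the flag this way is an assumption, and the `parse_flag` helper is hypothetical.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--truncate-prompt-tokens",
    # BooleanOptionalAction also registers the --no-truncate-prompt-tokens form.
    action=argparse.BooleanOptionalAction,
    default=True,
    help="Truncate prompts that exceed max_model_len (default: enabled).",
)


def parse_flag(argv):
    """Hypothetical helper: return the parsed boolean for a given argv list."""
    return parser.parse_args(argv).truncate_prompt_tokens
```

With this declaration the bare flag needs no value, and omitting both forms keeps the default.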

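Behavior:

  • When a request does not specify max_tokens, the server-level max_completion_tokens is used by default, bounded by the remaining context length;
  • When a request specifies max_tokens, the final value is limited to min(request value, server-level cap, remaining context length);
  • When a request does not specify reasoning_max_tokens / response_max_tokens, the server-level defaults can be used;
  • The final values of reasoning_max_tokens / response_max_tokens never exceed max_tokens;
  • The final min_tokens is the larger of the server-side setting and the request value; an error is raised immediately if it exceeds max_tokens;
  • When the number of input prompt tokens exceeds input_max_tokens, the request is rejected outright;
  • When the input exceeds max_model_len:
    • If truncate_prompt_tokens is enabled, the prompt is truncated automatically;
    • If it is disabled, an error is raised.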
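
The two prompt-length checks described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `check_prompt_length` is hypothetical, and keeping the first `max_model_len - 1` tokens when truncating is an assumption about the truncation rule, which this description does not spell out.

```python
from typing import List, Optional


def check_prompt_length(prompt_token_ids: List[int], max_model_len: int,
                        input_max_tokens: Optional[int] = None,
                        truncate_prompt_tokens: bool = True) -> List[int]:
    """Return the (possibly truncated) prompt, or raise if it is not allowed."""
    n = len(prompt_token_ids)

    # 1. input_max_tokens is checked first, before any truncation happens.
    if input_max_tokens is not None and n > input_max_tokens:
        raise ValueError(
            f"prompt has {n} tokens, exceeding input_max_tokens ({input_max_tokens})"
        )

    # 2. Prompts longer than the model context are truncated or rejected.
    if n > max_model_len:
        if not truncate_prompt_tokens:
            raise ValueError(
                f"prompt has {n} tokens, exceeding max_model_len ({max_model_len})"
            )
        # Assumption: keep the head of the prompt and leave room for one new token.
        return prompt_token_ids[: max_model_len - 1]

    return prompt_token_ids
```

Checking input_max_tokens before truncating matters: if the order were reversed, an overlong prompt could be cut down first and then slip past the input limit.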

Accuracy Tests

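This PR does not change model forward-computation logic or kernel behavior, so it has no impact on accuracy and no accuracy tests are needed.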

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

Copilot AI review requested due to automatic review settings May 18, 2026 06:33

paddle-bot Bot commented May 18, 2026

Thanks for your contribution!

Contributor

Copilot AI left a comment

Pull request overview

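This PR introduces several "default token length limit" settings on the server side (max_completion_tokens / reasoning_max_tokens / response_max_tokens / min_completion_tokens / input_max_tokens), which can be set as server-level defaults through the CLI. When a request does not carry the corresponding field, these defaults are used; requests that exceed input_max_tokens are rejected.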

Changes:

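  • Add 5 length-related parameters to EngineArgs/ModelConfig and expose them in the CLI and the docs
  • Add set_server_defaults to BaseDataProcessor, and call it at each entry point (engine_client / async_llm / engine / common_engine) to sync the server defaults
  • Add "reject overlong input" logic and "take the minimum of user value / server default / context limit" merge logic to base_processor.py and multimodal_processor.py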

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| fastdeploy/engine/args_utils.py | Adds 5 server-level token-length parameters and the corresponding CLI options |
| fastdeploy/config.py | ModelConfig initializes the new fields (default None / 1) to accept the new parameters |
| fastdeploy/input/base_processor.py | Adds set_server_defaults and the length merge/reject logic in process_request_dict |
| fastdeploy/input/multimodal_processor.py | Adds the same length merge/reject logic to the multimodal processing flow |
| fastdeploy/entrypoints/engine_client.py | Calls set_server_defaults, and falls back to max_completion_tokens when max_tokens is missing |
| fastdeploy/engine/engine.py | Same as above: injects the server defaults and prefers max_completion_tokens as the default |
| fastdeploy/engine/common_engine.py | Injects the server defaults after creating data_processor |
| fastdeploy/engine/async_llm.py | Injects the server defaults after creating data_processor |
| docs/parameters.md / docs/zh/parameters.md | Docs updated with descriptions of the 5 new parameters |

Comment thread fastdeploy/input/base_processor.py Outdated
Comment thread fastdeploy/input/multimodal_processor.py Outdated
@luukunn luukunn changed the title from "Length" to "[Feature] Add server-level token length defaults and input token limit" May 18, 2026
PaddlePaddle-bot

This comment was marked as outdated.

PaddlePaddle-bot commented May 18, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-19 21:45:19

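The CI report is generated from the following code (updated every 30 minutes):


1 Task overview

All Required tasks passed ✅; merging is recommended.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
|---|---|---|---|---|---|---|
| 46 (0) | 46 | 45 | 0 | 0 | 1 | 0 |

2 Task status summary

2.1 Required tasks: 10/10 passed

Required tasks block merging; failures should be handled first.

| Status | Task | Duration | Root cause | Suggested fix | Log | Rerun |
|---|---|---|---|---|---|---|
|  | The other 10 required tasks passed | - | - | - | - | - |

2.2 Optional tasks: 35/36 passed

Optional tasks do not block merging; failures are for reference only.

| Status | Task | Duration | Log | Rerun |
|---|---|---|---|---|
| ⏸️ | CI_HPU | - | - | - |
|  | The other 35 optional tasks passed | - | - | - |

3 Failure details (required only)

No required tasks failed.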

Copilot AI review requested due to automatic review settings May 18, 2026 07:17

Copilot AI review requested due to automatic review settings May 18, 2026 07:54
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.

Comment thread fastdeploy/input/base_processor.py Outdated
Comment thread fastdeploy/input/multimodal_processor.py Outdated
Comment on lines +830 to +837
model_group.add_argument(
    "--truncate-prompt-tokens",
    type=lambda x: x.lower() in ("true", "1", "yes"),
    default=EngineArgs.truncate_prompt_tokens,
    help="Whether to truncate prompts that exceed max_model_len. "
    "If True (default), prompts are silently truncated. "
    "If False, a ValueError is raised.",
)
Comment thread fastdeploy/entrypoints/engine_client.py
Comment thread fastdeploy/input/base_processor.py Outdated
Copilot AI review requested due to automatic review settings May 19, 2026 06:05

This comment was marked as resolved.

Copilot AI review requested due to automatic review settings May 19, 2026 07:33
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated no new comments.

@luukunn luukunn changed the title from "[Feature] Add server-level token length defaults and input token limit" to "[Feature] Add server-level token limits and prompt truncation control" May 19, 2026
Copilot AI review requested due to automatic review settings May 19, 2026 11:43
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated no new comments.

@PaddlePaddle-bot PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-05-19 19:53:45

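📋 Review summary

PR overview: adds server-level token length limits and prompt truncation control, supporting six server-level parameters: max_completion_tokens, reasoning_max_tokens, response_max_tokens, min_completion_tokens, input_max_tokens, and truncate_prompt_tokens.

Scope of changes: fastdeploy/config.py, engine/args_utils.py, engine/{async_llm,common_engine,engine}.py, entrypoints/engine_client.py, input/{base_processor,multimodal_processor}.py, docs/

Impact tags: [FDConfig] [Engine] [APIServer] [DataProcessor] [Feature] [Docs]

Issues

| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | fastdeploy/input/base_processor.py & multimodal_processor.py | About 50 lines of token-limit logic are duplicated verbatim and should be extracted into the base class |
| 🟡 Suggestion | fastdeploy/entrypoints/engine_client.py | format_and_add_data pre-sets max_tokens=max_completion_tokens, so the processor clamps a second time; responsibilities are mixed |
| ❓ Question | fastdeploy/config.py | ServingLimitsConfig.__init__ uses the string comparison value != "None" rather than value is not None |
| 📝 PR convention | Checklist | [ ] Add unit tests is unchecked although the PR does add a large number of unit tests |

🟡 Suggestion 1: about 50 lines of identical token-limit logic in base_processor.py and multimodal_processor.py

In process_request_dict, the whole block from the definition of _min_non_none through the min_tokens validation is copied line for line in both files:

  • _min_non_none definition
  • context_remaining computation
  • max_tokens default/clamp logic
  • reasoning_max_tokens clamp logic
  • response_max_tokens clamp logic
  • min_tokens merge and validation

Consider extracting a protected method in BaseProcessor (or a shared mixin):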

def _apply_token_limits(self, request: dict, max_model_len: int) -> None:
    """Apply server-level and context-remaining token limits in-place."""
    def _min_non_none(*values):
        return min(v for v in values if v is not None)

    context_remaining = max(1, max_model_len - len(request["prompt_token_ids"]))
    if request.get("max_tokens") is None:
        if self.max_completion_tokens is not None:
            request["max_tokens"] = max(1, min(context_remaining, self.max_completion_tokens))
        else:
            request["max_tokens"] = context_remaining
    else:
        request["max_tokens"] = _min_non_none(context_remaining, self.max_completion_tokens, request["max_tokens"])

    max_tokens = request["max_tokens"]
    if self.reasoning_max_tokens is not None or request.get("reasoning_max_tokens") is not None:
        request["reasoning_max_tokens"] = _min_non_none(max_tokens, self.reasoning_max_tokens, request.get("reasoning_max_tokens"))
    if self.response_max_tokens is not None or request.get("response_max_tokens") is not None:
        request["response_max_tokens"] = _min_non_none(max_tokens, self.response_max_tokens, request.get("response_max_tokens"))

    server_min, user_min = self.min_completion_tokens, request.get("min_tokens")
    effective_min = max(server_min, user_min) if (server_min is not None and user_min is not None) else (server_min or user_min)
    if effective_min is not None:
        if effective_min > max_tokens:
            raise ValueError(f"min_tokens ({effective_min}) must not exceed max_tokens ({max_tokens})")
        request["min_tokens"] = effective_min

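Both process_request_dict implementations can then simply call self._apply_token_limits(request, max_model_len), which removes the duplication and avoids fixing one copy while missing the other in the future.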


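🟡 Suggestion 2: max_tokens is handled twice in engine_client.py::format_and_add_data

Current flow:

  1. format_and_add_data sets max_tokens to max_completion_tokens (when the user did not specify it)
  2. Downstream, data_processor.process_request_dict treats it as "user-specified", enters the else branch, and clamps again: _min_non_none(context_remaining, max_completion_tokens, request["max_tokens"])

The result is still correct (equivalent to min(context_remaining, max_completion_tokens)), but the responsibilities are muddled: engine_client has no need to know the semantics of max_completion_tokens. It only needs to keep its original role of falling back to max_model_len - 1 when max_tokens is absent, leaving the processor to apply the server-level constraints in one place. Suggest keeping engine_client's original fallback logic and leaving the application of max_completion_tokens entirely to the processor:

# engine_client.py  format_and_add_data
if "max_tokens" not in request:
    request["max_tokens"] = self.max_model_len - 1   # still the fallback; the processor does the clamping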
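
The claim that both flows end up with the same effective value can be checked with a small model of the clamp. This is a sketch only, assuming the processor computes context_remaining as max(1, max_model_len - prompt_len), that max_completion_tokens is configured, and that the prompt has at least one token; all function names here are hypothetical.

```python
def processor_clamp(max_tokens: int, max_completion_tokens: int,
                    prompt_len: int, max_model_len: int) -> int:
    """Model of the processor's else branch: min over all three limits."""
    context_remaining = max(1, max_model_len - prompt_len)
    return min(context_remaining, max_completion_tokens, max_tokens)


def current_flow(max_completion_tokens: int, prompt_len: int, max_model_len: int) -> int:
    # engine_client pre-fills max_tokens with max_completion_tokens.
    return processor_clamp(max_completion_tokens, max_completion_tokens,
                           prompt_len, max_model_len)


def suggested_flow(max_completion_tokens: int, prompt_len: int, max_model_len: int) -> int:
    # engine_client only falls back to max_model_len - 1.
    return processor_clamp(max_model_len - 1, max_completion_tokens,
                           prompt_len, max_model_len)


def flows_agree(max_model_len: int = 64) -> bool:
    """Exhaustively compare both flows for small caps and prompt lengths."""
    return all(
        current_flow(cap, n, max_model_len) == suggested_flow(cap, n, max_model_len)
        for cap in range(1, 2 * max_model_len)
        for n in range(1, 2 * max_model_len)
    )
```

Under these assumptions the two flows agree because context_remaining is never larger than max_model_len - 1 once the prompt contains a token, so the fallback value never becomes the binding limit.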

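❓ Question 3: the string comparison value != "None" in ServingLimitsConfig.__init__

for key, value in args.items():
    if hasattr(self, key) and value != "None":   # ← the string "None", not Python None
        setattr(self, key, value)

For a Python None passed in by CLI argparse, None != "None" is True, so setattr(self, key, None) is called; this matches the default value and does no real harm, but the semantics are confusing. For a boolean False, False != "None" is True, so the behavior is correct.

  • If this is meant to handle the string "None" coming from YAML/JSON config files, please add a code comment stating that intent.
  • If it only serves the CLI case, consider changing it to: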
if hasattr(self, key) and value is not None:
    setattr(self, key, value)
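
The difference between the two checks can be seen on a toy config object. `LimitsSketch` below is a hypothetical stand-in, not the real ServingLimitsConfig; its default values are assumptions chosen only to make the difference visible.

```python
class LimitsSketch:
    """Hypothetical stand-in for a limits config with one non-None default."""

    def __init__(self, args: dict, use_identity_check: bool):
        self.max_completion_tokens = None
        self.min_completion_tokens = 1  # assumed non-None default

        for key, value in args.items():
            if not hasattr(self, key):
                continue
            if use_identity_check:
                keep = value is not None   # skips Python None only
            else:
                keep = value != "None"     # skips the string "None" only
            if keep:
                setattr(self, key, value)
```

With the string comparison, a Python None overwrites the default of 1; with the identity check the default survives, but a literal string "None" from a config file would be stored as-is. Which behavior is wanted depends on where the values come from.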

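📝 PR convention check

The title format [Feature] Add server-level token limits and prompt truncation control follows the convention, and the tag is correct.

The description is structurally complete; Motivation / Modifications / Usage or Command / Accuracy Tests all have substantive content.

The only issue: [ ] Add unit tests in the Checklist is unchecked, but the PR actually adds several unit-test suites such as test_text_processor.py and test_multimodal_processor.py, so it should be checked as [x].

Overall assessment

The feature design is sound, the configuration layering is clear, the CLI / Config / Processor layers are kept in sync, and test coverage is thorough. The main improvements are removing the roughly 50 lines of token-limit logic duplicated between base_processor and multimodal_processor, and clarifying the responsibility boundary between engine_client and data_processor.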

4 participants