
[Feature] Add Deterministic Inference Support #6476

Open
gongweibao wants to merge 41 commits into PaddlePaddle:develop from gongweibao:deter

Conversation

@gongweibao commented Feb 13, 2026

[Feature] Add Deterministic Inference Support

Motivation

Implement deterministic inference support for FastDeploy to ensure reproducible results across multiple runs. Deterministic inference is critical for:

  • Debugging and testing models
  • Reproducing results in production
  • Ensuring consistency in distributed inference scenarios

The implementation addresses non-determinism sources in:

  1. All-Reduce operations in Tensor Parallelism (NCCL floating-point accumulation order)
  2. Batch-invariant operations (matrix multiplication, log_softmax, mean)
  3. Chunked Prefill alignment
  4. FlashAttention backend
  5. Sampling parameters seed management
  6. Scheduler request stealing
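
Source 1 above stems from floating-point addition being non-associative: reducing the same values in a different order can change the low bits of the result, which is why NCCL's dynamically chosen accumulation order is non-deterministic. A minimal Python illustration of the numeric effect (not FastDeploy code):

```python
# Floating-point addition is not associative, so the order in which an
# all-reduce accumulates partial sums changes the low bits of the result.
a, b, c = 0.1, 0.2, 0.3

left_to_right = (a + b) + c   # accumulate rank 0, then 1, then 2
right_to_left = a + (b + c)   # a different (but equally valid) order

print(left_to_right == right_to_left)  # False
```

Either order is a correct sum to within rounding, but only fixing one order makes the result reproducible across runs.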

Modifications

Core Implementation

| File | Description |
| --- | --- |
| fastdeploy/envs.py | Added FD_DETERMINISTIC_MODE environment variable |
| fastdeploy/__init__.py | Auto-initialize custom all-reduce in deterministic mode |
| fastdeploy/distributed/communication.py | Add deterministic mode checks and custom all-reduce integration |
| fastdeploy/engine/common_engine.py | Add deterministic mode support |
| fastdeploy/engine/sampling_params.py | Add deterministic parameter for sampling |
| fastdeploy/engine/sched/resource_manager_v1.py | Add deterministic alignment logic |
| fastdeploy/model_executor/layers/attention/flash_attn_backend.py | Add deterministic mode support for FlashAttention |
| fastdeploy/model_executor/layers/batch_invariant_ops/batch_invariant_ops.py | Enhance batch-invariant operations |
| fastdeploy/model_executor/models/qwen2.py | Add deterministic support for Qwen2 model |
| fastdeploy/scheduler/splitwise_scheduler.py | Remove random request stealing in deterministic mode |
| fastdeploy/worker/gpu_model_runner.py | Add deterministic mode handling |

Key Features

  1. Custom All-Reduce for Deterministic TP: Forces custom all-reduce in deterministic mode with fixed accumulation order (unlike NCCL's dynamic algorithm)
  2. Batch-Invariant Operations: Triton-based implementations for matmul, log_softmax, and mean
  3. Chunked Prefill Alignment: Ensures truncation points align with split_kv_size integer multiples
  4. Deterministic Sampling: Seed-based sampling for reproducible results
  5. Error Handling: Explicit RuntimeErrors when deterministic requirements cannot be met
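
Feature 3 (chunked prefill alignment) amounts to rounding each truncation point down so it lands on an integer multiple of split_kv_size. A minimal sketch of that arithmetic; the helper name is hypothetical, not FastDeploy's actual API:

```python
def align_chunk(truncation_point: int, split_kv_size: int) -> int:
    """Hypothetical helper: round a chunked-prefill truncation point down
    to the nearest integer multiple of split_kv_size, so every run splits
    the KV cache at identical boundaries."""
    if split_kv_size <= 0:
        raise ValueError("split_kv_size must be positive")
    return (truncation_point // split_kv_size) * split_kv_size

print(align_chunk(1000, 256))  # 768
print(align_chunk(512, 256))   # 512
```

Aligning all chunk boundaries this way keeps the attention kernel's reduction order identical across runs regardless of how requests are batched.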

Usage or Command

Enable Deterministic Mode

```shell
export FD_DETERMINISTIC_MODE=1
```

Run Inference with Determinism

```python
from fastdeploy import LLM

llm = LLM(...)
result = llm.generate(...)  # Automatically uses deterministic all-reduce
```
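
Feature 4 (seed-based sampling) can be illustrated in plain Python, independent of FastDeploy: seeding a private RNG per request makes the sampled token sequence reproducible without relying on global RNG state. The helper below is a hypothetical sketch, not the PR's actual sampler:

```python
import random

def sample_tokens(seed: int, vocab_size: int = 100, n: int = 5) -> list:
    # A per-request RNG seeded from SamplingParams makes sampling
    # reproducible and independent of other concurrent requests.
    rng = random.Random(seed)
    return [rng.randrange(vocab_size) for _ in range(n)]

run1 = sample_tokens(seed=42)
run2 = sample_tokens(seed=42)
print(run1 == run2)  # True: same seed, identical samples
```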

Run Tests

```shell
# All-reduce determinism test (requires 2+ GPUs)
python -m paddle.distributed.launch --gpus=0,1 tests/distributed/test_deterministic_all_reduce.py

# Batch-invariant operations test
python tests/batch_invariant_ops/test_batch_invariant_ops.py

# Sampling parameters determinism test
python tests/engine/test_sampling_params_determinism.py
```

Accuracy Tests

All unit tests pass:

  • Batch-invariant operations: 8 tests, 100% pass rate
  • Cache manager: 90 tests, 100% pass rate
  • Sampling parameters: 50 tests, 100% pass rate
  • Scheduler (local/dp): 42 tests, 100% pass rate
  • All-reduce determinism: Verified deterministic for float32/float16/bfloat16

Determinism Verification Results

```text
======================================================================
Summary
======================================================================
Data Type | Custom AR Deterministic | NCCL Deterministic
----------------------------------------------------------------------
float32   | YES                     | NO
float16   | YES                     | NO
bfloat16  | YES                     | NO
======================================================================
Custom All-Reduce is deterministic for all supported types!
======================================================================
```

Checklist

  • Add at least a tag in the PR title.
  • Format your code, run pre-commit before commit.
  • Add unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch first.

@CLAassistant commented Feb 13, 2026

CLA assistant check
All committers have signed the CLA.

@paddle-bot bot commented Feb 13, 2026

Thanks for your contribution!

…manager

Add comprehensive determinism tests for Paddle attention layer and refactor
resource manager for deterministic mode support.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@codecov-commenter commented Feb 24, 2026

Codecov Report

❌ Patch coverage is 33.67876% with 128 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@1405d7d). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| fastdeploy/worker/gpu_model_runner.py | 10.61% | 95 Missing and 6 partials ⚠️ |
| fastdeploy/distributed/communication.py | 50.00% | 10 Missing and 2 partials ⚠️ |
| fastdeploy/engine/sched/resource_manager_v1.py | 86.66% | 0 Missing and 4 partials ⚠️ |
| fastdeploy/worker/worker_process.py | 0.00% | 3 Missing and 1 partial ⚠️ |
| fastdeploy/worker/input_batch.py | 25.00% | 3 Missing ⚠️ |
| fastdeploy/envs.py | 50.00% | 1 Missing and 1 partial ⚠️ |
| .../layers/batch_invariant_ops/batch_invariant_ops.py | 80.00% | 2 Missing ⚠️ |
Additional details and impacted files
```text
@@            Coverage Diff             @@
##             develop    #6476   +/-   ##
==========================================
  Coverage           ?   69.61%
==========================================
  Files              ?      392
  Lines              ?    53572
  Branches           ?     8410
==========================================
  Hits               ?    37293
  Misses             ?    13552
  Partials           ?     2727
```

| Flag | Coverage Δ |
| --- | --- |
| GPU | 69.61% <33.67%> (?) |


Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 26 out of 26 changed files in this pull request and generated 9 comments.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>