Skip to content

[TRTLLM-9523][chore] Refactor the transfer logic (step 6)#12231

Merged
Shixiaowei02 merged 2 commits intoNVIDIA:mainfrom
Shixiaowei02:feat/v2_adapt_4
Mar 20, 2026
Merged

[TRTLLM-9523][chore] Refactor the transfer logic (step 6)#12231
Shixiaowei02 merged 2 commits intoNVIDIA:mainfrom
Shixiaowei02:feat/v2_adapt_4

Conversation

@Shixiaowei02
Copy link
Collaborator

@Shixiaowei02 Shixiaowei02 commented Mar 16, 2026

Summary by CodeRabbit

Release Notes

  • New Features

    • Enhanced decision logic for controlling data transfer based on network topology and duplication factors.
    • Improved session management with refined status semantics for transfer operations.
  • Bug Fixes

    • Fixed unclosed ZMQ DEALER socket.
    • Aligned Sender/Receiver handling of kv_tasks.
    • Added more try-catch to threads.
    • Restored lost inheritance from the Sender/Receiver.
    • Fixed type issues (None passed to int, __del__ called after ctor failure).
    • Adjusted locks in ReqInfoManager and Receiver to avoid deadlock.
  • Refactor

    • Restructured transfer orchestration and session management for improved modularity and maintainability.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@Shixiaowei02
Copy link
Collaborator Author

/bot help

@github-actions
Copy link

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental) --high-priority]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

--high-priority (OPTIONAL) : Run the pipeline with high priority. This option is restricted to authorized users only and will route the job to a high-priority queue.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@Shixiaowei02
Copy link
Collaborator Author

/bot run --add-multi-gpu-test --disable-fail-fast

@Shixiaowei02
Copy link
Collaborator Author

/bot run --add-multi-gpu-test --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #39014 [ run ] triggered by Bot. Commit: 81c5775 Link to invocation

@tensorrt-cicd
Copy link
Collaborator

PR_Github #39016 [ run ] triggered by Bot. Commit: 81c5775 Link to invocation

@Shixiaowei02 Shixiaowei02 force-pushed the feat/v2_adapt_4 branch 2 times, most recently from bb63e6c to a27265e Compare March 16, 2026 07:40
@Shixiaowei02 Shixiaowei02 changed the title [TRTLLM-9523][feat] Additional adaptation to manager v2 (step 6) [TRTLLM-9523][chore] Refactor the transfer logic (step 6) Mar 16, 2026
@Shixiaowei02 Shixiaowei02 marked this pull request as ready for review March 16, 2026 09:28
@Shixiaowei02 Shixiaowei02 requested a review from a team as a code owner March 16, 2026 09:28
@Shixiaowei02
Copy link
Collaborator Author

/bot run --add-multi-gpu-test --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #39073 [ run ] triggered by Bot. Commit: 6790bfb Link to invocation

@Shixiaowei02 Shixiaowei02 changed the title [TRTLLM-9523][chore] Refactor the transfer logic (step 6) [TRTLLM-9523][fix] Fix and enhance the transfer logic (step 6) Mar 16, 2026
@Shixiaowei02 Shixiaowei02 changed the title [TRTLLM-9523][fix] Fix and enhance the transfer logic (step 6) [TRTLLM-9523][fix] Fix and refactor the transfer logic (step 6) Mar 16, 2026
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 16, 2026

📝 Walkthrough

Walkthrough

This PR refactors the KV transfer session abstractions and transfer orchestration system. It replaces status enums, introduces new base classes (SessionArgsBase, TxSessionBase, RxSessionBase), implements a task-based transfer model, switches from task IDs to futures-based tracking, and updates endpoint attribute access patterns.

Changes

Cohort / File(s) Summary
Base Session Abstractions
tensorrt_llm/_torch/disaggregation/base/transfer.py
Refactored session lifecycle with new enum values (KV_TRANSFERRED, FULLY_TRANSFERRED); introduced SessionArgsBase container; replaced SessionState with abstract base classes TxSessionBase and RxSessionBase with send()/receive() methods returning Future objects; updated SenderBase and ReceiverBase to placeholder interfaces.
Peer Transfer Logic
tensorrt_llm/_torch/disaggregation/native/peer.py
Added defensive checks and two new decision methods: should_send_kv() and should_send_aux() to control KV and auxiliary data transmission; added guards in get_pool_mapping() and get_kv_map() for null page table and attention policy validation.
Core Transfer Engine
tensorrt_llm/_torch/disaggregation/native/transfer.py
Major architectural refactor from request-based to task-based transfer model; introduced SendTaskBase, KVSendTask, AuxSendTask, WriteMetaType, and AgentResult enums; refactored TxSession and RxSession with status methods and future-based send/receive; rewrote Sender and Receiver with per-session task queues, new result event handlers (KV_AGENT_RESULT, AUX_AGENT_RESULT), and enhanced error handling.
Transceiver Implementation
tensorrt_llm/_torch/disaggregation/native/py_cache_transceiver.py
Renamed internal attributes to public endpoint properties (_rank_info_server.endpointrank_info_server_endpoint); replaced per-request task ID tracking with futures (send_task_ids/recv_task_idssend_futures/recv_futures); updated status transitions to new enum values; refined context info endpoint formatting.
Unit Tests
tests/unittest/disaggregated/test_kv_transfer.py, tests/unittest/disaggregated/test_kv_transfer_mp.py
Updated status assertions to access session.status directly instead of session.state.status; updated enum comparisons (TRANSFERREDKV_TRANSFERRED, AUX_TRANSFERREDFULLY_TRANSFERRED).

Sequence Diagram(s)

sequenceDiagram
    participant App as Application
    participant TxS as TxSession
    participant Sndr as Sender
    participant RxS as RxSession
    participant Rcvr as Receiver
    participant Net as Network
    
    App->>TxS: send(slice)
    TxS->>Sndr: Create KVSendTask
    Sndr->>Sndr: Build WriteMeta with Future
    Sndr->>Net: Send KV data
    Net-->>Sndr: KV_AGENT_RESULT event
    Sndr-->>TxS: Resolve Future
    TxS-->>App: Future.result()
    
    Net->>RxS: Receive KV data
    RxS->>Rcvr: receive(slice)
    Rcvr->>Rcvr: Process data
    Rcvr-->>RxS: Resolve Future
    RxS-->>App: Future.result()
    
    App->>TxS: send_aux()
    TxS->>Sndr: Create AuxSendTask
    Sndr->>Net: Send auxiliary data
    Net-->>Sndr: AUX_AGENT_RESULT event
    Sndr->>TxS: Update session status
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 12.70% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check ⚠️ Warning The pull request description is entirely a template placeholder with no substantive content provided by the author. Fill in the Description section with explanation of the architectural changes, rationale, and impact. Add Test Coverage section documenting which tests validate these changes. Complete all relevant PR Checklist items with specific details.
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The PR title clearly references a specific issue ([TRTLLM-9523]) and describes the work as a refactoring step (step 6) of transfer logic, which aligns with the substantial API refactoring and restructuring of transfer session abstractions across multiple files shown in the changeset.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
📝 Coding Plan
  • Generate coding plan for human review comments

Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

CodeRabbit can scan for known vulnerabilities in your dependencies using OSV Scanner.

OSV Scanner will automatically detect and report security vulnerabilities in your project's dependencies. No additional configuration is required.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/_torch/disaggregation/native/peer.py`:
- Around line 123-126: get_pool_mapping() currently fabricates a synthetic
mapping when either page table (self_pt or peer_pt) is None which leads
Sender._build_kv_write_meta() to iterate that entry and call
extract()/get_kv_map() (and get_kv_map() asserts both page tables exist).
Instead, change the None-page-table branch in get_pool_mapping() to
short-circuit with an empty mapping (e.g., return {} and cache it) so no KV
fragments are produced; also audit other identical branches (the similar branch
around lines 207-208) and apply the same empty-mapping short-circuit so callers
like Sender._build_kv_write_meta() never attempt to extract or call get_kv_map()
when page tables are missing.

In `@tensorrt_llm/_torch/disaggregation/native/py_cache_transceiver.py`:
- Around line 355-365: The check_gen_transfer_status() loop incorrectly treats a
successful recv_futures[request_id].result() (which only represents
recv_session.receive(kv_slice)) as full-session completion; change the logic to
also inspect the session state (use recv_session.status /
RxSession.FULLY_TRANSFERRED) before appending to completed_request_ids so only
sessions with status FULLY_TRANSFERRED are promoted, otherwise append to
failed_request_ids or leave for later; specifically update the branch that
checks sync_status from recv_futures[request_id] to require recv_session.status
== RxSession.FULLY_TRANSFERRED before setting DISAGG_GENERATION_TRANS_COMPLETE
and calling unpack_aux(), and ensure sessions in KV_TRANSFERRED or ERROR are not
treated as completed.

In `@tensorrt_llm/_torch/disaggregation/native/transfer.py`:
- Around line 992-1007: The receiver is dropping the transmitted slice
identifier so all KV responses collapse into index 0; in
_process_kv_agent_result (which calls decode_message and checks
MessageType.KV_AGENT_RESULT and _get_session) extract and convert the slice_id
from the decoded message (the field currently ignored between unique_rid and
is_last_slice_str) and pass that slice index into
session.process_kv_agent_result instead of always resolving index 0; update
RxSession.process_kv_agent_result to use the provided slice_id to resolve the
correct future in _kv_tasks (and apply the same fix to the analogous handling at
the other occurrence around the 1094-1103 region).
- Around line 759-767: The tests and in-tree callers expect send() and receive()
to return the integer index into the _kv_tasks list, but the new implementations
return a concurrent.futures.Future; restore the previous return contract by
returning the slice_id (an int) from send() and receive() while keeping
task.future available on the task object, i.e., after creating task and
appending to self.kv_tasks, return slice_id instead of task.future; apply the
same change for the other implementation block around the 1080-1092 region so
all callers continue to receive an int index.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1dab76b8-8c83-48c5-ab37-586f79d48cbd

📥 Commits

Reviewing files that changed from the base of the PR and between 87658f2 and 6790bfb.

📒 Files selected for processing (6)
  • tensorrt_llm/_torch/disaggregation/base/transfer.py
  • tensorrt_llm/_torch/disaggregation/native/peer.py
  • tensorrt_llm/_torch/disaggregation/native/py_cache_transceiver.py
  • tensorrt_llm/_torch/disaggregation/native/transfer.py
  • tests/unittest/disaggregated/test_kv_transfer.py
  • tests/unittest/disaggregated/test_kv_transfer_mp.py

@Shixiaowei02
Copy link
Collaborator Author

/bot run --add-multi-gpu-test --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #39079 [ run ] triggered by Bot. Commit: 367631c Link to invocation

@tensorrt-cicd
Copy link
Collaborator

PR_Github #39079 [ run ] completed with state SUCCESS. Commit: 367631c
/LLM/main/L0_MergeRequest_PR pipeline #30342 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@Shixiaowei02
Copy link
Collaborator Author

/bot run --add-multi-gpu-test --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #39488 [ run ] triggered by Bot. Commit: 99ceb96 Link to invocation

@Shixiaowei02
Copy link
Collaborator Author

/bot run --add-multi-gpu-test --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #39536 [ run ] triggered by Bot. Commit: 5d63aac Link to invocation

@Shixiaowei02 Shixiaowei02 changed the title [TRTLLM-9523][fix] Fix and refactor the transfer logic (step 6) [TRTLLM-9523][chore] Refactor the transfer logic (step 6) Mar 19, 2026
@tensorrt-cicd
Copy link
Collaborator

PR_Github #39536 [ run ] completed with state SUCCESS. Commit: 5d63aac
/LLM/main/L0_MergeRequest_PR pipeline #30754 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@Shixiaowei02
Copy link
Collaborator Author

/bot run --stage-list "GB200-4_GPUs-PyTorch-1"

@tensorrt-cicd
Copy link
Collaborator

PR_Github #39592 [ run ] triggered by Bot. Commit: 5d63aac Link to invocation

@tensorrt-cicd
Copy link
Collaborator

PR_Github #39592 [ run ] completed with state SUCCESS. Commit: 5d63aac
/LLM/main/L0_MergeRequest_PR pipeline #30805 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

@Shixiaowei02
Copy link
Collaborator Author

/bot skip --comment "CI passed at 39536 and 39592 in the assembled form."

@tensorrt-cicd
Copy link
Collaborator

PR_Github #39594 [ skip ] triggered by Bot. Commit: 5d63aac Link to invocation

@tensorrt-cicd
Copy link
Collaborator

PR_Github #39594 [ skip ] completed with state SUCCESS. Commit: 5d63aac
Skipping testing for commit 5d63aac

Link to invocation

Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
@Shixiaowei02
Copy link
Collaborator Author

/bot run --add-multi-gpu-test --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #39668 [ run ] triggered by Bot. Commit: cf973a2 Link to invocation

@Shixiaowei02 Shixiaowei02 enabled auto-merge (squash) March 20, 2026 06:34
@tensorrt-cicd
Copy link
Collaborator

PR_Github #39668 [ run ] completed with state SUCCESS. Commit: cf973a2
/LLM/main/L0_MergeRequest_PR pipeline #30870 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@Shixiaowei02
Copy link
Collaborator Author

Shixiaowei02 commented Mar 20, 2026

/bot skip --comment "Confirmed with infra, pipeline #39668 actually passed."

@tensorrt-cicd
Copy link
Collaborator

PR_Github #39728 [ skip ] triggered by Bot. Commit: cf973a2 Link to invocation

@tensorrt-cicd
Copy link
Collaborator

PR_Github #39728 [ skip ] completed with state SUCCESS. Commit: cf973a2
Skipping testing for commit cf973a2

Link to invocation

@Shixiaowei02 Shixiaowei02 merged commit a40424f into NVIDIA:main Mar 20, 2026
5 checks passed
@Shixiaowei02 Shixiaowei02 deleted the feat/v2_adapt_4 branch March 20, 2026 09:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants