[TRTLLM-9523][chore] Refactor the transfer logic (step 6)#12231
[TRTLLM-9523][chore] Refactor the transfer logic (step 6)#12231Shixiaowei02 merged 2 commits intoNVIDIA:mainfrom
Conversation
|
/bot help |
GitHub Bot Help
Provide a user friendly way for developers to interact with a Jenkins server. Run See details below for each supported subcommand. Details
Launch build/test pipelines. All previously running jobs will be killed.
kill
Kill all running builds associated with pull request. skip
Skip testing for latest commit on pull request. reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break. |
|
/bot run --add-multi-gpu-test --disable-fail-fast |
eb43f70 to
81c5775
Compare
|
/bot run --add-multi-gpu-test --disable-fail-fast |
|
PR_Github #39014 [ run ] triggered by Bot. Commit: |
|
PR_Github #39016 [ run ] triggered by Bot. Commit: |
bb63e6c to
a27265e
Compare
8611c48 to
428c20c
Compare
428c20c to
6790bfb
Compare
|
/bot run --add-multi-gpu-test --disable-fail-fast |
|
PR_Github #39073 [ run ] triggered by Bot. Commit: |
📝 WalkthroughWalkthroughThis PR refactors the KV transfer session abstractions and transfer orchestration system. It replaces status enums, introduces new base classes ( Changes
Sequence Diagram(s)sequenceDiagram
participant App as Application
participant TxS as TxSession
participant Sndr as Sender
participant RxS as RxSession
participant Rcvr as Receiver
participant Net as Network
App->>TxS: send(slice)
TxS->>Sndr: Create KVSendTask
Sndr->>Sndr: Build WriteMeta with Future
Sndr->>Net: Send KV data
Net-->>Sndr: KV_AGENT_RESULT event
Sndr-->>TxS: Resolve Future
TxS-->>App: Future.result()
Net->>RxS: Receive KV data
RxS->>Rcvr: receive(slice)
Rcvr->>Rcvr: Process data
Rcvr-->>RxS: Resolve Future
RxS-->>App: Future.result()
App->>TxS: send_aux()
TxS->>Sndr: Create AuxSendTask
Sndr->>Net: Send auxiliary data
Net-->>Sndr: AUX_AGENT_RESULT event
Sndr->>TxS: Update session status
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 1 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
📝 Coding Plan
Comment Tip CodeRabbit can scan for known vulnerabilities in your dependencies using OSV Scanner.OSV Scanner will automatically detect and report security vulnerabilities in your project's dependencies. No additional configuration is required. |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tensorrt_llm/_torch/disaggregation/native/peer.py`:
- Around line 123-126: get_pool_mapping() currently fabricates a synthetic
mapping when either page table (self_pt or peer_pt) is None which leads
Sender._build_kv_write_meta() to iterate that entry and call
extract()/get_kv_map() (and get_kv_map() asserts both page tables exist).
Instead, change the None-page-table branch in get_pool_mapping() to
short-circuit with an empty mapping (e.g., return {} and cache it) so no KV
fragments are produced; also audit other identical branches (the similar branch
around lines 207-208) and apply the same empty-mapping short-circuit so callers
like Sender._build_kv_write_meta() never attempt to extract or call get_kv_map()
when page tables are missing.
In `@tensorrt_llm/_torch/disaggregation/native/py_cache_transceiver.py`:
- Around line 355-365: The check_gen_transfer_status() loop incorrectly treats a
successful recv_futures[request_id].result() (which only represents
recv_session.receive(kv_slice)) as full-session completion; change the logic to
also inspect the session state (use recv_session.status /
RxSession.FULLY_TRANSFERRED) before appending to completed_request_ids so only
sessions with status FULLY_TRANSFERRED are promoted, otherwise append to
failed_request_ids or leave for later; specifically update the branch that
checks sync_status from recv_futures[request_id] to require recv_session.status
== RxSession.FULLY_TRANSFERRED before setting DISAGG_GENERATION_TRANS_COMPLETE
and calling unpack_aux(), and ensure sessions in KV_TRANSFERRED or ERROR are not
treated as completed.
In `@tensorrt_llm/_torch/disaggregation/native/transfer.py`:
- Around line 992-1007: The receiver is dropping the transmitted slice
identifier so all KV responses collapse into index 0; in
_process_kv_agent_result (which calls decode_message and checks
MessageType.KV_AGENT_RESULT and _get_session) extract and convert the slice_id
from the decoded message (the field currently ignored between unique_rid and
is_last_slice_str) and pass that slice index into
session.process_kv_agent_result instead of always resolving index 0; update
RxSession.process_kv_agent_result to use the provided slice_id to resolve the
correct future in _kv_tasks (and apply the same fix to the analogous handling at
the other occurrence around the 1094-1103 region).
- Around line 759-767: The tests and in-tree callers expect send() and receive()
to return the integer index into the _kv_tasks list, but the new implementations
return a concurrent.futures.Future; restore the previous return contract by
returning the slice_id (an int) from send() and receive() while keeping
task.future available on the task object, i.e., after creating task and
appending to self.kv_tasks, return slice_id instead of task.future; apply the
same change for the other implementation block around the 1080-1092 region so
all callers continue to receive an int index.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 1dab76b8-8c83-48c5-ab37-586f79d48cbd
📒 Files selected for processing (6)
tensorrt_llm/_torch/disaggregation/base/transfer.pytensorrt_llm/_torch/disaggregation/native/peer.pytensorrt_llm/_torch/disaggregation/native/py_cache_transceiver.pytensorrt_llm/_torch/disaggregation/native/transfer.pytests/unittest/disaggregated/test_kv_transfer.pytests/unittest/disaggregated/test_kv_transfer_mp.py
|
/bot run --add-multi-gpu-test --disable-fail-fast |
|
PR_Github #39079 [ run ] triggered by Bot. Commit: |
|
PR_Github #39079 [ run ] completed with state
|
d521906 to
99ceb96
Compare
|
/bot run --add-multi-gpu-test --disable-fail-fast |
|
PR_Github #39488 [ run ] triggered by Bot. Commit: |
99ceb96 to
5d63aac
Compare
|
/bot run --add-multi-gpu-test --disable-fail-fast |
|
PR_Github #39536 [ run ] triggered by Bot. Commit: |
|
PR_Github #39536 [ run ] completed with state
|
|
/bot run --stage-list "GB200-4_GPUs-PyTorch-1" |
|
PR_Github #39592 [ run ] triggered by Bot. Commit: |
|
PR_Github #39592 [ run ] completed with state |
|
/bot skip --comment "CI passed at 39536 and 39592 in the assembled form." |
|
PR_Github #39594 [ skip ] triggered by Bot. Commit: |
|
PR_Github #39594 [ skip ] completed with state |
Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
5d63aac to
30113e4
Compare
Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
|
/bot run --add-multi-gpu-test --disable-fail-fast |
|
PR_Github #39668 [ run ] triggered by Bot. Commit: |
|
PR_Github #39668 [ run ] completed with state
|
|
/bot skip --comment "Confirmed with infra, pipeline #39668 actually passed." |
|
PR_Github #39728 [ skip ] triggered by Bot. Commit: |
|
PR_Github #39728 [ skip ] completed with state |
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
DEALERsocket.Sender/Receiverhandling ofkv_tasks.try-catchto threads.Sender/Receiver.Nonepassed toint,__del__called after ctor failure).ReqInfoManagerandReceiverto avoid deadlock.Refactor
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.