After loading a large model with vLLM, the kernel log shows the following errors, stack trace, and core dump:
192.168.11.119: kern: warning: [2026-04-22T21:25:20.458998481Z]: NVRM: GPU at PCI:0000:48:00: GPU-5fc0bcec-c2e1-38dc-8994-2a9f9f3b709e
192.168.11.119: kern: err: [2026-04-22T21:25:20.466440686Z]: nvidia-nvswitch1: SXid (PCI:0000:aa:00.0): 24007, Severity 1 Engine instance 12 Sub-engine instance 00
192.168.11.119: kern: warning: [2026-04-22T21:25:20.473962892Z]: NVRM: GPU Board Serial Number: 1563521020414
192.168.11.119: kern: err: [2026-04-22T21:25:20.500049682Z]: nvidia-nvswitch1: SXid (PCI:0000:aa:00.0): 24007, Data {0x10000000, 0x10008100, 0x10000000, 0x10008100, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000}
192.168.11.119: kern: warning: [2026-04-22T21:25:20.505406921Z]: NVRM: Xid (PCI:0000:48:00): 62, 00002740 00002b08 00001126 0000117a 0000279f 0002a91a 00000011 00000000
192.168.11.119: kern: warning: [2026-04-22T21:25:20.560106789Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000a
192.168.11.119: kern: warning: [2026-04-22T21:25:20.560271408Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000b
192.168.11.119: kern: warning: [2026-04-22T21:25:20.560426825Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000c
192.168.11.119: kern: warning: [2026-04-22T21:25:20.560534061Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000d
192.168.11.119: kern: warning: [2026-04-22T21:25:20.560650492Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000e
192.168.11.119: kern: warning: [2026-04-22T21:25:20.56075873Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000f
192.168.11.119: kern: warning: [2026-04-22T21:25:20.560896391Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x00000010
192.168.11.119: kern: warning: [2026-04-22T21:25:20.561005312Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x00000011
192.168.11.119: kern: warning: [2026-04-22T21:26:20.562739402Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000a caused by previous Xid 45
192.168.11.119: kern: warning: [2026-04-22T21:26:20.584539473Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000b caused by previous Xid 45
192.168.11.119: kern: warning: [2026-04-22T21:26:20.599024717Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000c caused by previous Xid 45
192.168.11.119: kern: warning: [2026-04-22T21:26:20.599137616Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000d caused by previous Xid 45
192.168.11.119: kern: warning: [2026-04-22T21:26:20.599257853Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000e caused by previous Xid 45
192.168.11.119: kern: warning: [2026-04-22T21:26:20.599366618Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000f caused by previous Xid 45
192.168.11.119: kern: warning: [2026-04-22T21:26:20.599488292Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x00000010 caused by previous Xid 45
192.168.11.119: kern: warning: [2026-04-22T21:26:20.599595837Z]: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x00000011 caused by previous Xid 45
192.168.11.119: kern: warning: [2026-04-22T21:26:20.599716692Z]: NVRM: Xid (PCI:0000:48:00): 74, pid=22884, name=VLLM::Worker_TP, NVLink: fatal error detected on link 8(0x0, 0x0, 0x10000, 0x0, 0x0, 0x0, 0x0)
192.168.11.119: kern: warning: [2026-04-22T21:26:20.695845708Z]: NVRM: Xid (PCI:0000:48:00): 154, GPU recovery action changed from 0x0 (None) to 0x1 (GPU Reset Required)
192.168.11.119: kern: warning: [2026-04-22T21:27:51.261401334Z]: NVRM: _kgspLogXid119: ********************************* GSP Timeout **********************************
192.168.11.119: kern: warning: [2026-04-22T21:27:51.271840481Z]: NVRM: _kgspLogXid119: Note: Please also check logs above.
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288755476Z]: NVRM: GPU at PCI:0000:0b:00: GPU-ff3dfd6d-7042-cc8c-f3a6-84d9d6eb3a4b
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288764025Z]: NVRM: GPU Board Serial Number: 1563521019267
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288777915Z]: NVRM: Xid (PCI:0000:0b:00): 119, pid=11431, name=gpu-feature-dis, Timeout after 45s of waiting for RPC response from GPU3 GSP! Expected function 76 (GSP_RM_CONTROL) sequence 4127 (0x20803039 0xb0).
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288813021Z]: NVRM: GPU3 GSP RPC buffer contains function 76 (GSP_RM_CONTROL) sequence 4127 and data 0x0000000020803039 0x00000000000000b0.
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288821491Z]: NVRM: GPU3 RPC history (CPU -> GSP):
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288824425Z]: NVRM: entry function sequence data0 data1 ts_start ts_end duration actively_polling
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288830314Z]: NVRM: 0 76 GSP_RM_CONTROL 4127 0x0000000020803039 0x00000000000000b0 0x00065013160bb6e0 0x0000000000000000 y
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288836628Z]: NVRM: -1 76 GSP_RM_CONTROL 4126 0x000000002080a026 0x0000000000000214 0x00065013118cf0d0 0x00065013118cf158 136us
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288844547Z]: NVRM: -2 76 GSP_RM_CONTROL 4125 0x000000002080a084 0x0000000000000004 0x00065013118cf063 0x00065013118cf0c2 95us
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288848661Z]: NVRM: -3 76 GSP_RM_CONTROL 4124 0x0000000020809001 0x0000000000000008 0x00065013118ceff5 0x00065013118cf053 94us
192.168.11.119: kern: warning: [2026-04-22T21:27:51.28885631Z]: NVRM: -4 76 GSP_RM_CONTROL 4123 0x0000000020809064 0x0000000000000208 0x00065013118cef7b 0x00065013118cefe3 104us
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288860424Z]: NVRM: -5 76 GSP_RM_CONTROL 4122 0x000000002080a026 0x0000000000000214 0x00065013118ceecc 0x00065013118cef56 138us
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288869094Z]: NVRM: -6 76 GSP_RM_CONTROL 4121 0x000000002080a084 0x0000000000000004 0x00065013118cee5c 0x00065013118ceeb9 93us
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288873188Z]: NVRM: -7 76 GSP_RM_CONTROL 4120 0x0000000020809001 0x0000000000000008 0x00065013118cede3 0x00065013118cee4b 104us
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288880739Z]: NVRM: GPU3 RPC event history (CPU <- GSP):
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288883742Z]: NVRM: entry function sequence data0 data1 ts_start ts_end duration during_incomplete_rpc
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288889732Z]: NVRM: 0 4098 GSP_RUN_CPU_SEQUENCER 0 0x00000000000001ea 0x0000000000003fe2 0x00065011fbc021a9 0x00065011fbc02a48 2207us
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288900158Z]: CPU: 3 UID: 0 PID: 11833 Comm: gpu-feature-dis Tainted: G O 6.18.18-talos #1 NONE
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288909147Z]: Tainted: [O]=OOT_MODULE
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288912031Z]: Hardware name: HPE ProLiant XL675d Gen10 Plus/ProLiant XL675d Gen10 Plus, BIOS A47 08/07/2024
192.168.11.119: kern: warning: [2026-04-22T21:27:51.28891896Z]: Call Trace:
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288924814Z]: <TASK>
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288933332Z]: dump_stack_lvl+0x5d/0x90
192.168.11.119: kern: warning: [2026-04-22T21:27:51.288946276Z]: _kgspRpcRecvPoll+0x725/0x870 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.289079746Z]: _issueRpcAndWait+0xdd/0x970 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.289152639Z]: ? srso_alias_return_thunk+0x5/0xfbef5
192.168.11.119: kern: warning: [2026-04-22T21:27:51.289162609Z]: ? osGetCurrentThread+0x26/0x60 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.289295522Z]: ? rmDeviceGpuLockIsOwner+0x29/0x90 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.289426652Z]: ? srso_alias_return_thunk+0x5/0xfbef5
192.168.11.119: kern: warning: [2026-04-22T21:27:51.289431635Z]: rpcRmApiControl_GSP+0x76f/0x940 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.289507484Z]: knvlinkExecGspRmRpc_IMPL+0x68/0x140 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.289597368Z]: knvlinkSyncLinkMasksAndVbiosInfo_IMPL+0xb7/0x1a0 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.289686107Z]: nvlinkCtrlCmdBusGetNvlinkCaps+0x92/0x630 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.28977Z]: kceGetCeFromNvlinkConfig_IMPL+0x49/0xe0 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.28990033Z]: knvlinkGetP2POptimalCEs_GP100+0x6c/0xf0 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.289993324Z]: CliGetSystemP2pCaps+0x395/0x630 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.290071203Z]: ? CliGetSystemP2pCaps+0x11d/0x630 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.290147077Z]: cliresCtrlCmdSystemGetP2pCapsV2_IMPL+0xa2/0xf0 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.290223016Z]: resControl_IMPL+0x1a9/0x1b0 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.290282909Z]: serverControl+0x47e/0x590 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.290345728Z]: _rmapiRmControl+0x4f2/0x820 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.290423802Z]: rmapiControlWithSecInfo+0x79/0x140 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.290497731Z]: ? srso_alias_return_thunk+0x5/0xfbef5
192.168.11.119: kern: warning: [2026-04-22T21:27:51.290502725Z]: rmapiControlWithSecInfoTls+0x8f/0xf0 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.290578033Z]: _nv04ControlWithSecInfo+0x8d/0xa0 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.290649057Z]: ? srso_alias_return_thunk+0x5/0xfbef5
192.168.11.119: kern: warning: [2026-04-22T21:27:51.290656345Z]: ? cred_has_capability.isra.0+0xa4/0x170
192.168.11.119: kern: warning: [2026-04-22T21:27:51.29066627Z]: RmIoctl+0x90b/0xda0 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.290800879Z]: ? srso_alias_return_thunk+0x5/0xfbef5
192.168.11.119: kern: warning: [2026-04-22T21:27:51.290805753Z]: ? os_acquire_spinlock+0x12/0x30 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.29087688Z]: ? srso_alias_return_thunk+0x5/0xfbef5
192.168.11.119: kern: warning: [2026-04-22T21:27:51.290880774Z]: ? portSyncSpinlockAcquire+0x18/0x30 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.290942992Z]: ? rm_ioctl+0x52/0x4f0 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291069875Z]: ? srso_alias_return_thunk+0x5/0xfbef5
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291078766Z]: rm_ioctl+0x66/0x4f0 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.29120365Z]: ? __check_object_size+0x215/0x230
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291213998Z]: nvidia_unlocked_ioctl+0x447/0x950 [nvidia]
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291273871Z]: __x64_sys_ioctl+0x9f/0x100
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291283179Z]: ? srso_alias_return_thunk+0x5/0xfbef5
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291288033Z]: do_syscall_64+0x78/0x940
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291307702Z]: entry_SYSCALL_64_after_hwframe+0x76/0x7e
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291312635Z]: RIP: 0033:0x7f8079a2d67b
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291320264Z]: Code: ff ff ff 85 c0 79 9b 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 6d 57 0f 00 f7 d8 64 89 01 48
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291324508Z]: RSP: 002b:00007f8005baec88 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291332297Z]: RAX: ffffffffffffffda RBX: 0000000000000020 RCX: 00007f8079a2d67b
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291335531Z]: RDX: 00007f8005baedd0 RSI: 00000000c020462a RDI: 000000000000000b
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291342611Z]: RBP: 00007f8005baece0 R08: 00007f8005baedd0 R09: 00007f8005baedec
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291345594Z]: R10: 0000000000000000 R11: 0000000000000246 R12: 00007f8005baedd0
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291351073Z]: R13: 000000000000000b R14: 00000000c020462a R15: 00007f8005baeca0
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291356167Z]: </TASK>
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291364354Z]: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(0) = 0x00000000
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291370308Z]: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(1) = 0x00000000
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291378137Z]: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(2) = 0x00000000
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291383081Z]: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(3) = 0x00000000
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291539511Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: riscvPc : 00000000
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291543565Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: riscvCpuctl : 00000040
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291549985Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: riscvIrqmask : 00040040
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291553129Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: riscvIrqdest : 00000040
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291559707Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: riscvPrivErrStat : 00000000
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291562791Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: riscvPrivErrInfo : badf1500
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291568069Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: riscvPrivErrAddr : 0000000001e19e20
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291572173Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: riscvHubErrStat : 00000000
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291577631Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: falconMailbox : 0:00000000 1:00000000
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291581675Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: falconIrqstat : 00009000
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291587264Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: falconIrqmode : 0000fc24
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291590368Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: fbifInstblk : 00000000
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291596327Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: fbifCtl : 00000090
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291600721Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: fbifThrottle : 80000064
192.168.11.119: kern: warning: [2026-04-22T21:27:51.29160622Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: fbifAchkBlk : 0:48215480 1:50125638
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291609354Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: fbifAchkCtl : 0:00000000 1:00000000
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291616123Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: fbifCg1 : 0000000f
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291619217Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 00 = 0x0000000005c27ca8
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291625875Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 01 = 0x0000000005c366cc
192.168.11.119: kern: warning: [2026-04-22T21:27:51.29162931Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 02 = 0x000000000400a35c
192.168.11.119: kern: warning: [2026-04-22T21:27:51.29163573Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 03 = 0x0000000005c366c0
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291638994Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 04 = 0x0000000004d4b1c0
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291645653Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 05 = 0x0000000004d3f670
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291648737Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 06 = 0x000000000400a35c
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291657622Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 07 = 0x0000000004d3f5c0
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291660856Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 08 = 0x0000000004d4b1a0
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291666514Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 09 = 0x0000000005c37b14
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291670789Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 10 = 0x0000000005c39948
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291676137Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 11 = 0x0000000005a07fe0
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291679812Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 12 = 0x000000000400a35c
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291686132Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 13 = 0x0000000005a0804c
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291689526Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 14 = 0x0000000005c398c0
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291698842Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 15 = 0x0000000005c39c60
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291703737Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 16 = 0x0000000005c39e78
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291791108Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 17 = 0x000000000535bbf8
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291794483Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 18 = 0x0000000004d9cab0
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291805401Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 19 = 0x000000000535bbcc
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291810409Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 20 = 0x0000000004d9cacc
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291829641Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 21 = 0x000000000535bb20
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291836572Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 22 = 0x0000000004d9cacc
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291844715Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 23 = 0x000000000535bb20
192.168.11.119: kern: warning: [2026-04-22T21:27:51.29184809Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 24 = 0x0000000004d9cacc
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291855031Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 25 = 0x000000000535bb20
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291859535Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 26 = 0x0000000004d9cacc
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291867569Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 27 = 0x000000000535bb20
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291870923Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 28 = 0x0000000004d9cacc
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291879187Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 29 = 0x000000000535bb20
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291882381Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 30 = 0x0000000004d9cacc
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291891544Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 31 = 0x000000000535bb20
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291894608Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 32 = 0x0000000004d9cacc
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291903372Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 33 = 0x000000000535bb20
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291906316Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: TRACE: 34 = 0x0000000004d9cacc
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291912866Z]: NVRM: _kgspLogXid119: ********************************************************************************
192.168.11.119: kern: warning: [2026-04-22T21:27:51.291919134Z]: NVRM: _issueRpcAndWait: rpcRecvPoll timedout for fn 76 sequence 4127!
192.168.11.119: kern: warning: [2026-04-22T21:29:21.342163186Z]: NVRM: Xid (PCI:0000:0b:00): 119, pid=11431, name=gpu-feature-dis, Timeout after 45s of waiting for RPC response from GPU3 GSP! Expected function 76 (GSP_RM_CONTROL) sequence 4128 (0x20803039 0xb0).
192.168.11.119: kern: warning: [2026-04-22T21:29:21.360926643Z]: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(0) = 0x00000000
192.168.11.119: kern: warning: [2026-04-22T21:29:21.373933805Z]: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(1) = 0x00000000
192.168.11.119: kern: warning: [2026-04-22T21:29:21.373945078Z]: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(2) = 0x00000000
192.168.11.119: kern: warning: [2026-04-22T21:29:21.373963996Z]: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(3) = 0x00000000
192.168.11.119: kern: warning: [2026-04-22T21:29:21.373980741Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: riscvPc : 00000000
192.168.11.119: kern: warning: [2026-04-22T21:29:21.37399493Z]: NVRM: _issueRpcAndWait: rpcRecvPoll timedout for fn 76 sequence 4128!
192.168.11.119: kern: warning: [2026-04-22T21:30:51.494604223Z]: NVRM: Xid (PCI:0000:0b:00): 119, pid=11431, name=gpu-feature-dis, Timeout after 45s of waiting for RPC response from GPU3 GSP! Expected function 76 (GSP_RM_CONTROL) sequence 4129 (0x20803039 0xb0).
192.168.11.119: kern: warning: [2026-04-22T21:30:51.513376162Z]: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(0) = 0x00000000
192.168.11.119: kern: warning: [2026-04-22T21:30:51.526390847Z]: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(1) = 0x00000000
192.168.11.119: kern: warning: [2026-04-22T21:30:51.526401628Z]: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(2) = 0x00000000
192.168.11.119: kern: warning: [2026-04-22T21:30:51.526418955Z]: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(3) = 0x00000000
192.168.11.119: kern: warning: [2026-04-22T21:30:51.52643656Z]: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: riscvPc : 00000000
192.168.11.119: kern: warning: [2026-04-22T21:30:51.526447914Z]: NVRM: nvAssertFailedNoLog: Assertion failed: Back to back GSP RPC timeout detected! GPU marked for reset @ kernel_gsp.c:2387
192.168.11.119: kern: warning: [2026-04-22T21:30:51.526483418Z]: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: Core is booted.
192.168.11.119: kern: warning: [2026-04-22T21:30:51.526530125Z]: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: RSTAT3 0x0000000000000000
192.168.11.119: kern: warning: [2026-04-22T21:30:51.526545409Z]: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: RSTAT4 0x0000000000000000
192.168.11.119: kern: warning: [2026-04-22T21:30:51.795769959Z]: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: [ERROR] ICD Halt command failed.
192.168.11.119: kern: warning: [2026-04-22T21:30:51.803749334Z]: NVRM: _issueRpcAndWait: rpcRecvPoll timedout for fn 76 sequence 4129!
192.168.11.119: kern: warning: [2026-04-22T21:30:51.870495804Z]: NVRM: Xid (PCI:0000:0b:00): 154, GPU recovery action changed from 0x0 (None) to 0x1 (GPU Reset Required)
192.168.11.119: kern: warning: [2026-04-22T21:37:31.524565304Z]: NVRM: nvAssertOkFailedNoLog: Assertion failed: Reset required [NV_ERR_RESET_REQUIRED] (0x00000062) returned from pRmApi->Control(pRmApi, RES_GET_CLIENT_HANDLE(pKernelChannel), RES_GET_HANDLE(pKernelChannel), NVA06F_CTRL_CMD_STOP_CHANNEL, &stopChannelParams, sizeof(stopChannelParams)) @ nv_gpu_ops.c:10957
192.168.11.119: kern: warning: [2026-04-22T21:37:31.583305494Z]: NVRM: nvAssertOkFailedNoLog: Assertion failed: Reset required [NV_ERR_RESET_REQUIRED] (0x00000062) returned from pRmApi->Control(pRmApi, RES_GET_CLIENT_HANDLE(pKernelChannel), RES_GET_HANDLE(pKernelChannel), NVA06F_CTRL_CMD_STOP_CHANNEL, &stopChannelParams, sizeof(stopChannelParams)) @ nv_gpu_ops.c:10957
Loading safetensors checkpoint shards: 100% Completed | 64/64 [13:08<00:00, 12.33s/it]
(Worker_TP0 pid=370)
(Worker_TP0 pid=370) INFO 04-22 21:17:55 [default_loader.py:384] Loading weights took 788.87 seconds
(Worker_TP0 pid=370) INFO 04-22 21:18:02 [gpu_model_runner.py:4820] Model loading took 72.02 GiB memory and 798.933240 seconds
(Worker_TP0 pid=370) INFO 04-22 21:18:02 [gpu_model_runner.py:5753] Encoder cache will be initialized with a budget of 4225 tokens, and profiled with 1 vision_chunk items of the maximum feature size.
(Worker_TP0 pid=370) INFO 04-22 21:18:22 [backends.py:1051] Using cache directory: /root/.cache/vllm/torch_compile_cache/1924dfafa8/rank_0_0/backbone for vLLM's torch.compile
(Worker_TP0 pid=370) INFO 04-22 21:18:22 [backends.py:1111] Dynamo bytecode transform time: 13.40 s
(Worker_TP0 pid=370) INFO 04-22 21:18:27 [backends.py:372] Cache the graph of compile range (1, 2048) for later use
(Worker_TP0 pid=370) INFO 04-22 21:18:38 [backends.py:390] Compiling a graph for compile range (1, 2048) takes 15.04 s
(Worker_TP0 pid=370) INFO 04-22 21:18:43 [decorators.py:655] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/f088404224c4ce7d97f6740f4677a90f65e9ce57bb5c9143590028f4b2dd20cc/rank_0_0/model
(Worker_TP0 pid=370) INFO 04-22 21:18:43 [monitor.py:48] torch.compile took 34.01 s in total
(EngineCore pid=299) INFO 04-22 21:19:03 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore pid=299) INFO 04-22 21:20:03 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore pid=299) INFO 04-22 21:21:03 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(Worker_TP2 pid=372) ERROR 04-22 21:21:40 [multiproc_executor.py:949] WorkerProc hit an exception.
(...)
(Worker_TP2 pid=372) ERROR 04-22 21:21:40 [multiproc_executor.py:949] RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, std::is_same_v<C_Dtype, float> ? CUDA_R_32F : CUDA_R_16BF, ldc, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
NVIDIA Open GPU Kernel Modules Version
580.126.20
Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.
Operating System and Version
Talos OS v1.12.6
Kernel Release
6.18.18-talos
Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.
Hardware: GPU
8x A100 SXM4
Describe the bug
After loading a large model with vLLM, the kernel log shows the errors, stack trace, and core dump reproduced at the top of this report.
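For triage, the Xid events in the kernel log at the top of this report can be grouped by GPU and Xid code. The following is a minimal Python sketch, assuming every relevant line follows the `NVRM: Xid (PCI:<address>): <code>,` pattern seen in this report. `summarize_xids` is a hypothetical helper name, not part of any NVIDIA tool, and the NVSwitch `SXid` lines are deliberately not matched.

```python
import re
from collections import Counter

# Matches lines such as:
#   NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000a
# NVSwitch "SXid" lines are logged by nvidia-nvswitch, not NVRM, so they are skipped.
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+),")


def summarize_xids(log_text):
    """Count Xid events per (PCI address, Xid code) in a kernel log dump."""
    counts = Counter()
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


if __name__ == "__main__":
    # Shortened sample lines in the same format as the log in this report.
    sample = """\
kern: warning: NVRM: Xid (PCI:0000:48:00): 62, 00002740 00002b08
kern: warning: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000a
kern: warning: NVRM: Xid (PCI:0000:48:00): 45, pid=22884, name=VLLM::Worker, channel 0x0000000b
kern: warning: NVRM: Xid (PCI:0000:0b:00): 119, pid=11431, name=gpu-feature-dis, Timeout after 45s
"""
    for (pci, xid), n in sorted(summarize_xids(sample).items()):
        print(f"{pci}  Xid {xid:>3}  x{n}")
```

Feeding the full kernel log into `summarize_xids` makes it easier to see which GPU reported which Xid codes first, for example the GPU at `PCI:0000:48:00` versus the one at `PCI:0000:0b:00`.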
To Reproduce
Use vLLM to load a large model such as Kimi 2.6.
See also:
vllm-project/vllm#40652
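The reproduction step can be sketched as a command line. This is an assumption-laden sketch rather than the exact invocation used: it assumes the standard `vllm serve` entry point with its `--tensor-parallel-size` option, and it assumes tensor parallelism across all 8 GPUs (the logs show `Worker_TP0` and `Worker_TP2`, but the actual size is not stated in this report). The model ID is a placeholder and `build_vllm_command` is a hypothetical helper.

```python
import shlex


def build_vllm_command(model_id, tensor_parallel_size=8):
    """Build an argv list for `vllm serve` (assumed launch method)."""
    return [
        "vllm", "serve", model_id,
        "--tensor-parallel-size", str(tensor_parallel_size),
    ]


if __name__ == "__main__":
    # "<model-id>" is a placeholder; substitute the checkpoint actually used.
    print(shlex.join(build_vllm_command("<model-id>")))
```

Any further flags used in the failing deployment should be added to the list so that the reproduction matches the original run.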
Bug Incidence
Always
nvidia-bug-report.log.gz
Not possible due to Talos OS.
More Info
Model: moonshotai/Kimi 2.6
vLLM Container Version: vllm/vllm-openai:v0.19.1
Talos OS v1.12.6
NVIDIA Toolkit: nvidia-container-toolkit-lts=580.126.20-v1.18.2
Fabric Manager: nvidia-fabricmanager-lts=580.126.20
Drivers: nvidia-open-gpu-kernel-modules-lts=580.126.20-v1.12.6
nvidia-gdrdrv-device=v2.5.1
vLLM logs (reproduced above, after the kernel log):