cuD-PDLP by Bubullzz · Pull Request #1391 · NVIDIA/cuopt

Bubullzz · 2026-06-04T15:24:15Z

Implemented METIS-partitioned multi-GPU PDLP.

To run PDLP on multiple GPUs, run:
./cpp/build/cuopt_cli ../path/to/file.mps --method 1 --use-distributed-pdlp true --presolve 0
The exact number of GPUs used can be set with --distributed-pdlp-num-gpus n.

All benchmarking results against D-PDLP and single-GPU cuOpt can be found in this spreadsheet.

Here is the bottom line of the results, on 8 NVLink-connected B200 GPUs:

Against cuOpt:

Speedup: at least 2.5x and up to 7.08x (tsp-gaia-10m.mps).
Memory footprint: ~8x on most instances.

Against D-PDLP:

Speedup: slower on most instances but faster on the bigger ones (psr_100, tsp-gaia-10m, ELMOD_876_10_noVEname), with up to a 2x speedup on ELMOD_876_10_noVEname.
Memory footprint: D-PDLP consistently has a smaller memory footprint than ours, but on the bigger instances our extra footprint does not exceed 20%.

Note: the speedups against D-PDLP are computed with NVLS_SHARP=0, which disables a feature that could give D-PDLP a 1.1x to 1.75x speedup. I am working with the compute-lab team to make it work.

closes #891

Bubullzz · 2026-06-23T15:13:21Z

/ok to test 380fd26

Kh4ster · 2026-06-23T15:13:52Z

-  double restart_k_d                                              = 0.0;
-  double restart_i_smooth                                         = 0.3;
-  bool use_conditional_major                                      = true;
+  bool use_distributed_pdlp                                       = false;


This should be in the solver_settings, not here. Also, I'm wondering whether the value should be an int, just like the num_gpus we have for LP to run IPM and PDLP on separate GPUs. The default value would be 1; 0 or -1 could mean the automatic/best value determined at runtime; otherwise just use as many GPUs as the user has set.
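
To make the suggested integer semantics concrete, here is a minimal sketch. `resolve_num_gpus` is a hypothetical helper, not an existing cuOpt function, and treating "automatic" as "use every visible device" is an assumption; the real policy could pick a smaller count at runtime. In real code `visible_devices` would typically come from `cudaGetDeviceCount`.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical helper (not part of cuOpt): maps the user-facing GPU count to
// the number of devices actually used.
//   requested <= 0 -> automatic; assumed here to mean "all visible devices"
//   requested >= 1 -> exactly that many, or an error if they are not available
inline int resolve_num_gpus(int requested, int visible_devices)
{
  assert(visible_devices >= 1);  // precondition: at least one GPU is visible
  if (requested <= 0) { return visible_devices; }
  if (requested > visible_devices) {
    throw std::invalid_argument("requested more GPUs than are visible");
  }
  return requested;
}
```

With this shape a single integer setting covers the default single-GPU case (1), auto-detection (0 or -1), and an explicit count, so a separate boolean would not be needed.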

Kh4ster · 2026-06-23T15:14:29Z

  int num_gpus{1};
+  // Number of GPUs to use specifically for distributed PDLP (use_distributed_pdlp=true).
+  // -1 means auto-detect
+  int distributed_pdlp_num_gpus{-1};


Yes this is the right parameter to use

Kh4ster · 2026-06-23T15:16:48Z

+  //   "kaminpar" - multi-threaded KaMinPar
+  std::string distributed_pdlp_partitioner{"auto"};
+  // Set to true inside the shards
+  bool is_distributed_sub_pdlp{false};


Since this is more of an internal parameter that users shouldn't touch or know about, I think it should live in the pdlp_solver_object. You would then call pdlp_solver_object.set_distributed_sub_pdlp() on it so that this pdlp_solver_object knows it's being called within a multi-GPU context.

Kh4ster · 2026-06-23T15:17:11Z

+  //   "auto"     - 1 GPU => Dummy; otherwise KaMinPar
+  //   "dummy"    - round-robin, no graph (trivial)
+  //   "kaminpar" - multi-threaded KaMinPar
+  std::string distributed_pdlp_partitioner{"auto"};


I think it would make more sense for this to be an enum with specific values rather than a string
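
A minimal sketch of what the enum variant could look like, using the three values documented in the diff above. The names `partitioner_t`, `parse_partitioner`, and `resolve_partitioner` are hypothetical; only the string values and the "auto" rule (1 GPU => Dummy, otherwise KaMinPar) come from the comment in the diff.

```cpp
#include <cassert>
#include <optional>
#include <string_view>

// Hypothetical enum replacing the string setting. `auto` is a C++ keyword,
// so the enumerators are capitalized.
enum class partitioner_t { Auto, Dummy, KaMinPar };

// Parses the CLI/config spelling; an unknown name yields std::nullopt so the
// caller can report the error instead of silently falling back.
inline std::optional<partitioner_t> parse_partitioner(std::string_view name)
{
  if (name == "auto") { return partitioner_t::Auto; }
  if (name == "dummy") { return partitioner_t::Dummy; }
  if (name == "kaminpar") { return partitioner_t::KaMinPar; }
  return std::nullopt;
}

// Resolves Auto following the rule in the diff comment:
// 1 GPU => Dummy; otherwise KaMinPar.
inline partitioner_t resolve_partitioner(partitioner_t requested, int num_gpus)
{
  assert(num_gpus >= 1);
  if (requested != partitioner_t::Auto) { return requested; }
  return num_gpus == 1 ? partitioner_t::Dummy : partitioner_t::KaMinPar;
}
```

The string is then validated once at the boundary, and the rest of the solver can switch over a closed set of values.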

Kh4ster · 2026-06-23T15:17:59Z

+  // If non-empty, the partition computed for distributed PDLP is written to this
+  // path (one part-id per line) right after partitioning. The file can be fed
+  // back via multi_gpu_partition_file.
+  std::string multi_gpu_export_partition_file{""};


What would be the use of such a parameter for a user? I'm not sure I understand.

Kh4ster · 2026-06-23T15:21:26Z

+// invoked with closure accessors).
+template <typename f_t>
+struct sqrt_inplace_op_t {
+  __host__ __device__ f_t operator()(f_t x) const { return raft::sqrt(x); }


Not super important but I think we should stop using raft::sqrt and use cuda::std::sqrt instead

Kh4ster · 2026-06-23T15:24:14Z

+struct multi_gpu_engine_t {
+  // Constructs shards from rank_data. The global (unpartitioned) problem is
+  // read straight from `mps`; each shard slices out the entries it owns.
+  multi_gpu_engine_t(std::vector<rank_data_t<i_t, f_t>>&& rank_data,


Was there a specific reason to use && there and not just &?
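
For context on the trade-off behind this question, here is a small sketch contrasting the two signatures. All type names are hypothetical stand-ins (the real `rank_data_t` and `multi_gpu_engine_t` are not fully shown in this thread), and whether the actual engine stores or merely reads `rank_data` is an assumption.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical stand-in for rank_data_t<i_t, f_t>.
struct rank_data_sketch_t {
  std::vector<double> values;
};

// Sink parameter (&&): the engine takes ownership. Callers must pass a
// temporary or write std::move(...) explicitly, so no deep copy can happen
// by accident.
struct owning_engine_sketch_t {
  explicit owning_engine_sketch_t(std::vector<rank_data_sketch_t>&& rank_data)
    : rank_data_(std::move(rank_data))
  {
    assert(!rank_data_.empty());  // assumed precondition: at least one rank
  }
  std::vector<rank_data_sketch_t> rank_data_;
};

// Lvalue reference (&): the engine only borrows the data, so the caller must
// keep the vector alive for as long as the engine exists.
struct borrowing_engine_sketch_t {
  explicit borrowing_engine_sketch_t(std::vector<rank_data_sketch_t>& rank_data)
    : rank_data_(rank_data)
  {
  }
  std::vector<rank_data_sketch_t>& rank_data_;
};
```

If the engine only reads the rank data during construction, a const reference would be enough; the `&&` form mainly pays off when the engine keeps the buffers.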

Kh4ster · 2026-06-23T15:25:50Z

+  {
+    for (auto& s : shards) {
+      raft::device_setter guard(s->device_id);
+      fn(*s);


Very cool! What is the behavior when fn is asynchronous from a GPU perspective? Will the asynchronous kernels launched within fn be correctly run on the device set above even if the calling thread has left the scope?
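
The scope-guard pattern in question can be sketched without CUDA as follows. Everything here is a mock: `current_device` stands in for the calling thread's current CUDA device, and `device_guard_sketch` is a hypothetical analogue of `raft::device_setter`, not its actual implementation. In the CUDA runtime model a kernel is bound to the device that is current when it is launched, so already-enqueued work should stay on that device after the guard restores the previous one; that is worth confirming against the streams actually used inside `fn`.

```cpp
#include <cassert>
#include <vector>

// Mock of the calling thread's "current device" (no real CUDA involved).
thread_local int current_device = 0;

// Scope guard: selects a device on construction and restores the previous
// one on destruction.
class device_guard_sketch {
 public:
  explicit device_guard_sketch(int device) : previous_(current_device)
  {
    assert(device >= 0);
    current_device = device;
  }
  ~device_guard_sketch() { current_device = previous_; }
  device_guard_sketch(const device_guard_sketch&)            = delete;
  device_guard_sketch& operator=(const device_guard_sketch&) = delete;

 private:
  int previous_;
};

// Mock of an asynchronous launch: records the device current at enqueue time.
void enqueue_mock_kernel(std::vector<int>& launch_log) { launch_log.push_back(current_device); }

// Mirrors the loop in the snippet above: one guard per shard, then the call.
std::vector<int> for_each_device(const std::vector<int>& device_ids)
{
  std::vector<int> launch_log;
  for (int id : device_ids) {
    device_guard_sketch guard(id);
    enqueue_mock_kernel(launch_log);
  }
  return launch_log;
}
```

The log captures the device at enqueue time, and the thread's current device is back to its original value once the loop ends.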

Kh4ster · 2026-06-23T15:30:32Z

+      auto& sub = *shard.sub_pdlp;
+      // turns the Tuple of lambdas into a tuple of rmm::device_uvector
+      auto cub_inputs = std::apply(
+        [&sub](auto&... acc) { return cuda::std::make_tuple(acc(sub)...); }, in_accessors);


Not sure I'm following, where are we turning things into rmm::device_uvector? Also can't we directly wrap things in a cuda::std::make_tuple?
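
To illustrate what the `std::apply` call does, here is a host-only sketch that uses `std::make_tuple` in place of `cuda::std::make_tuple`. The solver type and the accessors are hypothetical; in particular, the assumption that each accessor returns a raw pointer is only for illustration, since the real accessors' return types are not shown in this thread.

```cpp
#include <cassert>
#include <tuple>
#include <vector>

// Hypothetical stand-in for the per-shard solver (`sub` in the snippet above).
struct sub_solver_sketch_t {
  std::vector<double> primal;
  std::vector<double> dual;
};

// Calls every accessor on `solver` and packs the results into one tuple.
// The result holds whatever the accessors return (pointers here), not the
// lambdas themselves.
template <typename Solver, typename... Accessors>
auto resolve_accessors(Solver& solver, const std::tuple<Accessors...>& accessors)
{
  return std::apply(
    [&solver](const auto&... acc) { return std::make_tuple(acc(solver)...); }, accessors);
}

inline void example_usage()
{
  sub_solver_sketch_t sub{{1.0, 2.0}, {3.0}};
  auto accessors = std::make_tuple(
    [](sub_solver_sketch_t& s) { return s.primal.data(); },
    [](sub_solver_sketch_t& s) { return s.dual.data(); });
  auto inputs = resolve_accessors(sub, accessors);  // std::tuple<double*, double*>
  assert(std::get<0>(inputs) == sub.primal.data());
  assert(*std::get<1>(inputs) == 3.0);
}
```

The indirection through accessors lets the same tuple of lambdas be evaluated once per shard, which a direct `make_tuple` of already-resolved values would not allow.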

Bubullzz · 2026-06-24T10:28:13Z

/ok to test d8a1fa8

bbozkaya · 2026-06-24T15:51:51Z

The --use-distributed-pdlp argument is not recognized when I build cuOpt and run cuopt_cli.

Bubullzz · 2026-06-25T06:22:59Z

/ok to test 1a5b941

Bubullzz · 2026-06-25T06:24:40Z

/ok to test e267972

bbozkaya · 2026-06-25T13:53:26Z

All CLI arguments work now, as validated on B200 with up to 8 GPUs.

Bubullzz · 2026-06-25T16:14:25Z

/ok to test 368b3b3

Bubullzz · 2026-06-25T16:17:09Z

/ok to test 6948bc5

Bubullzz · 2026-06-25T21:00:11Z

/ok to test 1563cdc

Bubullzz · 2026-06-26T08:28:27Z

/ok to test d0de284

Bubullzz added 30 commits May 7, 2026 15:07

first commit !! added multi_gpu_partition file to solver settings

1e0bd53

slowly skeletonning

978d17b

better shard.cuh

dd0c0ef

wip

2037eca

added a bit of skeleton. Forward declared pdlp_solver in shard.hpp, t…

0f62eff

…he cycle seems to be fixed, cuopt compiles

still wip but going well

d89c85a

cursor broke everything grrr

5534ff0

partition loader now partition loads

dd935c5

big advancements ayo ! We can soon start working on imlementing the s…

09eb20b

…olver !!!

added pre loop setup need to manage boxing

b5ebfd2

+ style too

added distributed transform

0965a60

added semicolon and existing runtime error enum

d4d1cab

added } and fixed cuot_expects in partition loader

6659dd9

small bug fixes

b2ed271

a version that compiles #heheha 😎😎😎😎

50d16ce

removed use of engine:transaform

359d9f4

added multi-gpu SpMV #heheha

910a49a

transformed a transform. it compiles hehe

76c0b3f

updated take step for distributed. compiles but doesnt run. will chec…

5ec7138

…k on main

Merge branch 'main' into cuD-PDLP

1f02afd

support spmvop on multi-gpu

de19f38

compile ready

0030a6c

can run now

172ebc2

passing all tests, good merge

23d0798

fixed the errors hihi, finished distributed part for compte_fixed_error

30881ce

style

c33faf2

now manage halpern update in multi-gpu pdlp

98e0ce6

small fix to calls of multi_gpu_engine_ and scale/unscale solutions.

84128bf

compiles and runs

comments

abe4dd2

added is multi gpu to pdhg

5c41497

Kh4ster reviewed Jun 23, 2026

Merge branch 'main' into cuD-PDLP

d8a1fa8

Bubullzz added 3 commits June 24, 2026 15:54

removed pdlp_disable graph knob and moved use_distributed_pdlp to set…

a77d81b

…tings rather than hyper_params

added warning if not fully nvlink-connected

cb55b7b

style

1a5b941

Merge branch 'main' into cuD-PDLP

e267972

Bubullzz and others added 2 commits June 25, 2026 18:13

Merge branch 'main' into cuD-PDLP

df59a3d

removed not-working warning on non-nvlink system

368b3b3

style

6948bc5

updated assert to allow skeletal master problem to build

1563cdc

Bubullzz and others added 2 commits June 26, 2026 10:27

style

b14ffcc

Merge branch 'main' into cuD-PDLP

d0de284
