cuD-PDLP#1391

Open
Bubullzz wants to merge 143 commits into NVIDIA:main from Bubullzz:cuD-PDLP

Conversation

@Bubullzz commented Jun 4, 2026

Contributor

Implemented METIS-partitioned multi-GPU PDLP.

To run PDLP on multiple GPUs, run:
./cpp/build/cuopt_cli ../path/to/file.mps --method 1 --use-distributed-pdlp true --presolve 0
The exact number of GPUs used can be set with --distributed-pdlp-num-gpus n.

All benchmarking results against D-PDLP and single-GPU cuOpt can be found in this spreadsheet.

Here is the bottom line of the results, on 8 NVLinked B200s:

Against cuOpt:

  • speedup: at least 2.5x and up to 7.08x (tsp-gaia-10m.mps)
  • memory footprint: ~8x on most instances

Against D-PDLP:

  • speedup: slower on most instances but faster on the larger ones (psr_100, tsp-gaia-10m, ELMOD_876_10_noVEname), reaching up to a 2x speedup on ELMOD_876_10_noVEname
  • memory footprint: D-PDLP consistently has a smaller memory footprint than we do, but on the larger instances our extra footprint does not exceed 20%

Note: the speedups against D-PDLP are computed with NVLS_SHARP=0, which disables a feature that could give D-PDLP a 1.1x to 1.75x speedup. I am working with the compute-lab team to make it work.

closes #891

Bubullzz added 30 commits May 7, 2026 15:07
@Bubullzz

Contributor Author

/ok to test 380fd26

double restart_k_d = 0.0;
double restart_i_smooth = 0.3;
bool use_conditional_major = true;
bool use_distributed_pdlp = false;

Contributor

This should be in the solver_settings, not here. Also, I'm wondering whether the value should be an int, just like the num_gpus we have for LP to run IPM and PDLP on separate GPUs. The default value would be 1; 0 or -1 could mean the automatic/best value determined at runtime; otherwise, just use as many GPUs as the user has set.

int num_gpus{1};
// Number of GPUs to use specifically for distributed PDLP (use_distributed_pdlp=true).
// -1 means auto-detect
int distributed_pdlp_num_gpus{-1};

Contributor

Yes this is the right parameter to use
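
The `-1 means auto-detect` convention discussed above could be resolved along the following lines. This is only a sketch: `resolve_num_gpus` is a hypothetical helper, not a cuOpt function, and it assumes that any non-positive request means "use every visible GPU" and that an explicit request is clamped to what is available. In a real build, `available` would typically come from `cudaGetDeviceCount`.

```cpp
#include <algorithm>
#include <stdexcept>

// Hypothetical helper (not part of cuOpt): maps the user-requested GPU count
// to the number of GPUs actually used.
//   requested <= 0  -> automatic: use every available GPU (assumption)
//   requested  > 0  -> use that many, clamped to what is available (assumption)
inline int resolve_num_gpus(int requested, int available)
{
  if (available < 1) { throw std::invalid_argument("no GPU available"); }
  if (requested <= 0) { return available; }
  return std::min(requested, available);
}
```

Clamping silently is a design choice; rejecting an over-large request with an error would be equally defensible.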

// "kaminpar" - multi-threaded KaMinPar
std::string distributed_pdlp_partitioner{"auto"};
// Set to true inside the shards
bool is_distributed_sub_pdlp{false};

Contributor

Since this is more of an internal parameter that users shouldn't touch or know about, I think it should live in the pdlp_solver_object. You would then call pdlp_solver_object.set_distributed_sub_pdlp() on it so that this pdlp_solver_object knows it is being called within a multi-GPU context.
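
A minimal sketch of that suggestion, assuming the flag becomes a private member with a setter. `pdlp_solver_stub` is a hypothetical stand-in for the real solver class; only the method name `set_distributed_sub_pdlp` comes from the comment above, and the getter is an added assumption.

```cpp
// Hypothetical stand-in for the PDLP solver object. The flag is private state
// of the solver rather than a user-facing setting.
class pdlp_solver_stub {
 public:
  // Called by the multi-GPU engine on each shard's solver.
  void set_distributed_sub_pdlp(bool value = true) { is_distributed_sub_pdlp_ = value; }

  // Assumed accessor so internal code paths can branch on the context.
  bool is_distributed_sub_pdlp() const { return is_distributed_sub_pdlp_; }

 private:
  bool is_distributed_sub_pdlp_{false};
};
```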

// "auto" - 1 GPU => Dummy; otherwise KaMinPar
// "dummy" - round-robin, no graph (trivial)
// "kaminpar" - multi-threaded KaMinPar
std::string distributed_pdlp_partitioner{"auto"};

Contributor

I think it would make more sense for this to be an enum with specific values rather than a string
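
One way the enum could look, as a sketch only. The type and function names (`partitioner_t`, `parse_partitioner`, `resolve_partitioner`) are hypothetical; the three values and the "auto" rule (1 GPU => Dummy; otherwise KaMinPar) are taken from the comments in the diff. A string parser is kept so that a CLI could still accept the existing spellings.

```cpp
#include <optional>
#include <string_view>

// Hypothetical enum mirroring the string values documented in the diff.
enum class partitioner_t { Auto, Dummy, KaMinPar };

// Converts a CLI spelling to the enum; returns std::nullopt for unknown names.
inline std::optional<partitioner_t> parse_partitioner(std::string_view name)
{
  if (name == "auto") { return partitioner_t::Auto; }
  if (name == "dummy") { return partitioner_t::Dummy; }
  if (name == "kaminpar") { return partitioner_t::KaMinPar; }
  return std::nullopt;
}

// Applies the documented "auto" rule: 1 GPU => Dummy; otherwise KaMinPar.
inline partitioner_t resolve_partitioner(partitioner_t choice, int num_gpus)
{
  if (choice != partitioner_t::Auto) { return choice; }
  return num_gpus == 1 ? partitioner_t::Dummy : partitioner_t::KaMinPar;
}
```

With an enum, an invalid value is rejected once at parse time instead of being checked wherever the string is read.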

// If non-empty, the partition computed for distributed PDLP is written to this
// path (one part-id per line) right after partitioning. The file can be fed
// back via multi_gpu_partition_file.
std::string multi_gpu_export_partition_file{""};

Contributor

What would be the use of such a parameter for a user? I'm not sure I understand.

// invoked with closure accessors).
template <typename f_t>
struct sqrt_inplace_op_t {
  __host__ __device__ f_t operator()(f_t x) const { return raft::sqrt(x); }

Contributor

Not super important but I think we should stop using raft::sqrt and use cuda::std::sqrt instead

struct multi_gpu_engine_t {
  // Constructs shards from rank_data. The global (unpartitioned) problem is
  // read straight from `mps`; each shard slices out the entries it owns.
  multi_gpu_engine_t(std::vector<rank_data_t<i_t, f_t>>&& rank_data,

Contributor

Was there a specific reason to use && there and not just &?
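
For reference, the usual motivation for an rvalue-reference "sink" parameter is to take ownership of the caller's buffer without copying it. Whether that is the reason here is not stated in the thread; the sketch below only illustrates the general pattern, with hypothetical `rank_data_stub` and `engine_stub` types standing in for the real ones.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for rank_data_t.
struct rank_data_stub {
  std::vector<double> values;
};

// Hypothetical stand-in for the engine: the && parameter forces callers to
// hand over the vector explicitly (std::move), and the engine steals its
// storage instead of copying every element.
struct engine_stub {
  explicit engine_stub(std::vector<rank_data_stub>&& rank_data)
    : rank_data_(std::move(rank_data))
  {
  }

  std::size_t size() const { return rank_data_.size(); }

  std::vector<rank_data_stub> rank_data_;
};
```

With a plain `&` the constructor could still move from the argument, but the call site would no longer show that the caller's vector is consumed.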

{
  for (auto& s : shards) {
    raft::device_setter guard(s->device_id);
    fn(*s);

Contributor

Very cool! What is the behavior when fn is asynchronous from a GPU perspective? Will the asynchronous kernels launched within fn be correctly run on the device set above even if the calling thread has left the scope?

auto& sub = *shard.sub_pdlp;
// turns the Tuple of lambdas into a tuple of rmm::device_uvector
auto cub_inputs = std::apply(
  [&sub](auto&... acc) { return cuda::std::make_tuple(acc(sub)...); }, in_accessors);

Contributor

Not sure I'm following, where are we turning things into rmm::device_uvector? Also can't we directly wrap things in a cuda::std::make_tuple?
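
To make the pattern in the snippet easier to follow, here is a host-only sketch of the same idea using `std::tuple` in place of `cuda::std::tuple`: `std::apply` expands a tuple of accessor callables, calls each one on the same object, and packs the results into a new tuple. `sub_stub` and `gather_inputs` are hypothetical names, and the accessors here return pointers to `std::vector` rather than device buffers.

```cpp
#include <tuple>
#include <vector>

// Hypothetical stand-in for the per-shard solver state.
struct sub_stub {
  std::vector<double> primal;
  std::vector<double> dual;
};

// Calls every accessor in the tuple on `sub` and returns a tuple of the
// results, in the same order as the accessors.
template <typename... Accessors>
auto gather_inputs(sub_stub& sub, std::tuple<Accessors...>& accessors)
{
  return std::apply(
    [&sub](auto&... acc) { return std::make_tuple(acc(sub)...); }, accessors);
}
```

Note that `std::make_tuple` stores decayed values, so an accessor that returns a reference would be copied; returning a pointer or a view avoids that.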

@Bubullzz

Contributor Author

/ok to test d8a1fa8

@bbozkaya

Member

The --use-distributed-pdlp argument is not recognized when I build cuOpt and run cuopt_cli.

@Bubullzz

Contributor Author

/ok to test 1a5b941

@Bubullzz

Contributor Author

/ok to test e267972

@bbozkaya

Member

All CLI arguments work now, as validated on B200 with up to 8 GPUs.

@Bubullzz

Contributor Author

/ok to test 368b3b3

@Bubullzz

Contributor Author

/ok to test 6948bc5

@Bubullzz

Contributor Author

/ok to test 1563cdc

@Bubullzz

Contributor Author

/ok to test d0de284

Labels

feature request, non-breaking, pdlp

Development

Successfully merging this pull request may close these issues.

[FEA] Multi GPU PDLP

3 participants