cuD-PDLP #1391
…he cycle seems to be fixed, cuopt compiles
+ style too
compiles and runs
/ok to test 380fd26
| double restart_k_d = 0.0; | ||
| double restart_i_smooth = 0.3; | ||
| bool use_conditional_major = true; | ||
| bool use_distributed_pdlp = false; |
This should be in the solver_settings, not here. Also, I'm wondering whether the value should be an int, just like the num_gpus we have for LP to run IPM and PDLP on separate GPUs. The default value would be 1; 0 or -1 could mean the automatic/best value determined at runtime; otherwise just use as many GPUs as the user has set.
| int num_gpus{1}; | ||
| // Number of GPUs to use specifically for distributed PDLP (use_distributed_pdlp=true). | ||
| // -1 means auto-detect | ||
| int distributed_pdlp_num_gpus{-1}; |
Yes, this is the right parameter to use.
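A minimal sketch of how the auto-detect convention discussed in this thread could be resolved at runtime. `resolve_num_gpus` is a hypothetical helper, not part of cuOpt, and clamping an oversized request to the visible device count is an assumption rather than something the PR states. The available device count is passed in as a plain integer so the sketch does not depend on the CUDA runtime.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper (not cuOpt API): maps the user-facing setting to a
// concrete GPU count. A value <= 0 (e.g. -1) is treated as "auto-detect".
inline int resolve_num_gpus(int requested, int available)
{
  assert(available >= 1 && "at least one GPU must be visible");
  if (requested <= 0) { return available; }  // auto: use every visible GPU
  return std::min(requested, available);     // assumption: clamp to what exists
}
```

With 8 visible GPUs, a request of -1 would resolve to 8 and a request of 4 would stay at 4.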
| // "kaminpar" - multi-threaded KaMinPar | ||
| std::string distributed_pdlp_partitioner{"auto"}; | ||
| // Set to true inside the shards | ||
| bool is_distributed_sub_pdlp{false}; |
Since this is more of an internal parameter that users shouldn't touch or know about, I think it should live in the pdlp_solver_object. You would then call pdlp_solver_object.set_distributed_sub_pdlp() on it so that this pdlp_solver_object knows it's being called within a multi-GPU context.
| // "auto" - 1 GPU => Dummy; otherwise KaMinPar | ||
| // "dummy" - round-robin, no graph (trivial) | ||
| // "kaminpar" - multi-threaded KaMinPar | ||
| std::string distributed_pdlp_partitioner{"auto"}; |
I think it would make more sense for this to be an enum with specific values rather than a string
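A rough sketch of what the enum-based variant could look like. The names `partitioner_t`, `parse_partitioner`, and `resolve_partitioner` are hypothetical; only the three spellings and the "1 GPU => Dummy; otherwise KaMinPar" rule come from the code comment above.

```cpp
#include <cassert>
#include <optional>
#include <string_view>

// Hypothetical enum replacing the string-valued distributed_pdlp_partitioner.
enum class partitioner_t { Auto, Dummy, KaMinPar };

// Parses the string spelling; returns std::nullopt for unknown names so that
// the caller can report a clear error instead of silently falling back.
inline std::optional<partitioner_t> parse_partitioner(std::string_view name)
{
  if (name == "auto") { return partitioner_t::Auto; }
  if (name == "dummy") { return partitioner_t::Dummy; }
  if (name == "kaminpar") { return partitioner_t::KaMinPar; }
  return std::nullopt;
}

// Resolves Auto as described in the comment: 1 GPU => Dummy, else KaMinPar.
inline partitioner_t resolve_partitioner(partitioner_t p, int num_gpus)
{
  assert(num_gpus >= 1);
  if (p != partitioner_t::Auto) { return p; }
  return num_gpus == 1 ? partitioner_t::Dummy : partitioner_t::KaMinPar;
}
```

Parsing once at the CLI boundary keeps string handling out of the solver and makes an invalid value a detectable error rather than a typo that reaches the partitioning code.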
| // If non-empty, the partition computed for distributed PDLP is written to this | ||
| // path (one part-id per line) right after partitioning. The file can be fed | ||
| // back via multi_gpu_partition_file. | ||
| std::string multi_gpu_export_partition_file{""}; |
What would be the use of such a parameter for a user? I'm not sure I understand.
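For reference, a minimal sketch of the round trip the comment describes (one part-id per line, written after partitioning and read back later). `write_partition` and `read_partition` are hypothetical helpers operating on generic streams, not the PR's implementation, and what each line index refers to is left open here.

```cpp
#include <cassert>
#include <sstream>
#include <vector>

// Writes one part-id per line, matching the format described in the comment.
inline void write_partition(std::ostream& out, const std::vector<int>& part_ids)
{
  for (int p : part_ids) { out << p << '\n'; }
}

// Reads the same format back; stops at end of input or at the first token
// that is not an integer.
inline std::vector<int> read_partition(std::istream& in)
{
  std::vector<int> part_ids;
  int p = 0;
  while (in >> p) {
    assert(p >= 0 && "part-ids are expected to be non-negative");
    part_ids.push_back(p);
  }
  return part_ids;
}
```

One plausible use is to compute a partition once and replay it across runs, so that repeated benchmarks of the same instance skip the partitioning step and see identical shards.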
| // invoked with closure accessors). | ||
| template <typename f_t> | ||
| struct sqrt_inplace_op_t { | ||
| __host__ __device__ f_t operator()(f_t x) const { return raft::sqrt(x); } |
Not super important, but I think we should stop using raft::sqrt and use cuda::std::sqrt instead.
| struct multi_gpu_engine_t { | ||
| // Constructs shards from rank_data. The global (unpartitioned) problem is | ||
| // read straight from `mps`; each shard slices out the entries it owns. | ||
| multi_gpu_engine_t(std::vector<rank_data_t<i_t, f_t>>&& rank_data, |
Was there a specific reason to use && there and not just &?
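For context on this question, the sketch below shows what an rvalue-reference parameter buys on a toy type: the constructor takes ownership of the caller's buffer instead of copying it, whereas a const lvalue reference would require a copy if the object keeps the data. `engine_t` and `make_engine` are hypothetical stand-ins, not the PR's `multi_gpu_engine_t`.

```cpp
#include <cassert>
#include <utility>
#include <vector>

struct engine_t {
  std::vector<int> data;
  // Sink parameter: the caller must pass an rvalue (e.g. std::move(v)),
  // and the vector's buffer is taken over instead of copied.
  explicit engine_t(std::vector<int>&& rank_data) : data(std::move(rank_data)) {}
};

// Hands the contents of v over to a new engine; v is left empty afterwards.
inline engine_t make_engine(std::vector<int>& v)
{
  assert(!v.empty());
  return engine_t(std::move(v));
}
```

If the constructor only needs to read `rank_data` and each shard copies out what it owns, a const reference would be enough; the `&&` form mainly matters when the engine stores the data.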
| { | ||
| for (auto& s : shards) { | ||
| raft::device_setter guard(s->device_id); | ||
| fn(*s); |
Very cool! What is the behavior when fn is asynchronous from a GPU perspective? Will the asynchronous kernels launched within fn be correctly run on the device set above even if the calling thread has left the scope?
| auto& sub = *shard.sub_pdlp; | ||
| // turns the Tuple of lambdas into a tuple of rmm::device_uvector | ||
| auto cub_inputs = std::apply( | ||
| [&sub](auto&... acc) { return cuda::std::make_tuple(acc(sub)...); }, in_accessors); |
Not sure I'm following: where are we turning things into rmm::device_uvector? Also, can't we directly wrap things in a cuda::std::make_tuple?
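To make the pattern in this hunk concrete, here is a host-only sketch using `std::tuple` and `std::apply` in place of their `cuda::std` counterparts. `sub_t`, `gather_inputs`, and the accessors are hypothetical. The point it illustrates is that the resulting tuple holds whatever each accessor returns when called on `sub` (raw pointers in this toy), so the element types are decided by the accessors, not by the `std::apply` call.

```cpp
#include <cassert>
#include <tuple>
#include <vector>

// Toy stand-in for the per-shard solver state.
struct sub_t {
  std::vector<double> primal;
  std::vector<double> dual;
};

// Calls every accessor on `sub` and packs the results into one tuple.
template <typename... Accessors>
auto gather_inputs(sub_t& sub, const std::tuple<Accessors...>& in_accessors)
{
  return std::apply([&sub](const auto&... acc) { return std::make_tuple(acc(sub)...); },
                    in_accessors);
}

// Usage: two accessors, each returning a pointer into the shard's state.
inline void gather_inputs_example()
{
  sub_t s{{1.0, 2.0}, {3.0}};
  auto accessors = std::make_tuple([](sub_t& x) { return x.primal.data(); },
                                   [](sub_t& x) { return x.dual.data(); });
  auto inputs = gather_inputs(s, accessors);
  assert(std::get<0>(inputs) == s.primal.data());
  assert(*std::get<1>(inputs) == 3.0);
}
```

The accessors have to stay in a tuple until a concrete `sub` is available, which is why a direct `make_tuple` over already-evaluated values is not a drop-in replacement unless the call site has the shard at hand.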
/ok to test d8a1fa8
…tings rather than hyper_params
The --use-distributed-pdlp argument is not recognized when I build cuOpt and run cuopt_cli.
/ok to test 1a5b941
/ok to test e267972
All CLI arguments work now, as validated on B200 with up to 8 GPUs.
/ok to test 368b3b3
/ok to test 6948bc5
/ok to test 1563cdc
/ok to test d0de284
Implemented METIS-partitioned multi-GPU PDLP.
To run PDLP on multiple GPUs, run:
./cpp/build/cuopt_cli ../path/to/file.mps --method 1 --use-distributed-pdlp true --presolve 0
The exact number of GPUs used can be set with --distributed-pdlp-num-gpus n.
All benchmarking results against D-PDLP and single-GPU cuOpt can be found in this spreadsheet.
Here is the bottom line of the results.
On 8 NVLinked B200:
against cuOpt:
against D-PDLP:
Note: the speedups against D-PDLP are computed with NVLS_SHARP=0, which disables a feature that could give D-PDLP a speedup of 1.1x to 1.75x. I am working with the compute-lab team to make it work.
Closes #891