Fix zmq and ROSS compilation issues by sanjaychari · Pull Request #240 · codes-org/codes

sanjaychari · 2026-05-21T14:36:52Z

The kronos-develop-director-b branch of CODES was using an outdated version of ROSS and also had compilation issues because of zeromq and CUDA. This PR changes it to be compatible with the master branch of ROSS and fixes the zeromq and CUDA compilation issues.

Calling `jobmap_list_to_local` used to walk the entire list of IDs until it found a match, which is O(n) in both the average and worst cases. For small networks this takes little time, so it never surfaced as an issue. On larger network simulations, at 8K nodes, it caused a significant slowdown, and extensive profiling identified this function as the principal culprit. The fix is simple: build a table in which looking up an ID is O(1). A plain array does the trick. In experiments, this gave a speedup of 30% for a network of 8448 nodes with a job using all nodes. The job was uniform random and the simulation ran for 10 ms of virtual time.
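
The idea can be sketched roughly as follows. This is not the CODES implementation: `struct job_ranks`, `build_reverse_table`, `to_local_linear` and `to_local_table` are hypothetical names, and the sketch assumes global IDs are small non-negative integers in `[0, num_global)`, so a plain array can be indexed by them directly.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a job's rank list: global_ids[local] = global ID. */
struct job_ranks {
    int  num_ranks;
    int *global_ids;
};

/* Linear scan, O(n) per lookup. Returns -1 when the ID is not in the job. */
int to_local_linear(const struct job_ranks *job, int global_id)
{
    for (int i = 0; i < job->num_ranks; i++)
        if (job->global_ids[i] == global_id)
            return i;
    return -1;
}

/* Build once: table[global] = local ID, or -1 if that global ID is unused. */
int *build_reverse_table(const struct job_ranks *job, int num_global)
{
    int *table = malloc((size_t)num_global * sizeof *table);
    if (table == NULL)
        return NULL;
    for (int g = 0; g < num_global; g++)
        table[g] = -1;
    for (int i = 0; i < job->num_ranks; i++) {
        /* Assumption: every global ID fits inside the table. */
        assert(job->global_ids[i] >= 0 && job->global_ids[i] < num_global);
        table[job->global_ids[i]] = i;
    }
    return table;
}

/* Array lookup, O(1) per call. Out-of-range IDs map to -1. */
int to_local_table(const int *table, int num_global, int global_id)
{
    if (global_id < 0 || global_id >= num_global)
        return -1;
    return table[global_id];
}
```

The table costs one `int` per global ID and one O(n) pass to build, which pays off as soon as lookups happen on every event rather than once per job.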

This bug was introduced when building the network surrogate. To build the surrogate, we need to track the input queue "size" (the input message queue from the workloads to the routers). The network surrogate currently lives inside a specific network model (so far it is implemented only in dragonfly-dally), but it should really reside within the model-net layer, so that individual models would not need to track the state of the input queue. Hopefully, we can move the network surrogate from dragonfly-dally into model-net.

Reading data from `skipping_iterations_file` happens in two stages: first we determine how much data to load into memory, then we malloc the space and load the data. One extra row of data was being loaded, which overwrote a couple of bytes of some other structure. This occasionally caused a segfault (which only showed up when running the simulation in parallel).
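
A minimal sketch of the two-stage pattern, with the load loop bounded by the count from the first pass so that it can never write past the allocation. The names `struct skip_row`, `count_rows` and `load_rows` are hypothetical, and the four-field row layout (`int int int double`) is an assumption made for illustration; the actual CODES loader may differ.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical row layout, assumed for this sketch only. */
struct skip_row {
    int    job_id;
    int    skip_at_iter;
    int    resume_at_iter;
    double time_per_iter;
};

/* Stage 1: count the rows that parse completely. */
size_t count_rows(FILE *f)
{
    struct skip_row r;
    size_t n = 0;
    rewind(f);
    while (fscanf(f, "%d %d %d %lf", &r.job_id, &r.skip_at_iter,
                  &r.resume_at_iter, &r.time_per_iter) == 4)
        n++;
    return n;
}

/* Stage 2: allocate exactly n rows and stop storing once n rows are read. */
struct skip_row *load_rows(FILE *f, size_t *out_n)
{
    assert(f != NULL && out_n != NULL);
    size_t n = count_rows(f);
    struct skip_row *rows = malloc((n ? n : 1) * sizeof *rows);
    if (rows == NULL)
        return NULL;
    rewind(f);
    size_t i = 0;
    /* "i < n" is tested first, so rows[n] is never written. */
    while (i < n && fscanf(f, "%d %d %d %lf", &rows[i].job_id,
                           &rows[i].skip_at_iter, &rows[i].resume_at_iter,
                           &rows[i].time_per_iter) == 4)
        i++;
    *out_n = i;
    return rows;
}
```

An overrun of this kind writes into whatever happens to sit next to the allocation, which is consistent with the segfault appearing only occasionally; tools such as Valgrind tend to flag it more reliably than a crash does.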

…ke a different name

The idea of this change is to be able to have a configuration file like:

```
20 milc1 1 0
15 conceptual-jacobi3d-5 1 0
```

while workload_json_files lets us tell CODES where to look for the JSON configuration files:

```
milc1 path-to/milc1.json
conceptual-jacobi3d-5 path-to/my-conceptual-jacobi3d.json
```

Replaced fscanf loop with fgets/sscanf to handle trailing newlines consistently across systems (this bug was silently showing up in the GHC200 system). Also added error reporting for malformed lines. btw, this code was written by Claude and audited by me ;)
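
A rough sketch of the fgets/sscanf pattern described above. The commit does not say which file it parses, so the four-field row (`int int int double`) and the names `struct skip_row`, `is_blank` and `parse_rows` are assumptions made for illustration only.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical row layout, assumed for this sketch only. */
struct skip_row {
    int    job_id;
    int    skip_at_iter;
    int    resume_at_iter;
    double time_per_iter;
};

/* Returns 1 if the line holds nothing but whitespace. */
int is_blank(const char *s)
{
    for (; *s != '\0'; s++)
        if (!isspace((unsigned char)*s))
            return 0;
    return 1;
}

/* Reads one row per line. Returns the number of rows stored, or -1 after
 * reporting the first malformed line on stderr.
 * Lines longer than the buffer are not handled in this sketch. */
int parse_rows(FILE *f, struct skip_row *rows, int max_rows)
{
    char line[256];
    int n = 0, lineno = 0;

    assert(f != NULL && rows != NULL);
    while (fgets(line, sizeof line, f) != NULL) {
        lineno++;
        if (is_blank(line))
            continue;               /* trailing newlines and empty lines */

        struct skip_row r;
        char extra;                 /* catches junk after the fourth field */
        int got = sscanf(line, "%d %d %d %lf %c", &r.job_id, &r.skip_at_iter,
                         &r.resume_at_iter, &r.time_per_iter, &extra);
        if (got != 4 || n == max_rows) {
            line[strcspn(line, "\r\n")] = '\0';
            fprintf(stderr, "line %d: unexpected input: '%s'\n", lineno, line);
            return -1;
        }
        rows[n++] = r;
    }
    return n;
}
```

Because each line is read whole before being parsed, a stray blank line or a `\r\n` ending cannot shift the field alignment of the rows that follow, which is the usual failure mode of a bare `fscanf` loop.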

This merge brings three major changes:

- The hardening of the reverse handlers and thus the removal of all non-determinism
- The full implementation of an application director for mpi-replay, so that simulations can be accelerated
- The connection of the old network surrogate to the application surrogate

sanjaychari · 2026-05-21T18:09:38Z

cc @caitlinross @kevinabrown @carothersc

sanjaychari · 2026-05-21T19:07:59Z

I ran a sequential simulation with a dummy PyTorch checkpoint file and this code works for sequential simulation. Conservative and optimistic simulation have some issues with GVT consistency but that might be solved by an accurate ML model, or in a separate pull request.

sanjaychari · 2026-05-21T20:47:31Z

The GVT consistency issues with optimistic mode were happening because of network_treatment_on_switch being set to "freeze" in the CODES conf file. Events from the ML model were scheduled to arrive before GVT and were sent without any delay when received by the PDES simulation after the switch, and ROSS reported these as stragglers.

Changing network_treatment_on_switch to "nothing" fixes the issue.

caitlinross · 2026-05-22T18:49:20Z

i'm good with merging this

helq added 30 commits February 26, 2024 18:42

Refactoring director function to generalize (first step to define API…

6a67816

… for director function)

Hardcoded example skipping iterations for TWO applications (MILC and …

9b32a71

…Jacobi)

Improving figure generation script

42f7cd5

Injecting iteration time as an argument

c589d49

Merge remote-tracking branch 'origin/kronos-develop' into kronos-develop

54936f2

Updating code after ROSS change on gvt hook

1df7bb7

Removing hardcoded test and we can pass a config file now

472cc5a

The configuration file should be of the form:

> %d %d %d %f

where each value corresponds to

> job_id skip_at_iter resume_at_iter time_per_iter

The configuration file is passed through the --skipping-iterations-file parameter.

Allowing to run without skipping configuration file

2711b6b

Saving apps iteration logs into single files per PE

1412a4e

Guaranteeing that "workload period" config works in parallel

bb5b369

Changing time in period file to double (from long)

a4e052a

Stdout for surrogate only from PE 0

795628d

Implementing custom LP status printing for model-net-lps

a7121ec

Fixing small bug found when rollbacking model-net-event

ca30320

Cleaning up some structs and fixing a reverse handler case

c2afcd1

Refactoring struct in model-net-mpi-replay

c4c1491

The struct nw_message was messy. It kept getting longer as more and more values were stored in it for later use during rollback. Now it is more manageable and uses less memory than before.
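
The commit message does not say how the struct was reorganized. One common way to get both effects, shown here purely as an illustration, is to group the rollback values by event type inside a union, so that each event only pays for the fields its own reverse handler needs. All type and field names below are hypothetical and are not taken from `nw_message`.

```c
#include <assert.h>
#include <stddef.h>

enum ev_kind { EV_SEND, EV_RECV, EV_WAIT };

/* Flat layout: every event carries every handler's rollback fields. */
struct msg_flat {
    enum ev_kind  kind;
    double        saved_send_time;
    unsigned long saved_bytes_sent;
    double        saved_recv_time;
    unsigned long saved_bytes_recvd;
    double        saved_wait_time;
    int           saved_wait_count;
};

/* Grouped layout: only one handler's fields are live per event,
 * so they can share storage. */
struct msg_grouped {
    enum ev_kind kind;
    union {
        struct { double saved_time; unsigned long saved_bytes; } send;
        struct { double saved_time; unsigned long saved_bytes; } recv;
        struct { double saved_time; int saved_count; }           wait;
    } rc;
};

/* Forward handler side: stash what the reverse handler will restore. */
void save_send_state(struct msg_grouped *m, double t, unsigned long bytes)
{
    assert(m != NULL);
    m->kind = EV_SEND;
    m->rc.send.saved_time  = t;
    m->rc.send.saved_bytes = bytes;
}
```

Besides the smaller footprint, the grouping documents which saved values belong to which handler, which makes it harder to read a stale field in the wrong reverse handler.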

Print function for struct codes_workload_op and enum codes_workload_o…

9a5bf98

…p_type

Implementing deep copy/check/print for LP state: nw_state

9da3d36

Fixing minor reversibility bugs in LP type nw_state

6e97889

Adding checkpointer functionality to model-net sub-models

a3e638e

Moving implementation of linked list equality to quicklist.h

e430fea

Fixing some potential memory errors (from Valgrind)

8b95a70

Extending implementation of model-net checkpointer

c9729d8

Implementing FCFS checkpointer

7bc29c2

Removing never used struct param entry_time

d48898a

Printing lp states and events with a prefix (prettier printing)

fab09e8

helq and others added 21 commits June 18, 2025 04:43

Refactoring routers usage of custom double linked-list for qlist

3d1b55c

Allowing director to be called after simulation ended, to repopulate …

274f020

…network if needed

Updating README and compile instructions

ba7b826

Small print changes

07a4002

Allowing conc-online to load json files from config path

8be98f9

Extending iterator predictor to predict when to restart the simulation

64c6cce

Making post_init_share_ending_iteration intent clearer

b992e4a

Fixing some errors found with valgrind

3295653

Updating CODES-compile-instructions.sh

73cdbd5

Saving to file when an iteration has been skipped by the surrogate

667dc28

Updating compilation instructions

789c469

Max iteration per app should be computed across all MPI ranks

45453ad

Updating compilation instructions

242707e

Removing support for Autoconf

34275e3

Autoconf is now far too outdated, and keeping it in sync with the changes made in the CMakefile is no longer practical.

Adding some of Neil's and Elkin's contributions from the past 5 years

3d2b726

Updating compilation script

ed9edf5

Merge pull request codes-org#239 from codes-org/develop

4a055c2

New stable CODES version

Fix zmq and ROSS compilation issues

39fbc4f

The kronos-develop-director-b branch of CODES was using an outdated version of ROSS and also had compilation issues because of zeromq. This commit changes it to be compatible with the master branch of ROSS and fixes the zeromq compilation issues.

sanjaychari changed the title ~~Fix zmq and ROSS compilation issues~~ [WIP] Fix zmq and ROSS compilation issues May 21, 2026

Fix torch-jit compilation

0651b5e

Compilation with torch-jit was not occurring even with torch_enable set to 1. This commit fixes torch-jit compilation with GPU support.

Allow cpu-based PyTorch usage

01a2b16

sanjaychari changed the title ~~[WIP] Fix zmq and ROSS compilation issues~~ Fix zmq and ROSS compilation issues May 21, 2026

Master branch compatibility and ML models

63693c0

This commit makes this branch compatible with the master branch and introduces ML modelling code to be used with the director.

sanjaychari wants to merge 116 commits into codes-org:kronos-develop-director-b from sanjaychari:kronos-develop-director-b
