Fix zmq and ROSS compilation issues #240

Open
sanjaychari wants to merge 116 commits into codes-org:kronos-develop-director-b from sanjaychari:kronos-develop-director-b

@sanjaychari sanjaychari commented May 21, 2026

The kronos-develop-director-b branch of CODES was using an outdated version of ROSS and also had compilation issues because of zeromq and CUDA. This PR changes it to be compatible with the master branch of ROSS and fixes the zeromq and CUDA compilation issues.

helq added 30 commits February 26, 2024 18:42
The function `jobmap_list_to_local` walked the entire list of IDs until
it found a matching ID, which is O(n) in the average and worst cases. On
small networks this costs little, so it never surfaced as an issue. On
larger network simulations, at 8K nodes, it caused a significant
slowdown. Extensive profiling identified this function as the principal
culprit.

The fix is simple: build a table where looking up an ID is O(1). A
plain array does the trick. In experiments, this gave a speedup of 30%
on a network of 8448 nodes with a job using all nodes. The job was
uniform random and the simulation was run for 10ms (virtual time).
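
A minimal sketch of the array-backed lookup described above, assuming global IDs are small non-negative integers. The names `id_lookup`, `id_lookup_init`, `id_lookup_to_local`, and `id_lookup_free` are hypothetical and are not the CODES jobmap API; this only illustrates trading a linear scan for a precomputed table.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a job's ID mapping; not the CODES struct. */
struct id_lookup {
    int *local_of_global; /* local_of_global[g] is the local ID, or -1 */
    int  num_global;      /* number of valid global IDs: 0..num_global-1 */
};

/* Build the table once in O(num_global + num_ids). Returns 0 on success. */
static int id_lookup_init(struct id_lookup *t, const int *global_ids,
                          int num_ids, int num_global)
{
    assert(t != NULL && global_ids != NULL && num_global > 0);
    t->local_of_global = malloc((size_t)num_global * sizeof(int));
    if (t->local_of_global == NULL)
        return -1;
    t->num_global = num_global;
    for (int g = 0; g < num_global; g++)
        t->local_of_global[g] = -1; /* -1 marks "not part of this job" */
    for (int i = 0; i < num_ids; i++) {
        int g = global_ids[i];
        if (g < 0 || g >= num_global) { /* reject out-of-range IDs */
            free(t->local_of_global);
            t->local_of_global = NULL;
            return -1;
        }
        t->local_of_global[g] = i;
    }
    return 0;
}

/* O(1) replacement for the linear scan: one bounds check, one array read. */
static int id_lookup_to_local(const struct id_lookup *t, int global_id)
{
    if (global_id < 0 || global_id >= t->num_global)
        return -1;
    return t->local_of_global[global_id];
}

static void id_lookup_free(struct id_lookup *t)
{
    free(t->local_of_global);
    t->local_of_global = NULL;
    t->num_global = 0;
}
```

The table costs one `int` per node in the network, which is the usual price for constant-time lookups; for sparse ID spaces a hash table would be the alternative.
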
This bug was introduced when building the network surrogate. To build
the surrogate, we need to track the input queue "size" (the input
message queue to the routers from the workloads).

The network surrogate should not live inside specific network models
(it is currently implemented only in dragonfly-dally). It should
reside within the model-net layer, so that individual models
do not need to track the state of the input queue.

Hopefully, we can move the network surrogate from dragonfly-dally into
model-net.
The configuration file should be of the form:
> %d %d %d %f
where each value corresponds to
> job_id skip_at_iter resume_at_iter time_per_iter

The configuration file is passed through the --skipping-iterations-file
parameter.
Reading data from `skipping_iterations_file` happens in two stages:
first we find how much data to load into memory, then we malloc the
space and load the data. One extra row of data was being loaded, which
overwrote a couple of bytes of some other structure. This occasionally
caused a segfault (which only showed up when running the simulation
in parallel).
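
The two-stage read can be sketched as follows, using the `job_id skip_at_iter resume_at_iter time_per_iter` row format given above. This is an illustration under assumptions, not the CODES code: `skip_row`, `read_row`, and `load_skip_rows` are hypothetical names, and the stream is assumed to be seekable. The point is that the second pass is bounded by the count from the first pass, so it cannot write past the allocation.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical row type mirroring the "%d %d %d %f" file format. */
struct skip_row {
    int    job_id;
    int    skip_at_iter;
    int    resume_at_iter;
    double time_per_iter;
};

/* Reads one row; returns 1 only if all four fields were converted. */
static int read_row(FILE *f, struct skip_row *r)
{
    return fscanf(f, "%d %d %d %lf", &r->job_id, &r->skip_at_iter,
                  &r->resume_at_iter, &r->time_per_iter) == 4;
}

/* Returns the number of rows loaded into *out, or -1 on error. */
static long load_skip_rows(FILE *f, struct skip_row **out)
{
    struct skip_row scratch;
    long n = 0;

    assert(f != NULL && out != NULL);

    /* Stage 1: count complete rows only. Checking the fscanf return
     * value (rather than feof) avoids counting a phantom row when the
     * file ends with a newline. */
    while (read_row(f, &scratch))
        n++;
    rewind(f);

    /* Stage 2: allocate exactly n rows (at least 1 so malloc(0) is avoided). */
    struct skip_row *rows = malloc((size_t)(n > 0 ? n : 1) * sizeof *rows);
    if (rows == NULL)
        return -1;

    /* The loop bound is n, never "until EOF", so no extra row is written. */
    for (long i = 0; i < n; i++) {
        if (!read_row(f, &rows[i])) {
            free(rows);
            return -1;
        }
    }
    *out = rows;
    return n;
}
```
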
The struct nw_message was messy. It kept getting longer as more and
more values were stored in it for later use in rollback. Now it is
more manageable and uses less memory than before.
helq and others added 21 commits June 18, 2025 04:43
…ke a different name

The idea of this change is to be able to have a configuration file like:

```
20 milc1 1 0
15 conceptual-jacobi3d-5 1 0
```

While the workload_json_files allow us to tell CODES where to look for
the json configuration files:

```
milc1 path-to/milc1.json
conceptual-jacobi3d-5 path-to/my-conceptual-jacobi3d.json
```
Replaced fscanf loop with fgets/sscanf to handle trailing newlines
consistently across systems (this bug was silently showing up in the
GHC200 system). Also added error reporting for malformed
lines.

btw, this code was written by Claude and audited by me ;)
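
A sketch of the fgets/sscanf pattern mentioned above, applied for illustration to a two-column `name path` file like the workload mapping shown earlier. Which file the commit actually touches is not stated here, so the format is an assumption, and `workload_entry` and `read_workload_map` are hypothetical names rather than CODES functions.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical entry type for a "<name> <path>" mapping line. */
struct workload_entry {
    char name[64];
    char path[256];
};

/* Reads up to cap entries. Returns the entry count, or -1 on a malformed
 * line. Lines longer than the buffer are not handled in this sketch. */
static int read_workload_map(FILE *f, struct workload_entry *out, int cap)
{
    char line[512];
    int count = 0;
    int lineno = 0;

    assert(f != NULL && out != NULL && cap > 0);

    while (fgets(line, sizeof line, f) != NULL) {
        lineno++;
        line[strcspn(line, "\r\n")] = '\0';   /* strip LF or CRLF */
        if (line[strspn(line, " \t")] == '\0')
            continue;                         /* blank or trailing empty line */
        if (count == cap) {
            fprintf(stderr, "line %d: more than %d entries\n", lineno, cap);
            return -1;
        }
        char extra;
        /* The trailing " %c" only converts if a third token is present,
         * so a return value other than 2 means too few or too many fields. */
        if (sscanf(line, "%63s %255s %c", out[count].name,
                   out[count].path, &extra) != 2) {
            fprintf(stderr, "line %d: expected '<name> <path>', got: %s\n",
                    lineno, line);
            return -1;
        }
        count++;
    }
    return count;
}
```

Reading a whole line first and then parsing it keeps the end-of-file handling independent of how many newlines terminate the file, and gives a natural place to report the offending line number.
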
This merge brings three major changes:
- The hardening of the reverse handlers and thus the removal of all
  non-determinism
- The full implementation of an application director for mpi-replay, so
  that simulations can be accelerated
- The connection of the old network surrogate to the application
  surrogate
Autoconf is now far too outdated and keeping it in sync with the
changes made in the CMakefile
The kronos-develop-director-b branch of CODES
was using an outdated version of ROSS and also
had compilation issues because of zeromq. This
commit changes it to be compatible with the master
branch of ROSS and fixes the zeromq compilation
issues.
@sanjaychari sanjaychari changed the title from "Fix zmq and ROSS compilation issues" to "[WIP] Fix zmq and ROSS compilation issues" May 21, 2026
Compilation with torch-jit was not occurring even with torch_enable set to 1.
This commit fixes torch-jit compilation with GPU support.
@sanjaychari
Author

sanjaychari commented May 21, 2026

@sanjaychari sanjaychari changed the title from "[WIP] Fix zmq and ROSS compilation issues" to "Fix zmq and ROSS compilation issues" May 21, 2026
@sanjaychari
Author

I ran a sequential simulation with a dummy PyTorch checkpoint file, and this code works in sequential mode. Conservative and optimistic simulations have some issues with GVT consistency, but those might be solved by an accurate ML model or in a separate pull request.

@sanjaychari
Author

sanjaychari commented May 21, 2026

The GVT consistency issues in optimistic mode happened because `network_treatment_on_switch` was set to "freeze" in the CODES conf file. Events from the ML model were scheduled to arrive before GVT and, after the switch, were sent without any delay when received by the PDES simulation, so ROSS reported them as stragglers.

Changing `network_treatment_on_switch` to "nothing" fixes the issue.

This commit makes this branch compatible with the master branch
and introduces ML modelling code to be used with the director.
@caitlinross
Member

i'm good with merging this

3 participants