Fix zmq and ROSS compilation issues #240
Open
sanjaychari wants to merge 116 commits into
… for director function)
The function `jobmap_list_to_local` walked the list of IDs until it found a matching ID, which is O(n) in the average and worst cases. For small networks this takes little time, so it never surfaced as an issue. In larger network simulations, at around 8K nodes, it caused a significant slowdown, and extensive profiling identified this function as the principal culprit. The fix is simple: build a table in which looking up an ID is O(1). A plain array does the trick. In experiments, this gave a speedup of 30% for a network of 8448 nodes with a job using all nodes. The job was uniform random and the simulation was run for 10ms (virtual time).
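The change described above can be illustrated with a minimal sketch. It is not the CODES implementation: the struct `job_ids` and the functions `to_local_linear`, `build_local_table`, and `to_local_table` are hypothetical names, and the sketch assumes global IDs are small non-negative integers so that a dense array can be indexed by them directly.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical job description: global_ids[i] is the global ID of local rank i. */
struct job_ids {
    int        num_ranks;
    const int *global_ids;
};

/* Linear scan, as in the behaviour described above: O(n) per lookup. */
static int to_local_linear(const struct job_ids *job, int global_id)
{
    for (int i = 0; i < job->num_ranks; i++) {
        if (job->global_ids[i] == global_id)
            return i;
    }
    return -1; /* not part of this job */
}

/* Build a dense table indexed by global ID. Entries hold the local rank,
 * or -1 for IDs that do not belong to the job. The caller frees the table. */
static int *build_local_table(const struct job_ids *job, int num_global_ids)
{
    assert(num_global_ids > 0);
    int *table = malloc((size_t)num_global_ids * sizeof *table);
    if (table == NULL)
        return NULL;
    for (int g = 0; g < num_global_ids; g++)
        table[g] = -1;
    for (int i = 0; i < job->num_ranks; i++) {
        int g = job->global_ids[i];
        assert(g >= 0 && g < num_global_ids);
        table[g] = i;
    }
    return table;
}

/* Table lookup: O(1) per call, with a bounds check on the ID. */
static int to_local_table(const int *table, int num_global_ids, int global_id)
{
    if (global_id < 0 || global_id >= num_global_ids)
        return -1;
    return table[global_id];
}
```

The table costs one `int` per global ID and is built once, so the trade is a small amount of memory for constant-time lookups on a hot path.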
This bug was introduced when building the network surrogate. To build the surrogate, we need to track the input queue "size" (the input message queue to the routers from the workloads). The network surrogate currently lives inside a specific network model (so far it has been implemented only in dragonfly-dally). It should really reside within the model-net layer, so that individual models don't need to track the state of the input queue. Hopefully, we can move the network surrogate from dragonfly-dally into model-net.
The configuration file should be of the form:
> %d %d %d %f

where each value corresponds to
> job_id skip_at_iter resume_at_iter time_per_iter

The configuration file is passed through the --skipping-iterations-file parameter.
Reading data from `skipping_iterations_file` happens in two stages: first we find how much data to load into memory, then we malloc the space and load the data. One extra row of data was being loaded, which overwrote a couple of bytes of some other structure. This occasionally caused a segfault (which only showed up when running the simulation in parallel).
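A minimal sketch of the two-stage read described above, where the second pass is bounded by the row count from the first so it cannot write past the allocation. This is an illustration, not the actual CODES code: `struct skip_entry`, `read_entry`, and `load_skip_entries` are hypothetical names, the four fields are assumed to follow the `job_id skip_at_iter resume_at_iter time_per_iter` layout, and the last field is stored as a `double` (hence `%lf`).

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* One row of the file: job_id skip_at_iter resume_at_iter time_per_iter. */
struct skip_entry {
    int    job_id;
    int    skip_at_iter;
    int    resume_at_iter;
    double time_per_iter; /* stored as double in this sketch, so read with %lf */
};

/* Reads one row; returns 1 on success, 0 on EOF or an incomplete row. */
static int read_entry(FILE *fp, struct skip_entry *e)
{
    return fscanf(fp, "%d %d %d %lf", &e->job_id, &e->skip_at_iter,
                  &e->resume_at_iter, &e->time_per_iter) == 4;
}

/* Returns the number of rows loaded into *out (malloc'd, caller frees),
 * 0 if the file holds no complete row, or -1 on error. */
static long load_skip_entries(FILE *fp, struct skip_entry **out)
{
    struct skip_entry scratch;
    long count = 0;

    assert(fp != NULL && out != NULL);
    *out = NULL;

    /* Stage 1: count complete rows only. */
    while (read_entry(fp, &scratch))
        count++;
    if (count == 0)
        return 0;

    /* Stage 2: allocate exactly `count` rows and load at most that many. */
    struct skip_entry *rows = malloc((size_t)count * sizeof *rows);
    if (rows == NULL)
        return -1;
    rewind(fp);
    for (long i = 0; i < count; i++) {
        if (!read_entry(fp, &rows[i])) { /* file changed between passes */
            free(rows);
            return -1;
        }
    }
    *out = rows;
    return count;
}
```

Using the same `read_entry` helper in both passes keeps the counting rule and the loading rule identical, which is one way to avoid the two stages disagreeing by a row.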
The struct nw_message was messy. It kept getting longer as more and more values were stored in it for later use during rollback. Now it is more manageable and uses less memory than before.
…network if needed
…ke a different name
The idea of this change is to be able to have a configuration file like:
```
20 milc1 1 0
15 conceptual-jacobi3d-5 1 0
```
while workload_json_files lets us tell CODES where to look for the JSON configuration files:
```
milc1 path-to/milc1.json
conceptual-jacobi3d-5 path-to/my-conceptual-jacobi3d.json
```
Replaced fscanf loop with fgets/sscanf to handle trailing newlines consistently across systems (this bug was silently showing up in the GHC200 system). Also added error reporting for malformed lines. btw, this code was written by Claude and audited by me ;)
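For illustration, a line-oriented parser in the fgets/sscanf style might look like the sketch below. The commit does not say which file or format is involved, so the four-field `job_id skip_at_iter resume_at_iter time_per_iter` layout is an assumption, and `struct skip_row`, `parse_skip_line`, and `read_skip_rows` are hypothetical names rather than CODES functions.

```c
#include <stdio.h>
#include <string.h>

/* Assumed row layout: job_id skip_at_iter resume_at_iter time_per_iter. */
struct skip_row {
    int    job_id;
    int    skip_at_iter;
    int    resume_at_iter;
    double time_per_iter;
};

/* Parses one line. Returns 1 if a row was parsed, 0 if the line is blank
 * (including a lone "\n" or "\r\n"), and -1 if it is malformed. */
static int parse_skip_line(const char *line, struct skip_row *row)
{
    char extra;
    int n;

    if (line[strspn(line, " \t\r\n")] == '\0')
        return 0;
    /* The trailing " %c" only matches if non-whitespace text follows the
     * fourth field, so exactly 4 conversions means a clean line. */
    n = sscanf(line, "%d %d %d %lf %c", &row->job_id, &row->skip_at_iter,
               &row->resume_at_iter, &row->time_per_iter, &extra);
    return n == 4 ? 1 : -1;
}

/* Reads up to max_rows rows from fp, reporting malformed lines on stderr.
 * Returns the number of rows stored; *bad_lines counts rejected lines. */
static long read_skip_rows(FILE *fp, struct skip_row *rows, long max_rows,
                           long *bad_lines)
{
    char buf[256];
    long n = 0, lineno = 0;

    *bad_lines = 0;
    while (n < max_rows && fgets(buf, sizeof buf, fp) != NULL) {
        struct skip_row row;
        int rc;

        lineno++;
        buf[strcspn(buf, "\r\n")] = '\0'; /* drop the line ending */
        rc = parse_skip_line(buf, &row);
        if (rc == 1) {
            rows[n++] = row;
        } else if (rc < 0) {
            fprintf(stderr, "line %ld: malformed entry: '%s'\n", lineno, buf);
            (*bad_lines)++;
        }
    }
    return n;
}
```

Because each line is read whole before it is parsed, a trailing newline or blank last line yields a skipped line instead of a partially filled row, and a malformed line can be reported with its line number.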
This merge brings three major changes:
- The hardening of the reverse handlers and thus the removal of all non-determinism
- The full implementation of an application director for mpi-replay, so that simulations can be accelerated
- The connection of the old network surrogate to the application surrogate
Autoconf is now far too outdated and keeping it in sync with the changes made in the CMakefile
New stable CODES version
The kronos-develop-director-b branch of CODES was using an outdated version of ROSS and also had compilation issues because of zeromq. This commit changes it to be compatible with the master branch of ROSS and fixes the zeromq compilation issues.
Compilation with torch-jit was not occurring even with torch_enable set to 1. This commit fixes torch-jit compilation with GPU support.
Author
I ran a sequential simulation with a dummy PyTorch checkpoint file and this code works for sequential simulation. Conservative and optimistic simulation have some issues with GVT consistency but that might be solved by an accurate ML model, or in a separate pull request.
Author
The GVT consistency issues with optimistic mode were happening because of network_treatment_on_switch being set to "freeze" in the CODES conf file. Events from the ML model were scheduled to arrive before GVT and were sent without any delay when received by the PDES simulation after the switch, and ROSS reported these as stragglers. Changing network_treatment_on_switch to "nothing" fixes the issue.
This commit makes this branch compatible with the master branch and introduces ML modelling code to be used with the director.
Member
i'm good with merging this
The kronos-develop-director-b branch of CODES was using an outdated version of ROSS and also had compilation issues because of zeromq and CUDA. This PR changes it to be compatible with the master branch of ROSS and fixes the zeromq and CUDA compilation issues.