[feat] Aggregate polling for retrieving job pending reason #3642
vkarak wants to merge 3 commits into reframe-hpc:develop
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #3642     +/-   ##
===========================================
+ Coverage    91.69%   91.87%   +0.18%
===========================================
  Files           62       62
  Lines        13745    13755     +10
===========================================
+ Hits         12603    12638     +35
+ Misses        1142     1117     -25
61c38ae to 45c5613
If a job associated to a test is in pending state with one of the reasons listed here, ReFrame will cancel the job.

This option is relevant for the Slurm backends only.
Suggested change:
- This option is relevant for the Slurm backends only.
+ This option is relevant for the Slurm backend only.
I would keep the plural here, because there are two Slurm backends: slurm and squeue.
self._cancel_reasons = ['FrontEndDown',
                        'Licenses',  # May require sysadmin
                        'NodeDown',
                        'PartitionDown',
                        'PartitionInactive',
                        'PartitionNodeLimit',
                        'QOSJobLimit',
                        'QOSResourceLimit',
                        'QOSUsageThreshold']
Are we dropping these default cancel reasons? Should they be made a part of slurm_job_cancel_reasons?
I thought about it, but this list is not very accurate, and maintaining it would mean dealing with detailed Slurm semantics. For example, the Licenses reason is not really a cancel reason. Previously, we were poking Slurm to check whether the nodes reported in the reason were indeed unavailable, which was not very efficient. That's why I have only included the ReqNodeNotAvail reason as a cancel reason, which it indeed is, and leave any additional reasons to the user's discretion.
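To illustrate the behaviour described in this reply, here is a minimal sketch of the check: only ReqNodeNotAvail cancels by default, and anything else has to be listed by the user. The names `DEFAULT_CANCEL_REASONS` and `should_cancel` are hypothetical and are not taken from the PR. The sketch also assumes that Slurm may append details after a comma in the reason string, so it compares only the leading token.

```python
# Hypothetical helper for illustration; not ReFrame's actual implementation.
DEFAULT_CANCEL_REASONS = ['ReqNodeNotAvail']


def should_cancel(reason, extra_reasons=()):
    '''Return True if a job pending with `reason` should be cancelled.

    `extra_reasons` stands in for the user-supplied list
    (slurm_job_cancel_reasons in the PR description).
    '''
    # Assumption: the reason may carry details after a comma, e.g.
    # 'ReqNodeNotAvail, UnavailableNodes:nid0001'; keep the leading token only.
    name = reason.split(',', 1)[0].strip()
    return name in set(DEFAULT_CANCEL_REASONS) | set(extra_reasons)
```

With this sketch, `should_cancel('Licenses')` is false unless the user explicitly passes `'Licenses'` in `extra_reasons`.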
Co-authored-by: Jack Morrison <jack.morrison@cornelisnetworks.com>
Signed-off-by: Vasileios Karakasis <vkarak@gmail.com>
b2c3e02 to 8881baf
This PR improves how jobs are polled for their pending reason.
This is now done with a single command for all pending jobs. Two knobs are also exposed to users as configuration options and environment variables:
- slurm_job_cancel_reasons: A list of pending reasons that ReFrame will check for; a job pending with one of these reasons is cancelled proactively.
- slurm_pending_job_reason_poll_freq: Controls how frequently pending jobs are polled for their pending reasons (valid only for the slurm backend).

Closes #3640.
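The aggregate polling idea can be sketched as follows: build one `squeue` invocation covering all pending job IDs and parse its output into a job-to-reason map. This is only an illustration under stated assumptions, not the code in this PR. The helper names `build_squeue_cmd` and `parse_reasons` are hypothetical, and the sketch assumes `squeue` is queried with the `%i|%r` output format (job ID and reason).

```python
def build_squeue_cmd(job_ids):
    '''Build a single squeue command for all the given jobs.'''
    # -h: suppress the header, -j: comma-separated job IDs,
    # -o '%i|%r': print "<jobid>|<reason>" per job
    ids = ','.join(str(j) for j in job_ids)
    return ['squeue', '-h', '-j', ids, '-o', '%i|%r']


def parse_reasons(output):
    '''Map each job ID to its pending reason from the squeue output.'''
    reasons = {}
    for line in output.splitlines():
        if not line.strip():
            continue

        # Split on the first '|' only; the reason itself may contain commas
        jobid, _, reason = line.partition('|')
        reasons[jobid.strip()] = reason.strip()

    return reasons
```

The command list can be passed to `subprocess.run(..., capture_output=True, text=True)` and its `stdout` fed to `parse_reasons`, so that one process launch serves every pending job instead of one launch per job.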