[feat] Aggregate polling for retrieving job pending reason#3642

Open
vkarak wants to merge 3 commits into reframe-hpc:develop from vkarak:bugfix/slurm-rpc-load

Contributor

@vkarak vkarak commented Mar 25, 2026

This PR improves how ReFrame polls pending jobs for their pending reason.

The pending reasons of all pending jobs are now retrieved with a single command. Two knobs are also exposed to users as configuration options and environment variables:

  1. slurm_job_cancel_reasons: A list of pending reasons; if a pending job reports one of them, ReFrame cancels the job proactively.
  2. slurm_pending_job_reason_poll_freq: Controls how frequently pending jobs are polled for their pending reasons (valid only for the Slurm backend).

Closes #3640.
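
As a rough illustration of the aggregate polling described above (not the code in this PR), the reasons of all pending jobs can be fetched with one `squeue` call instead of one call per job. The helper names below are hypothetical; the `squeue` options used (`-h`, `-j`, and `-o` with the `%i`, `%T` and `%r` fields) are standard Slurm options.

```python
import subprocess


def build_squeue_cmd(job_ids):
    """Build a single squeue invocation that covers all given job IDs."""
    # -h: no header line, -j: comma-separated job list, -o: output format
    # %i = job ID, %T = job state (long form), %r = reason
    return ['squeue', '-h', '-j', ','.join(job_ids), '-o', '%i|%T|%r']


def parse_pending_reasons(output):
    """Map job ID to pending reason from `%i|%T|%r` formatted lines."""
    reasons = {}
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue

        # The reason may itself contain separators, so split at most twice
        jobid, state, reason = line.split('|', maxsplit=2)
        if state == 'PENDING':
            reasons[jobid] = reason

    return reasons


def poll_pending_reasons(job_ids):
    """Query Slurm once for all jobs (hypothetical helper)."""
    if not job_ids:
        return {}

    proc = subprocess.run(build_squeue_cmd(job_ids),
                          capture_output=True, text=True, check=True)
    return parse_pending_reasons(proc.stdout)
```

With this shape, the number of `squeue` invocations per poll stays at one regardless of how many jobs are pending, which is the property relevant to the RPC rate limit mentioned in the linked issue.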

codecov Bot commented Mar 25, 2026

Codecov Report

❌ Patch coverage is 50.84746% with 29 lines in your changes missing coverage. Please review.
✅ Project coverage is 91.87%. Comparing base (4f021cf) to head (8881baf).

Files with missing lines               Patch %   Lines
reframe/core/schedulers/slurm.py       50.00%    27 Missing ⚠️
reframe/core/schedulers/__init__.py    33.33%    2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #3642      +/-   ##
===========================================
+ Coverage    91.69%   91.87%   +0.18%     
===========================================
  Files           62       62              
  Lines        13745    13755      +10     
===========================================
+ Hits         12603    12638      +35     
+ Misses        1142     1117      -25     

@vkarak vkarak force-pushed the bugfix/slurm-rpc-load branch from 61c38ae to 45c5613 Compare March 25, 2026 16:52
@jack-morrison jack-morrison self-requested a review April 21, 2026 13:48
Comment thread docs/config_reference.rst Outdated
Comment thread docs/config_reference.rst Outdated
Comment thread docs/config_reference.rst

If a job associated to a test is in pending state with one of the reasons listed here, ReFrame will cancel the job.

This option is relevant for the Slurm backends only.
Member

Suggested change
- This option is relevant for the Slurm backends only.
+ This option is relevant for the Slurm backend only.

Contributor Author

@vkarak vkarak Apr 24, 2026

I would keep the plural here, because there are two Slurm backends: slurm and squeue.

Comment on lines -128 to -136
self._cancel_reasons = ['FrontEndDown',
                        'Licenses',  # May require sysadmin
                        'NodeDown',
                        'PartitionDown',
                        'PartitionInactive',
                        'PartitionNodeLimit',
                        'QOSJobLimit',
                        'QOSResourceLimit',
                        'QOSUsageThreshold']
Member

Are we dropping these default cancel reasons? Should they be made a part of slurm_job_cancel_reasons?

Contributor Author

I thought about it, but this list is not very accurate, and maintaining it would require dealing with detailed Slurm semantics. For example, the Licenses reason is not really a cancel reason. Previously, we were querying Slurm to check whether the nodes reported in the reason were indeed unavailable, which was not very efficient. That's why I have only included the ReqNodeNotAvail reason as a cancel reason, which is indeed one, and leave any additional reasons to the user's discretion.
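
A minimal sketch of how a user-supplied list such as slurm_job_cancel_reasons could be applied to the polled reasons. The function name and the matching rule are assumptions for illustration, not ReFrame's implementation: only the part of the reason before the first comma is compared, since Slurm can append details such as `UnavailableNodes:...` to a reason.

```python
def jobs_to_cancel(pending_reasons, cancel_reasons):
    """Return the IDs of pending jobs whose reason is a cancel reason.

    `pending_reasons` maps job ID to the reason string reported by Slurm;
    `cancel_reasons` is the configured list of reasons (hypothetical input).
    """
    cancel_set = set(cancel_reasons)
    to_cancel = []
    for jobid, reason in pending_reasons.items():
        # Assumption: details after the first comma are not part of the
        # reason name, e.g. "ReqNodeNotAvail, UnavailableNodes:nid00001"
        name = reason.split(',', maxsplit=1)[0].strip()
        if name in cancel_set:
            to_cancel.append(jobid)

    return to_cancel


# Example: only ReqNodeNotAvail is treated as a cancel reason by default,
# and a user may opt in to further reasons such as Licenses.
reasons = {
    '101': 'Priority',
    '103': 'ReqNodeNotAvail, UnavailableNodes:nid00001',
    '104': 'Licenses',
}
default_cancel = jobs_to_cancel(reasons, ['ReqNodeNotAvail'])
extended_cancel = jobs_to_cancel(reasons, ['ReqNodeNotAvail', 'Licenses'])
```

Keeping the decision a pure function of the polled reasons and the configured list means no extra Slurm queries are needed to decide whether a job should be cancelled.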

Comment thread reframe/core/schedulers/slurm.py Outdated
@github-project-automation github-project-automation Bot moved this from Todo to In Progress in ReFrame Backlog Apr 21, 2026
vkarak and others added 3 commits April 24, 2026 22:37
Co-authored-by: Jack Morrison <jack.morrison@cornelisnetworks.com>
Signed-off-by: Vasileios Karakasis <vkarak@gmail.com>
@vkarak vkarak force-pushed the bugfix/slurm-rpc-load branch from b2c3e02 to 8881baf Compare April 24, 2026 20:41

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

Multiple squeue's in _cancel_if_blocked in reframe/core/schedulers/slurm.py are hitting slurm's RPC rate limit

2 participants