Skip to content

Add reason to cancelation of a job#1074

Open
Dym03 wants to merge 1 commit intoIt4innovations:mainfrom
Dym03:cancel_with_reason
Open

Add reason to cancelation of a job#1074
Dym03 wants to merge 1 commit intoIt4innovations:mainfrom
Dym03:cancel_with_reason

Conversation

@Dym03
Copy link

@Dym03 Dym03 commented Mar 9, 2026

Adds a optional argument to hq job cancel <ID_Selector> --reason <String> requested in #1001.
Currently set to max 32 chars in the cli output mode.
With the change of the representation of the Job state, the json output has been also slightly changed to not just outputs if the Job is opened, but the state of the Job. I maybe find that the Job could have all of the states represented in the newly added JobState, and not transfered through the Status enum. But it is true the state might change quite often.

Preview of current state of the hq job list --all
image

@spirali
Copy link
Collaborator

spirali commented Mar 10, 2026

Thanks for the contribution.

I have the following comments.

Job is a kind of namespace for tasks. Persistent states such as WAITING/RUNNING/CANCELLED are states of tasks, not jobs. Jobs have only an inferred state (WAITING/RUNNING/CANCELLED) depending on the state of tasks within the job. Your change disrupts this, so now you can have a Job in a canceled state (your newly introduced state) while there are no canceled tasks within it. This is something unwanted. I suggest the following refactoring:

  • Leave is_open as a bool flag. Openness of a job is independent of the inferred state of the job. I may have an open job that has cancelled tasks.
  • Add into Job something like: cancel_message: Option<String> and set it when a job is canceled with a message.
  • Do not show "Cancellation reason" in the list of all jobs, but in the job detail, e.g. "hq job info 1", and only when something is actually filled in.
  • The length limit on the message is good, but make it more generous, e.g. 200 chars, and enforce it when the message is entered. So hq job cancel --reason='VERY LOOOOOONG....' should be an error.
  • Maybe rename --reason to --message (@Kobzol some bikeshedding opinion?)

Additionally:

  • Update the cancellation documentation in docs/jobs/jobs.md.
  • Update CHANGELOG.md.
  • Add a test in tests/test_job.py; you can get inspiration from tests like test_cancel_*.

Just thinking:

My original idea was to put the message into the task state (not the job state), because technically we can cancel different tasks for different reasons. But actually users usually have just a single reason to cancel a job, so it seems fine to have just one message for the whole job.

@Dym03
Copy link
Author

Dym03 commented Mar 10, 2026

Okey, will do :).
Yea the current state is inspired by the issue #1001, where someone mentioned, they would like to see the reason in the list -all, so thats why it is like this :). Okey, i thought if the Job is cancelled then it is it's final state, thats why i made it like that. But no problem with changing that, to keep the original idea :).
Also about the thought, as it is job cancel, i thought this should be connected to a job. Correct me if i am wrong, but can I cancel job more than once ? Also if a job can be composed of thousands of tasks, then i think to add a reason to each one of those tasks would not be necessary.
Thanks for the review and tips, will take a look at it.

@Kobzol
Copy link
Member

Kobzol commented Mar 10, 2026

I personally like --reason more than --message. I'd say that attaching the cancellation reason to the job itself is enough for a first functionality. If individual tasks are cancelled, there will be no message, only if the whole job is cancelled.

If a job is cancelled multiple times (this doesn't really make sense, but someone could do it in theory), we can either error out or overwrite any previous cancel reason.

@spirali
Copy link
Collaborator

spirali commented Mar 10, 2026

It is tricky because the cancellation predates open jobs. But in the current version you can do this:

$ hq job open
$ hq submit --job <JOB_ID> -- sleep 0
$ hq job cancel <JOB_ID>
$ hq job submit --job <JOB_ID> -- sleep 0
$ hq job cancel <JOB_ID>

So we create an open job, submit something, cancel the job (=cancel all current tasks within the task) but job is still open, submit new tasks into the same job. And we can cancel these new jobs again.

Open jobs makes all these things more complex, on the other hand, as HQ is optimized for having rather small number of jobs (and many tasks), we do not want encourage user to create a new job when they needs "just" canceling the tasks.

In the mentioned case with multiple cancellation, I am totally ok with overwriting the old message.

@spirali
Copy link
Collaborator

spirali commented Mar 10, 2026

Side note: I just realized that it also needs some updates in the journal, to not lost the message in case of server crash. We can postpone it to an additional PR, since it is nontrivial as current cancellation event tracks individual tasks, not whole jobs.

@Dym03
Copy link
Author

Dym03 commented Mar 10, 2026

Okey, makes more sense now what the open really means :).

I think the option of overwriting might be confusing at times, but as you mentioned, it doesn't happens that often so I think if it will be documented then it is not that big of a deal.

Thanks for the feedback

About the journaling, I can try to take a look at that while at it. But I will see how complex it will be :).

@spirali
Copy link
Collaborator

spirali commented Mar 10, 2026

I personally like --reason more than --message.

Ok. I do not have strong opinion on this.

@Dym03
Copy link
Author

Dym03 commented Mar 10, 2026

One last question/idea before i jump into it. Maybe there could be like a param to the $ hq job list --all --verbose that would also include the reason (other additions, if there would be any in the future) in the list. I personally would rather list once and see the status with the reason, than to list -> get id -> detail the job. That doesn't mean the reason wouldn't be included in the detail as well :)

@spirali
Copy link
Collaborator

spirali commented Mar 10, 2026

I have no problem with showing it in hq job list as long it is not default. Flag --verbose seems reasonable.

@Dym03 Dym03 force-pushed the cancel_with_reason branch 2 times, most recently from 1165aa5 to 41e58c5 Compare March 10, 2026 14:22
@Dym03
Copy link
Author

Dym03 commented Mar 10, 2026

I have added the --verbose flag to the hq job list to show the cancel reason. There might be a good reason to enable this flag only when --all or --filter is added, because to my understanding, without them, the canceled jobs are not listed. But let me know if this is wanted or not. The flag is now only used for the --ouput-format cli. Because i think that the Json can hold that one more line. And in the quite mode, I currently ignore it, because it would be kinda contradictory to set the verbose flag on.

I have also updated the hq job info to show the cancellation reason under the state.

I have added 2 test suits to test the added functionality of the hq job cancel. During the setup of the test suits, i have come up to some difficulties with setting up the enviroment. So i updated the readme, to add the extra step of the maturin develop, that I was missing.

Documentation was also updated, to atleast mention the new additions. But let me know if this is enough :).

I haven't yet looked into the journaling of the cancelation but during the implementation, i have come across on_job_close event, that might be a good guide to how add the on_job_cancel event.

Also seems like some of the tests have been broken, by the added changes, I will try to look at it asap. If you have some recomendations about it, it would sure help :). While I was testing it locally, I also ran into some troubles with the tests, but some of them were timeouts, so I thought, I was missing some crutial part locally, like pbs or slurm. But seems like that might not be the case

@Dym03 Dym03 force-pushed the cancel_with_reason branch from 41e58c5 to 6ac79eb Compare March 10, 2026 14:57
@Dym03
Copy link
Author

Dym03 commented Mar 10, 2026

There were a small bug, due to the removal of is_open. That was fixed in this last push

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants