We recently upgraded from Slurm 19.05.8 to 20.11.3. In our configuration, we have a partition named 'interruptible' for long-running, low-priority jobs that use checkpoint/restart. Jobs that are preempted there are supposed to be killed and requeued rather than suspended. This configuration had been working without issue for 2+ years.

After the upgrade, this stopped working: preempted jobs are killed but not requeued. My slurm.conf file is configured to requeue preempted jobs:

$ grep -i requeue /etc/slurm/slurm.conf
#JobRequeue=1
PreemptMode=Requeue
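
Note that the grep above also matches commented-out lines (JobRequeue is commented out here). A minimal sketch of a narrower check follows, assuming a hypothetical helper named active_preempt_settings that prints only the uncommented requeue/preemption settings from a slurm.conf-style file. The scontrol commands in the comments show what the running controller reports; their output is not shown here.

```shell
# Hypothetical helper (not part of Slurm): print only the active, uncommented
# JobRequeue/Preempt* settings from a slurm.conf-style file.
active_preempt_settings() {
    grep -Ei '^[[:space:]]*(JobRequeue|Preempt[A-Za-z]*)[[:space:]]*=' "$1"
}

# On the cluster, the file can then be compared with the controller's view:
#   active_preempt_settings /etc/slurm/slurm.conf
#   scontrol show config | grep -Ei 'requeue|preempt'
#   scontrol show partition interruptible | grep -i preempt
```

If the file and the scontrol output disagree, the controller may not have picked up the current slurm.conf.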

The user's sbatch script also includes the --requeue option.

The user reports that the error output from his preempted jobs now says:

slurmstepd: error: *** STEP 1075117.0 ON greene002 CANCELLED AT 2021-02-25T16:07:48 ***

In the past it would say PREEMPTED instead of CANCELLED.

Any ideas what would cause this? I've reported this to Slurm support but haven't gotten anything back yet, so I figured I'd ask here, too. If this is a bug, I can't be the only one who has experienced it.

--
Prentice
