Hi Mike,
What version of Slurm are you using?
If you are running Slurm 20.11.x or newer, the scheduler behavior was changed
so that, by default, srun no longer allows job steps to overlap resources.
https://bugs.schedmd.com/show_bug.cgi?id=11863#c3
I would see if adding the --overlap option to the srun lines for your job
steps helps.
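For example, a rough sketch (the script contents and step commands here are
placeholders, not your actual job):

    #!/bin/bash
    #SBATCH --ntasks=4

    # Without --overlap, a background step keeps its CPUs to itself and a
    # second step can end up waiting for resources on 20.11+.
    srun --ntasks=1 --overlap ./monitor.sh &

    # With --overlap, this step may share the allocation with the one above.
    srun --ntasks=4 --overlap ./main_work
    wait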
Alexander Grund wrote:
> Although it may be better to not drain it, I'm a bit nervous with "exit
> 0" as it is very important that the job does not start/continue, i.e.
> the user code (sbatch script/srun) is never executed in that case.
> So I want to be sure that an `scancel` on the job in its
I have a user who is submitting a job to slurm which requests 16 tasks, i.e.
#SBATCH --ntasks 16
#SBATCH --cpus-per-task 1
The Slurm script runs an MPI program called Parent.mpi, which then (fails to)
spawn 15 MPI child processes. He's tried two different ways for the parent to
spawn the children.
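For context, a rough sketch of the kind of submission script being described;
apart from the Parent.mpi name, everything here is assumed rather than taken
from the actual script:

    #!/bin/bash
    #SBATCH --ntasks 16
    #SBATCH --cpus-per-task 1

    # Launch only the parent rank; the remaining 15 task slots in the
    # allocation are meant to be filled by the children the parent spawns
    # itself (e.g. via MPI_Comm_spawn).
    srun --ntasks=1 ./Parent.mpi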
Dear all,
I am writing a SPANK plugin to propagate several environment variables
to `Prolog` and `Epilog` scripts but, so far, we have only managed to do
so successfully for the former.
In our SPANK plugin we capture some information from the user using
additional CLI flags and then export them.
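For illustration (not our exact plugin code, and assuming I've understood the
spank(8) mechanism correctly): the intended flow is that the plugin pushes each
captured value into the job control environment with spank_job_control_setenv(),
and the Prolog/Epilog then read it back with the SPANK_ prefix that slurmd
adds. The variable name below is made up:

    #!/bin/bash
    # Prolog/Epilog on the compute node: values exported by the plugin via
    # spank_job_control_setenv() should appear here prefixed with SPANK_.
    if [ -n "$SPANK_MYPLUGIN_PROJECT" ]; then
        logger "prolog: project tag for job $SLURM_JOB_ID is $SPANK_MYPLUGIN_PROJECT"
    fi
    exit 0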
On 19.06.23 at 17:32, Gerhard Strangar wrote:
> Try to exit with 0, because it's not your prolog that failed.
That seemingly works.
I do see value in exiting with 1 to drain the node, so that we can investigate
what exactly has failed.
Although it may be better to not drain it, I'm a bit nervous with "exit 0" as
it is very important that the job does not start/continue, i.e. the user code
(sbatch script/srun) is never executed in that case.
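As a rough illustration of the trade-off (the health-check command is a
placeholder):

    #!/bin/bash
    # Node Prolog: decide how a failed check is reported to slurmd.
    if ! /usr/local/sbin/node_healthcheck; then
        logger "prolog: health check failed on $(hostname)"
        # Non-zero exit: slurmd drains the node and the job does not start
        # here, so the failure can be investigated.
        exit 1
    fi
    # Zero exit: the job starts normally.
    exit 0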
For the benefit of anyone else who comes across this, I've managed to resolve
the issue.
1. Remove the affected node entries from the slurm.conf on slurmctld host
2. Restart slurmctld
3. Re-add the nodes back to slurm.conf on slurmctld host
4. Restart slurmctld again
Following this, the nodes came back cleanly and the issue was resolved.
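Roughly, in command form (assuming a systemd-managed slurmctld and the default
/etc/slurm/slurm.conf path; the node name is a placeholder):

    # 1 + 2: drop the affected NodeName entries, then restart the controller
    sudo vi /etc/slurm/slurm.conf          # remove the NodeName=node001 ... lines
    sudo systemctl restart slurmctld

    # 3 + 4: put the same entries back and restart again
    sudo vi /etc/slurm/slurm.conf          # re-add the NodeName=node001 ... lines
    sudo systemctl restart slurmctld

    # then check the node state
    scontrol show node node001
    sinfo -n node001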