Re: [slurm-users] running mpi from inside an mpi job

2023-06-20 Thread David Schanzenbach
Hi Mike, What version of Slurm are you using? If you are running Slurm 20.11.x or newer, a change in scheduler behavior means that, by default, srun will not allow job steps to overlap resources. https://bugs.schedmd.com/show_bug.cgi?id=11863#c3 I would see if a
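A minimal sketch of what this suggestion could look like in the job script (assumptions: `Parent.mpi` is the binary named in the original question further down this page, and `--overlap` is the srun option tied to the 20.11 behavior change referenced above):

```bash
#!/bin/bash
#SBATCH --ntasks=16
#SBATCH --cpus-per-task=1

# Launch the parent on one task; --overlap lets job steps started from
# inside the allocation share CPUs already held by a running step, which
# Slurm 20.11+ no longer allows by default.
srun --ntasks=1 --overlap ./Parent.mpi
```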

Re: [slurm-users] Aborting a job from inside the prolog

2023-06-20 Thread Gerhard Strangar
Alexander Grund wrote: > Although it may be better to not drain it, I'm a bit nervous with "exit 0" as it is very important that the job does not start/continue, i.e. the user code (sbatch script/srun) is never executed in that case. So I want to be sure that an `scancel` on the job in its
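A hypothetical Prolog sketch of the pattern being discussed (the `JOB_IS_ALLOWED` check is a placeholder, and whether `scancel` from inside the prolog reliably stops the job before any user code runs is exactly the open question in this thread):

```bash
#!/bin/bash
# Placeholder for whatever site-specific check the real prolog performs.
JOB_IS_ALLOWED=yes

if [ "$JOB_IS_ALLOWED" != "yes" ]; then
    # Cancel the job explicitly ...
    scancel "$SLURM_JOB_ID"
    # ... and exit 0 so slurmd does not treat the prolog itself as failed
    # (a non-zero exit would drain the node instead).
    exit 0
fi

exit 0
```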

[slurm-users] running mpi from inside an mpi job

2023-06-20 Thread Vanhorn, Mike
I have a user who is submitting a job to slurm which requests 16 tasks, i.e. #SBATCH --ntasks 16 #SBATCH --cpus-per-task 1 The slurm script runs an mpi program called Parent.mpi, which then (fails to) spawn 15 mpi child processes. He’s tried two different ways for the parent to spawn the childre
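For reference, a sketch of the allocation described above (assuming the 16 tasks are meant to cover 1 parent plus the 15 spawned children, and that the parent is launched on a single task):

```bash
#!/bin/bash
#SBATCH --ntasks=16          # 1 parent + 15 children it tries to spawn
#SBATCH --cpus-per-task=1

# Start the parent on a single task; it is expected to spawn the
# remaining 15 child processes itself.
srun --ntasks=1 ./Parent.mpi
```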

[slurm-users] spank_job_control_setenv() doesn't seem to work for epilog scripts

2023-06-20 Thread Alberto Miranda
Dear all, I am writing a SPANK plugin to propagate several environment variables to `Prolog` and `Epilog` scripts but, so far, we have only managed to do so successfully for the former. In our SPANK plugin we capture some information from the user using additional CLI flags and then export t
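A small hypothetical Epilog sketch for debugging this kind of propagation problem (the variable name `SPANK_MYVAR` is purely a placeholder; my understanding is that variables set via `spank_job_control_setenv()` appear in the prolog/epilog environment with a `SPANK_` prefix, but check the spank man page for your version):

```bash
#!/bin/bash
# Log whether the variable the SPANK plugin tried to export is visible
# in the epilog environment. SPANK_MYVAR is a placeholder name.
if [ -n "${SPANK_MYVAR:-}" ]; then
    logger "epilog: SPANK_MYVAR='${SPANK_MYVAR}' (propagation worked)"
else
    logger "epilog: SPANK_MYVAR is not set (propagation did not reach the epilog)"
fi
```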

Re: [slurm-users] Aborting a job from inside the prolog

2023-06-20 Thread Alexander Grund
On 19.06.23 at 17:32, Gerhard Strangar wrote: Try to exit with 0, because it's not your prolog that failed. That seemingly works. I do see value in exiting with 1 to drain the node, to investigate what exactly failed. Although it may be better to not drain it, I'm a bit nervous wit
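For completeness, if the prolog does exit non-zero and the node gets drained, it can be inspected and returned to service afterwards (the node name below is a placeholder):

```bash
# Check why the node was drained ...
scontrol show node node001 | grep -i reason
# ... and, once the cause is understood, return it to service.
scontrol update NodeName=node001 State=RESUME
```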

Re: [slurm-users] GPU Gres Type inconsistencies

2023-06-20 Thread Ben Roberts
For the benefit of anyone else who comes across this, I've managed to resolve the issue.
1. Remove the affected node entries from the slurm.conf on slurmctld host
2. Restart slurmctld
3. Re-add the nodes back to slurm.conf on slurmctld host
4. Restart slurmctld again
Following this,
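Sketched as commands, under the assumption that slurmctld runs under systemd (the service name and node name are placeholders that may differ per site):

```bash
# Steps 1-2: after removing the affected node entries from slurm.conf on
# the slurmctld host, restart the controller.
sudo systemctl restart slurmctld

# Steps 3-4: re-add the node entries to slurm.conf, then restart again.
sudo systemctl restart slurmctld

# Verify the GPU Gres type is now reported consistently.
scontrol show node node001 | grep -i gres
```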