My site recently updated to Slurm 21.08.6, and for the most part everything went
fine. Two Ubuntu nodes, however, are having issues: slurmd cannot execve the
jobs on those nodes. For example:

[jrlang@tmgt1 ~]$ salloc -A ARCC --nodes=1 --ntasks=20 -t 1:00:00 --bell --nodelist=mdgx01 --partition=dgx /bin/bash
salloc: Granted job allocation 2328489
[jrlang@tmgt1 ~]$ srun hostname
srun: error: task 0 launch failed: Slurmd could not execve job
srun: error: task 1 launch failed: Slurmd could not execve job
srun: error: task 2 launch failed: Slurmd could not execve job
srun: error: task 3 launch failed: Slurmd could not execve job
srun: error: task 4 launch failed: Slurmd could not execve job
srun: error: task 5 launch failed: Slurmd could not execve job
srun: error: task 6 launch failed: Slurmd could not execve job
srun: error: task 7 launch failed: Slurmd could not execve job
srun: error: task 8 launch failed: Slurmd could not execve job
srun: error: task 9 launch failed: Slurmd could not execve job
srun: error: task 10 launch failed: Slurmd could not execve job
srun: error: task 11 launch failed: Slurmd could not execve job
srun: error: task 12 launch failed: Slurmd could not execve job
srun: error: task 13 launch failed: Slurmd could not execve job
srun: error: task 14 launch failed: Slurmd could not execve job
srun: error: task 15 launch failed: Slurmd could not execve job
srun: error: task 16 launch failed: Slurmd could not execve job
srun: error: task 17 launch failed: Slurmd could not execve job
srun: error: task 18 launch failed: Slurmd could not execve job
srun: error: task 19 launch failed: Slurmd could not execve job

Looking in slurmd-mdgx01.log, we see only:

[2022-03-24T14:44:02.408] [2328501.interactive] error: Failed to invoke task plugins: one of task_p_pre_setuid functions returned error
[2022-03-24T14:44:02.409] [2328501.interactive] error: job_manager: exiting abnormally: Slurmd could not execve job
[2022-03-24T14:44:02.411] [2328501.interactive] done with job
Note that this issue did not occur with Slurm 20.11.8.

Does anyone have an idea what could be causing this? I'm stumped.

Jeff
