Hello Slurm experts,

We have a workflow where a script invokes salloc --no-shell and then launches a series of MPI jobs using srun with the --jobid= option to make use of the allocation we got from the salloc invocation. We need to do things this way because the script itself has to report the test results back to an external server running at AWS. The compute nodes within the allocated partition have no connectivity to the internet, hence our use of the --no-shell option.
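For context, the workflow looks roughly like the sketch below. The node count, time limit, and test binary are placeholders, not our actual values; the only real mechanics are capturing the job id that salloc --no-shell prints and reusing it with srun --jobid=:

```shell
parse_jobid() {
  # Extract the numeric job id from salloc's message, e.g.
  #   "salloc: Granted job allocation 2974"
  sed -n 's/.*Granted job allocation \([0-9][0-9]*\).*/\1/p'
}

# On a real cluster the script does the equivalent of (commented out
# here since it needs a running Slurm controller):
#   JOBID=$(salloc -N 1 -t 3:00:00 --no-shell 2>&1 | parse_jobid)
#   srun -n 16 -c 4 --mpi=pmix --jobid="$JOBID" ./some_mpi_test
#   echo "test exited with status $?"   # reported back to the AWS server
#   scancel "$JOBID"

# Demonstrate the job-id parsing on a canned salloc message:
echo "salloc: Granted job allocation 2974" | parse_jobid
```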
This is all fine except for an annoying behavior of Slurm. If we have no test failures, i.e. all srun'ed tests exit successfully, everything works fine. However, once we start having failed tests, and hence non-zero status returns from srun, we maybe get one or two more tests to run, and then Slurm cancels the allocation. Here's an example output from the script as it's running some MPI tests, then some fail, then Slurm drops our allocation:

ExecuteCmd start: srun -n 16 -c 4 --mpi=pmix --jobid=2974 /users/foobar/runInAllocMTT/mtt/masterWalloc_scratch/TestGet_IBM/ompi-tests/ibm/random/leakcatch
stdout: seed value: -219475876
stdout: 0
stdout: 1
stdout: 2
stdout: 3
stdout: 4
stdout: 5
stdout: 6
stdout: 7
stdout: 8
stdout: 9
stdout: 10
stdout: 11
stdout: 12
stdout: 13
stdout: 14
stdout: 15
stdout: 16
stdout: 17
stdout: 18
stdout: 19
stdout: 20
stdout: ERROR: buf 778 element 749856 is 103 should be 42
stderr: --------------------------------------------------------------------------
stderr: MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
stderr: with errorcode 16.
stderr:
stderr: NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
stderr: You may or may not see output from other processes, depending on
stderr: exactly when Open MPI kills them.
stderr: --------------------------------------------------------------------------
stderr: srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
stderr: slurmstepd: error: *** STEP 2974.490 ON st03 CANCELLED AT 2019-01-22T20:02:22 ***
stderr: srun: error: st03: task 0: Exited with exit code 16
stderr: srun: error: st03: tasks 1-15: Killed
ExecuteCmd done
ExecuteCmd start: srun -n 16 -c 4 --mpi=pmix --jobid=2974 /users/foobar/runInAllocMTT/mtt/masterWalloc_scratch/TestGet_IBM/ompi-tests/ibm/random/maxsoak
stderr: srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
stderr: slurmstepd: error: *** STEP 2974.491 ON st03 CANCELLED AT 2019-01-22T23:06:08 DUE TO TIME LIMIT ***
stderr: srun: error: st03: tasks 0-15: Terminated
ExecuteCmd done
ExecuteCmd start: srun -n 16 -c 4 --mpi=pmix --jobid=2974 /users/foobar/runInAllocMTT/mtt/masterWalloc_scratch/TestGet_IBM/ompi-tests/ibm/random/op_commutative
stderr: srun: error: Unable to allocate resources: Invalid job id specified
ExecuteCmd done

This is not due to the allocation being revoked because of a time limit, even though the message says so. The job had been running only about 30 minutes into a 3-hour reservation. We've double-checked that: on one cluster which we can configure, we set the default job time limit to infinite and still observe the issue. But the fact that Slurm reports it as a TIME LIMIT cancellation may be hinting at why Slurm revokes the allocation.

We see this on every cluster we've tried so far, so it doesn't appear to be a site-specific configuration issue.

Any insights into how to work around or fix this problem would be appreciated.

Thanks,

Howard

--
Howard Pritchard
B Schedule
HPC-ENV
Office 9, 2nd floor Research Park
TA-03, Building 4200, Room 203
Los Alamos National Laboratory