Hello Slurm experts,

We have a workflow where a script invokes salloc --no-shell and then launches a series of MPI jobs using srun with the --jobid= option, so that they run inside the allocation we got from the salloc invocation. We need to do things this way because the script itself has to report the test results back to an external server running at AWS. The compute nodes in the allocated partition have no internet connectivity, hence our use of the --no-shell option.
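For concreteness, the driver looks roughly like the sketch below. The paths, node/time arguments, and the parse_jobid helper are placeholders of ours, not our actual script; the job-id parsing assumes salloc's usual "Granted job allocation N" message:

```shell
#!/bin/sh
# Sketch of the driver (hypothetical paths and arguments). parse_jobid
# extracts the job id from salloc's "Granted job allocation <N>" line.

parse_jobid() {
    printf '%s\n' "$1" | sed -n 's/.*Granted job allocation \([0-9][0-9]*\).*/\1/p'
}

# Only attempt the real workflow where Slurm is actually installed.
if command -v salloc >/dev/null 2>&1; then
    # Request an allocation without spawning a shell on the compute nodes.
    JOBID=$(parse_jobid "$(salloc --no-shell -N 1 -t 3:00:00 2>&1)")

    # Run each MPI test inside the existing allocation; the step that
    # reports each result to the external AWS server is omitted here.
    for test in leakcatch maxsoak op_commutative; do
        srun -n 16 -c 4 --mpi=pmix --jobid="$JOBID" "./$test"
    done

    # Release the allocation when done.
    scancel "$JOBID"
fi
```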

This all works fine except for one annoying Slurm behavior. If we have no test failures, i.e. all of the srun'ed tests exit successfully, everything works. However, once tests start failing, and srun starts returning non-zero status, we get maybe one or two more tests to run, and then Slurm cancels the allocation.
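The script itself tolerates the failures; it records each test's exit status and moves on. A minimal sketch of that loop (with true/false standing in for passing and failing srun invocations, and run_tests being a name of ours):

```shell
#!/bin/sh
# Minimal sketch: record each test's exit status and keep going, rather
# than letting one failure stop the loop. "true"/"false" stand in for
# the actual srun commands.

run_tests() {
    failures=0
    for cmd in "$@"; do
        if "$cmd"; then
            echo "PASS: $cmd"
        else
            # $? here is the exit status of the failed command above.
            echo "FAIL: $cmd (exit code $?)"
            failures=$((failures + 1))
        fi
    done
    echo "failures: $failures"
}

run_tests true false true
```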

Here's example output from the script as it's running some MPI tests: a few fail, and then Slurm drops our allocation:


ExecuteCmd start: srun -n 16 -c 4 --mpi=pmix --jobid=2974 /users/foobar/runInAllocMTT/mtt/masterWalloc_scratch/TestGet_IBM/ompi-tests/ibm/random/leakcatch
stdout: seed value: -219475876
stdout: 0
stdout: 1
stdout: 2
stdout: 3
stdout: 4
stdout: 5
stdout: 6
stdout: 7
stdout: 8
stdout: 9
stdout: 10
stdout: 11
stdout: 12
stdout: 13
stdout: 14
stdout: 15
stdout: 16
stdout: 17
stdout: 18
stdout: 19
stdout: 20
stdout: ERROR: buf 778 element 749856 is 103 should be 42
stderr: --------------------------------------------------------------------------
stderr: MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
stderr: with errorcode 16.
stderr:
stderr: NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
stderr: You may or may not see output from other processes, depending on
stderr: exactly when Open MPI kills them.
stderr: --------------------------------------------------------------------------
stderr: srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
stderr: slurmstepd: error: *** STEP 2974.490 ON st03 CANCELLED AT 2019-01-22T20:02:22 ***
stderr: srun: error: st03: task 0: Exited with exit code 16
stderr: srun: error: st03: tasks 1-15: Killed
ExecuteCmd done

ExecuteCmd start: srun -n 16 -c 4 --mpi=pmix --jobid=2974 /users/foobar/runInAllocMTT/mtt/masterWalloc_scratch/TestGet_IBM/ompi-tests/ibm/random/maxsoak
stderr: srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
stderr: slurmstepd: error: *** STEP 2974.491 ON st03 CANCELLED AT 2019-01-22T23:06:08 DUE TO TIME LIMIT ***
stderr: srun: error: st03: tasks 0-15: Terminated
ExecuteCmd done

ExecuteCmd start: srun -n 16 -c 4 --mpi=pmix --jobid=2974 /users/foobar/runInAllocMTT/mtt/masterWalloc_scratch/TestGet_IBM/ompi-tests/ibm/random/op_commutative
stderr: srun: error: Unable to allocate resources: Invalid job id specified
ExecuteCmd done

This is not the allocation actually hitting a time limit, despite what the message says: the job had been running only about 30 minutes into a 3-hour reservation. We've double-checked that, too; on one cluster that we can configure, we set the default job time limit to infinite and still observe the issue. But the fact that Slurm reports the cancellation as a TIMELIMIT may be a hint about why Slurm revokes the allocation.

We see this on every cluster we've tried so far, so it doesn't appear to be a site-specific configuration issue.

Any insights into how to work around or fix this problem would be appreciated.

Thanks,

Howard


--
Howard Pritchard
B Schedule
HPC-ENV
Office 9, 2nd floor Research Park
TA-03, Building 4200, Room 203
Los Alamos National Laboratory
