[slurm-users] Mysterious job terminations on Slurm 17.11.10

Andy Riebs Thu, 31 Jan 2019 11:08:13 -0800

Hi All,

Just checking to see if this sounds familiar to anyone.


Environment:
- CentOS 7.5 x86_64
- Slurm 17.11.10 (but this also happened with 17.11.5)

We typically run about 100 tests/night, selected from a handful of favorites. For roughly 1 in 300 test runs, we see one of two mysterious failures:


1. The 5 minute cancellation

A job will be rolling along, generating its expected output, and then this message appears:


   srun: forcing job termination
   srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
   slurmstepd: error: *** STEP 3531.0 ON nodename CANCELLED AT 2019-01-30T07:35:50 ***
   srun: error: nodename: task 250: Terminated
   srun: Terminating job step 3531.0

sacct reports

          JobID               Start                 End ExitCode      State
   ------------ ------------------- ------------------- -------- ----------
   3418         2019-01-29T05:54:07 2019-01-29T05:59:16      0:9     FAILED

When these failures happen, they consistently occur at just about 5 minutes into the run.
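
As a sanity check on the "about 5 minutes" figure, here is a minimal Python sketch that computes the run time from the sacct Start and End values shown above. The helper names (elapsed_seconds, near_five_minutes) and the 30-second tolerance are assumptions made for illustration, not part of Slurm or of the test scripts.

```python
from datetime import datetime

# Timestamp layout used in the sacct Start/End columns above.
SACCT_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S"


def elapsed_seconds(start: str, end: str) -> int:
    """Return the number of seconds between two sacct timestamps."""
    started = datetime.strptime(start, SACCT_TIME_FORMAT)
    ended = datetime.strptime(end, SACCT_TIME_FORMAT)
    return int((ended - started).total_seconds())


def near_five_minutes(seconds: int, tolerance: int = 30) -> bool:
    """True if the run ended within `tolerance` seconds of the 5 minute mark.

    The tolerance is an arbitrary illustrative choice.
    """
    return abs(seconds - 300) <= tolerance


# Job 3418 from the sacct output above.
run_time = elapsed_seconds("2019-01-29T05:54:07", "2019-01-29T05:59:16")
print(run_time, near_five_minutes(run_time))
```

For job 3418 this gives 309 seconds, i.e. 5 minutes and 9 seconds.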


2. The random cancellation

As above, a job will be generating the expected output, and then we see

   srun: forcing job termination
   srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
   slurmstepd: error: *** STEP 3531.0 ON nodename CANCELLED AT 2019-01-30T07:35:50 ***
   srun: error: nodename: task 250: Terminated
   srun: Terminating job step 3531.0

But this time, sacct reports

          JobID               Start                 End ExitCode      State
   ------------ ------------------- ------------------- -------- ----------
   3531         2019-01-30T07:21:25 2019-01-30T07:35:50      0:0  COMPLETED
   3531.0       2019-01-30T07:21:27 2019-01-30T07:35:56     0:15  CANCELLED
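
For readers comparing the two cases: sacct's ExitCode column has the form "exit status:signal", where the number after the colon is the signal that terminated the process, if any. A minimal Python sketch for decoding the values seen here follows. The helper name decode_exit_code is hypothetical, and the signal names come from the local platform's signal table, which is assumed to be Linux-like.

```python
import signal


def decode_exit_code(field: str):
    """Split a sacct ExitCode value such as "0:9" into (status, signal name).

    The signal name is None when the signal part is 0 or absent.
    """
    status_text, _, signal_text = field.strip().partition(":")
    status = int(status_text)
    signum = int(signal_text) if signal_text else 0
    if signum == 0:
        return status, None
    try:
        name = signal.Signals(signum).name
    except ValueError:
        # Signal number not known on this platform.
        name = f"signal {signum}"
    return status, name


# ExitCode values taken from the sacct output in this message.
for field in ("0:9", "0:15", "0:0"):
    print(field, "->", decode_exit_code(field))
```

On a Linux host, 0:9 decodes to SIGKILL and 0:15 to SIGTERM, so the two failure modes end with different signals.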

I think we've seen these cancellations pop up as soon as a minute or two into the test run, up to perhaps 20 minutes into the run.

The only thing slightly unusual in our job submissions is that we use srun's "--immediate=120" so that the scripts can respond appropriately if a node goes down.

With SlurmctldDebug=debug2 and SlurmdDebug=debug5, there's not a clue in the slurmctld or slurmd logs.


Any thoughts on what might be happening, or what I might try next?

Andy

--
Andy Riebs
[email protected]
Hewlett-Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE
    May the source be with you!
