Hi

One of my users reported that a job was cancelled before it completed. She got this:

"
slurmstepd: *** JOB 390031 ON bigger4 CANCELLED AT 2020-05-18T22:27:04 ***
"

The job was apparently cancelled by root:

"
sacct -j 390031 --format="jobid,state%30"
       JobID                          State
------------ ------------------------------
390031                       CANCELLED by 0
390031.batch                      CANCELLED
"
"

I can only find this in the logs:

"
[2020-05-18T22:27:03.954] debug2: _slurm_rpc_dump_partitions, size=542 usec=87
[2020-05-18T22:27:04.032] _slurm_rpc_kill_job2: REQUEST_KILL_JOB job 390031 uid 0
[2020-05-18T22:27:04.032] debug3: User (null)(1501) doesn't have a default account
[2020-05-18T22:27:04.032] debug3: cons_res: _rm_job_from_res: job 390031 action 0
[2020-05-18T22:27:04.032] debug3: cons_res: removed job 390031 from part HPC row 0
[2020-05-18T22:27:04.032] debug2: Spawning RPC agent for msg_type REQUEST_TERMINATE_JOB
[2020-05-18T22:27:04.033] _job_signal: 9 of running JobID=390031 State=0x8004 NodeCnt=4 successful 0x8004
...
[2020-05-18T22:27:19.143] debug2: Processing RPC: REQUEST_COMPLETE_BATCH_SCRIPT from uid=0 JobId=390031
[2020-05-18T22:27:19.143] job_complete: JobID=390031 State=0x8004 NodeCnt=1 WTERMSIG 15
[2020-05-18T22:27:19.143] debug2: _slurm_rpc_complete_batch_script JobId=390031: Job/step already completing or completed
"

How do I determine why the job was cancelled? Usually this only happens when the OOM killer strikes, but that doesn't seem to be the case here.
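(To rule out the OOM killer I'd normally look at the kernel log on the node, roughly like this; the node name is just taken from the message above and the grep pattern may need adjusting:)

"
ssh bigger4 'dmesg -T | grep -iE "oom|killed process"'
"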

Thanks,

Torkil

