I see one job cancelled and two jobs failed. Your slurmd log is incomplete -- it doesn't show the two failed jobs exiting/failing, so the real error is not here.

It might also be helpful to look through slurmctld's log, starting from when the first job was cancelled, for any messages mentioning the node or the two failed jobs. I've had nodes do strange things on job cancel. The last one I tracked down to the job epilog failing because it was NFS-mounted and NFS was being slower than Slurm liked, so Slurm took the node offline and killed everything on it.
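Something along these lines should pull out the relevant entries (the slurmctld log path below is only a guess -- check SlurmctldLogFile in your slurm.conf; the job IDs and node name are the ones from your output):

    # on the controller: everything mentioning the node or the affected jobs
    grep -iE '533898|533900|533902|node056' /var/log/slurm/slurmctld.log

    # what accounting recorded for the two killed jobs
    sacct -j 533900,533902 --format=JobID,JobName,State,ExitCode,DerivedExitCode,Start,End

    # and whether the node was drained/downed with a reason set
    scontrol show node NODE056 | grep -i reason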
On Mon, May 11, 2020 at 12:55 PM Alastair Neil <ajneil.t...@gmail.com> wrote:
>
> Hi there,
>
> We are using slurm 18.08 and had a weird occurrence over the weekend. A user canceled one of his jobs using scancel, and two additional jobs of the user running on the same node were killed concurrently. The jobs had no dependency, but they were all allocated 1 gpu. I am curious to know why this happened, and if this is a known bug is there a workaround to prevent it happening? Any suggestions gratefully received.
>
> -Alastair
>
> FYI
> The cancelled job (533898) has this at the end of the .err file:
>
>> slurmstepd: error: *** JOB 533898 ON NODE056 CANCELLED AT 2020-05-10T00:26:03 ***
>
> both of the killed jobs (533900 and 533902) have this:
>
>> slurmstepd: error: get_exit_code task 0 died by signal
>
> here is the slurmd log from the node and the show-job output for each job:
>
>> [2020-05-09T19:49:46.735] _run_prolog: run job script took usec=4
>> [2020-05-09T19:49:46.735] _run_prolog: prolog with lock for job 533898 ran for 0 seconds
>> [2020-05-09T19:49:46.754] ====================
>> [2020-05-09T19:49:46.754] batch_job:533898 job_mem:10240MB
>> [2020-05-09T19:49:46.754] JobNode[0] CPU[0] Job alloc
>> [2020-05-09T19:49:46.755] JobNode[0] CPU[1] Job alloc
>> [2020-05-09T19:49:46.756] JobNode[0] CPU[2] Job alloc
>> [2020-05-09T19:49:46.757] JobNode[0] CPU[3] Job alloc
>> [2020-05-09T19:49:46.758] ====================
>> [2020-05-09T19:49:46.758] Launching batch job 533898 for UID 1224431221
>> [2020-05-09T19:53:14.060] _run_prolog: run job script took usec=3
>> [2020-05-09T19:53:14.060] _run_prolog: prolog with lock for job 533900 ran for 0 seconds
>> [2020-05-09T19:53:14.080] ====================
>> [2020-05-09T19:53:14.080] batch_job:533900 job_mem:10240MB
>> [2020-05-09T19:53:14.081] JobNode[0] CPU[4] Job alloc
>> [2020-05-09T19:53:14.082] JobNode[0] CPU[5] Job alloc
>> [2020-05-09T19:53:14.083] JobNode[0] CPU[6] Job alloc
>> [2020-05-09T19:53:14.083] JobNode[0] CPU[7] Job alloc
>> [2020-05-09T19:53:14.084] ====================
>> [2020-05-09T19:53:14.085] Launching batch job 533900 for UID 1224431221
>> [2020-05-09T19:55:26.283] _run_prolog: run job script took usec=21
>> [2020-05-09T19:55:26.284] _run_prolog: prolog with lock for job 533902 ran for 0 seconds
>> [2020-05-09T19:55:26.304] ====================
>> [2020-05-09T19:55:26.304] batch_job:533902 job_mem:10240MB
>> [2020-05-09T19:55:26.304] JobNode[0] CPU[8] Job alloc
>> [2020-05-09T19:55:26.305] JobNode[0] CPU[9] Job alloc
>> [2020-05-09T19:55:26.306] JobNode[0] CPU[10] Job alloc
>> [2020-05-09T19:55:26.306] JobNode[0] CPU[11] Job alloc
>> [2020-05-09T19:55:26.307] ====================
>> [2020-05-09T19:55:26.307] Launching batch job 533902 for UID 1224431221
>> [2020-05-10T00:26:03.127] [533898.extern] done with job
>> [2020-05-10T00:26:03.975] [533898.batch] error: *** JOB 533898 ON NODE056 CANCELLED AT 2020-05-10T00:26:03 ***
>> [2020-05-10T00:26:04.425] [533898.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 15
>> [2020-05-10T00:26:04.428] [533898.batch] done with job
>> [2020-05-10T00:26:05.202] [533900.batch] error: get_exit_code task 0 died by signal
>> [2020-05-10T00:26:05.202] [533902.batch] error: get_exit_code task 0 died by signal
>> [2020-05-10T00:26:05.202] [533900.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 9
>> [2020-05-10T00:26:05.202] [533902.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 9
>> [2020-05-10T00:26:05.211] [533902.batch] done with job
>> [2020-05-10T00:26:05.216] [533900.batch] done with job
>> [2020-05-10T00:26:05.234] [533902.extern] done with job
>> [2020-05-10T00:26:05.235] [533900.extern] done with job
>
>> [root@node056 2020-05-10]# cat 533{898,900,902}/show-job.txt
>> JobId=533898 JobName=r18-relu-ent
>>    UserId=sshres2(1224431221) GroupId=users(100) MCS_label=N/A
>>    Priority=19375 Nice=0 Account=csjkosecka QOS=csqos
>>    JobState=CANCELLED Reason=None Dependency=(null)
>>    Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:15
>>    RunTime=04:36:17 TimeLimit=5-00:00:00 TimeMin=N/A
>>    SubmitTime=2020-05-09T19:49:45 EligibleTime=2020-05-09T19:49:45
>>    AccrueTime=2020-05-09T19:49:45
>>    StartTime=2020-05-09T19:49:46 EndTime=2020-05-10T00:26:03 Deadline=N/A
>>    PreemptTime=None SuspendTime=None SecsPreSuspend=0
>>    LastSchedEval=2020-05-09T19:49:46
>>    Partition=gpuq AllocNode:Sid=ARGO-2:7221
>>    ReqNodeList=(null) ExcNodeList=(null)
>>    NodeList=NODE056
>>    BatchHost=NODE056
>>    NumNodes=1 NumCPUs=4 NumTasks=0 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
>>    TRES=cpu=4,mem=10G,node=1,billing=4,gres/gpu=1
>>    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>>    MinCPUsNode=4 MinMemoryNode=10G MinTmpDiskNode=0
>>    Features=(null) DelayBoot=00:00:00
>>    OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>>    Command=/scratch/sshres2/workspace-scratch/cs747-project/command_resnet18_relu_ent.slurm
>>    WorkDir=/scratch/sshres2/workspace-scratch/cs747-project
>>    StdErr=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-relu-ent-533898.err
>>    StdIn=/dev/null
>>    StdOut=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-relu-ent-533898.out
>>    Power=
>>    TresPerNode=gpu:1
>>
>> JobId=533900 JobName=r18-soft
>>    UserId=sshres2(1224431221) GroupId=users(100) MCS_label=N/A
>>    Priority=19375 Nice=0 Account=csjkosecka QOS=csqos
>>    JobState=FAILED Reason=JobLaunchFailure Dependency=(null)
>>    Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:9
>>    RunTime=04:32:51 TimeLimit=5-00:00:00 TimeMin=N/A
>>    SubmitTime=2020-05-09T19:53:13 EligibleTime=2020-05-09T19:53:13
>>    AccrueTime=2020-05-09T19:53:13
>>    StartTime=2020-05-09T19:53:14 EndTime=2020-05-10T00:26:05 Deadline=N/A
>>    PreemptTime=None SuspendTime=None SecsPreSuspend=0
>>    LastSchedEval=2020-05-09T19:53:14
>>    Partition=gpuq AllocNode:Sid=ARGO-2:7221
>>    ReqNodeList=(null) ExcNodeList=(null)
>>    NodeList=NODE056
>>    BatchHost=NODE056
>>    NumNodes=1 NumCPUs=4 NumTasks=0 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
>>    TRES=cpu=4,mem=10G,node=1,billing=4,gres/gpu=1
>>    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>>    MinCPUsNode=4 MinMemoryNode=10G MinTmpDiskNode=0
>>    Features=(null) DelayBoot=00:00:00
>>    OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>>    Command=/scratch/sshres2/workspace-scratch/cs747-project/command_resnet18_soft.slurm
>>    WorkDir=/scratch/sshres2/workspace-scratch/cs747-project
>>    StdErr=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-soft-533900.err
>>    StdIn=/dev/null
>>    StdOut=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-soft-533900.out
>>    Power=
>>    TresPerNode=gpu:1
>>
>> JobId=533902 JobName=r18-soft-ent
>>    UserId=sshres2(1224431221) GroupId=users(100) MCS_label=N/A
>>    Priority=19375 Nice=0 Account=csjkosecka QOS=csqos
>>    JobState=FAILED Reason=JobLaunchFailure Dependency=(null)
>>    Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:9
>>    RunTime=04:30:39 TimeLimit=5-00:00:00 TimeMin=N/A
>>    SubmitTime=2020-05-09T19:55:26 EligibleTime=2020-05-09T19:55:26
>>    AccrueTime=2020-05-09T19:55:26
>>    StartTime=2020-05-09T19:55:26 EndTime=2020-05-10T00:26:05 Deadline=N/A
>>    PreemptTime=None SuspendTime=None SecsPreSuspend=0
>>    LastSchedEval=2020-05-09T19:55:26
>>    Partition=gpuq AllocNode:Sid=ARGO-2:7221
>>    ReqNodeList=(null) ExcNodeList=(null)
>>    NodeList=NODE056
>>    BatchHost=NODE056
>>    NumNodes=1 NumCPUs=4 NumTasks=0 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
>>    TRES=cpu=4,mem=10G,node=1,billing=4,gres/gpu=1
>>    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>>    MinCPUsNode=4 MinMemoryNode=10G MinTmpDiskNode=0
>>    Features=(null) DelayBoot=00:00:00
>>    OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>>    Command=/scratch/sshres2/workspace-scratch/cs747-project/command_resnet18_soft_ent.slurm
>>    WorkDir=/scratch/sshres2/workspace-scratch/cs747-project
>>    StdErr=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-soft-ent-533902.err
>>    StdIn=/dev/null
>>    StdOut=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-soft-ent-533902.out
>>    Power=
>>    TresPerNode=gpu:1