Overzealous node cleanup epilog script? 

> On 11 May 2020, at 17:56, Alastair Neil <ajneil.t...@gmail.com> wrote:
> 
> 
> Hi there,
> 
> We are using Slurm 18.08 and had a weird occurrence over the weekend.  A user
> canceled one of his jobs using scancel, and two other jobs of his running on
> the same node were killed at the same time.  The jobs had no dependencies,
> but each was allocated 1 GPU.  I am curious to know why this happened, and
> if this is a known bug, is there a workaround to prevent it?  Any
> suggestions gratefully received.
> 
> -Alastair
> 
> FYI
> The cancelled job (533898) has this at the end of the .err file:
> 
>> slurmstepd: error: *** JOB 533898 ON NODE056 CANCELLED AT 2020-05-10T00:26:03 ***
> 
> Both of the killed jobs (533900 and 533902) have this:
> 
>> slurmstepd: error: get_exit_code task 0 died by signal
> 
> 
> Here is the slurmd log from the node, and the show-job output for each job:
> 
>> [2020-05-09T19:49:46.735] _run_prolog: run job script took usec=4
>> [2020-05-09T19:49:46.735] _run_prolog: prolog with lock for job 533898 ran for 0 seconds
>> [2020-05-09T19:49:46.754] ====================
>> [2020-05-09T19:49:46.754] batch_job:533898 job_mem:10240MB
>> [2020-05-09T19:49:46.754] JobNode[0] CPU[0] Job alloc
>> [2020-05-09T19:49:46.755] JobNode[0] CPU[1] Job alloc
>> [2020-05-09T19:49:46.756] JobNode[0] CPU[2] Job alloc
>> [2020-05-09T19:49:46.757] JobNode[0] CPU[3] Job alloc
>> [2020-05-09T19:49:46.758] ====================
>> [2020-05-09T19:49:46.758] Launching batch job 533898 for UID 1224431221
>> [2020-05-09T19:53:14.060] _run_prolog: run job script took usec=3
>> [2020-05-09T19:53:14.060] _run_prolog: prolog with lock for job 533900 ran for 0 seconds
>> [2020-05-09T19:53:14.080] ====================
>> [2020-05-09T19:53:14.080] batch_job:533900 job_mem:10240MB
>> [2020-05-09T19:53:14.081] JobNode[0] CPU[4] Job alloc
>> [2020-05-09T19:53:14.082] JobNode[0] CPU[5] Job alloc
>> [2020-05-09T19:53:14.083] JobNode[0] CPU[6] Job alloc
>> [2020-05-09T19:53:14.083] JobNode[0] CPU[7] Job alloc
>> [2020-05-09T19:53:14.084] ====================
>> [2020-05-09T19:53:14.085] Launching batch job 533900 for UID 1224431221
>> [2020-05-09T19:55:26.283] _run_prolog: run job script took usec=21
>> [2020-05-09T19:55:26.284] _run_prolog: prolog with lock for job 533902 ran for 0 seconds
>> [2020-05-09T19:55:26.304] ====================
>> [2020-05-09T19:55:26.304] batch_job:533902 job_mem:10240MB
>> [2020-05-09T19:55:26.304] JobNode[0] CPU[8] Job alloc
>> [2020-05-09T19:55:26.305] JobNode[0] CPU[9] Job alloc
>> [2020-05-09T19:55:26.306] JobNode[0] CPU[10] Job alloc
>> [2020-05-09T19:55:26.306] JobNode[0] CPU[11] Job alloc
>> [2020-05-09T19:55:26.307] ====================
>> [2020-05-09T19:55:26.307] Launching batch job 533902 for UID 1224431221
>> [2020-05-10T00:26:03.127] [533898.extern] done with job
>> [2020-05-10T00:26:03.975] [533898.batch] error: *** JOB 533898 ON NODE056 CANCELLED AT 2020-05-10T00:26:03 ***
>> [2020-05-10T00:26:04.425] [533898.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 15
>> [2020-05-10T00:26:04.428] [533898.batch] done with job
>> [2020-05-10T00:26:05.202] [533900.batch] error: get_exit_code task 0 died by signal
>> [2020-05-10T00:26:05.202] [533902.batch] error: get_exit_code task 0 died by signal
>> [2020-05-10T00:26:05.202] [533900.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 9
>> [2020-05-10T00:26:05.202] [533902.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 9
>> [2020-05-10T00:26:05.211] [533902.batch] done with job
>> [2020-05-10T00:26:05.216] [533900.batch] done with job
>> [2020-05-10T00:26:05.234] [533902.extern] done with job
>> [2020-05-10T00:26:05.235] [533900.extern] done with job
> 
>> [root@node056 2020-05-10]# cat 533{898,900,902}/show-job.txt
>> JobId=533898 JobName=r18-relu-ent
>>  UserId=sshres2(1224431221) GroupId=users(100) MCS_label=N/A
>>  Priority=19375 Nice=0 Account=csjkosecka QOS=csqos
>>  JobState=CANCELLED Reason=None Dependency=(null)
>>  Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:15
>>  RunTime=04:36:17 TimeLimit=5-00:00:00 TimeMin=N/A
>>  SubmitTime=2020-05-09T19:49:45 EligibleTime=2020-05-09T19:49:45
>>  AccrueTime=2020-05-09T19:49:45
>>  StartTime=2020-05-09T19:49:46 EndTime=2020-05-10T00:26:03 Deadline=N/A
>>  PreemptTime=None SuspendTime=None SecsPreSuspend=0
>>  LastSchedEval=2020-05-09T19:49:46
>>  Partition=gpuq AllocNode:Sid=ARGO-2:7221
>>  ReqNodeList=(null) ExcNodeList=(null)
>>  NodeList=NODE056
>>  BatchHost=NODE056
>>  NumNodes=1 NumCPUs=4 NumTasks=0 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
>>  TRES=cpu=4,mem=10G,node=1,billing=4,gres/gpu=1
>>  Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>>  MinCPUsNode=4 MinMemoryNode=10G MinTmpDiskNode=0
>>  Features=(null) DelayBoot=00:00:00
>>  OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>>  Command=/scratch/sshres2/workspace-scratch/cs747-project/command_resnet18_relu_ent.slurm
>>  WorkDir=/scratch/sshres2/workspace-scratch/cs747-project
>>  StdErr=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-relu-ent-533898.err
>>  StdIn=/dev/null
>>  StdOut=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-relu-ent-533898.out
>>  Power=
>>  TresPerNode=gpu:1
>> 
>> JobId=533900 JobName=r18-soft
>>  UserId=sshres2(1224431221) GroupId=users(100) MCS_label=N/A
>>  Priority=19375 Nice=0 Account=csjkosecka QOS=csqos
>>  JobState=FAILED Reason=JobLaunchFailure Dependency=(null)
>>  Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:9
>>  RunTime=04:32:51 TimeLimit=5-00:00:00 TimeMin=N/A
>>  SubmitTime=2020-05-09T19:53:13 EligibleTime=2020-05-09T19:53:13
>>  AccrueTime=2020-05-09T19:53:13
>>  StartTime=2020-05-09T19:53:14 EndTime=2020-05-10T00:26:05 Deadline=N/A
>>  PreemptTime=None SuspendTime=None SecsPreSuspend=0
>>  LastSchedEval=2020-05-09T19:53:14
>>  Partition=gpuq AllocNode:Sid=ARGO-2:7221
>>  ReqNodeList=(null) ExcNodeList=(null)
>>  NodeList=NODE056
>>  BatchHost=NODE056
>>  NumNodes=1 NumCPUs=4 NumTasks=0 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
>>  TRES=cpu=4,mem=10G,node=1,billing=4,gres/gpu=1
>>  Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>>  MinCPUsNode=4 MinMemoryNode=10G MinTmpDiskNode=0
>>  Features=(null) DelayBoot=00:00:00
>>  OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>>  Command=/scratch/sshres2/workspace-scratch/cs747-project/command_resnet18_soft.slurm
>>  WorkDir=/scratch/sshres2/workspace-scratch/cs747-project
>>  StdErr=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-soft-533900.err
>>  StdIn=/dev/null
>>  StdOut=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-soft-533900.out
>>  Power=
>>  TresPerNode=gpu:1
>> 
>> JobId=533902 JobName=r18-soft-ent
>>  UserId=sshres2(1224431221) GroupId=users(100) MCS_label=N/A
>>  Priority=19375 Nice=0 Account=csjkosecka QOS=csqos
>>  JobState=FAILED Reason=JobLaunchFailure Dependency=(null)
>>  Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:9
>>  RunTime=04:30:39 TimeLimit=5-00:00:00 TimeMin=N/A
>>  SubmitTime=2020-05-09T19:55:26 EligibleTime=2020-05-09T19:55:26
>>  AccrueTime=2020-05-09T19:55:26
>>  StartTime=2020-05-09T19:55:26 EndTime=2020-05-10T00:26:05 Deadline=N/A
>>  PreemptTime=None SuspendTime=None SecsPreSuspend=0
>>  LastSchedEval=2020-05-09T19:55:26
>>  Partition=gpuq AllocNode:Sid=ARGO-2:7221
>>  ReqNodeList=(null) ExcNodeList=(null)
>>  NodeList=NODE056
>>  BatchHost=NODE056
>>  NumNodes=1 NumCPUs=4 NumTasks=0 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
>>  TRES=cpu=4,mem=10G,node=1,billing=4,gres/gpu=1
>>  Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>>  MinCPUsNode=4 MinMemoryNode=10G MinTmpDiskNode=0
>>  Features=(null) DelayBoot=00:00:00
>>  OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>>  Command=/scratch/sshres2/workspace-scratch/cs747-project/command_resnet18_soft_ent.slurm
>>  WorkDir=/scratch/sshres2/workspace-scratch/cs747-project
>>  StdErr=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-soft-ent-533902.err
>>  StdIn=/dev/null
>>  StdOut=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-soft-ent-533902.out
>>  Power=
>>  TresPerNode=gpu:1
> 
> 

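The timing in the slurmd log points at per-job cleanup on the node rather than at the scheduler itself: the cancelled job (533898) finished its batch step with status 15 (SIGTERM, the normal scancel path), and about one second later both sibling jobs reported status 9, i.e. their task 0 was SIGKILLed by something outside Slurm's own termination of those jobs.  The usual suspect for that pattern is a node cleanup Epilog that fires whenever any job ends.  You can check what slurmd is configured to run at job end with:

    scontrol show config | grep -i -E 'prolog|epilog'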
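If the epilog cleans up with something unconditional like "pkill -9 -u $SLURM_JOB_USER", it will kill every process the user owns on the node, including his other running jobs, which matches exactly what you saw.  The usual fix is to reap stray processes only when the finishing job is the user's last job on the node.  Below is a minimal, untested sketch of such a guard; note the "-le 1" test, which allows for the finishing job itself still being listed (state COMPLETING) when the epilog runs:

    #!/bin/bash
    # Epilog sketch: clean up leftover user processes, but only when the
    # finishing job is the user's last job on this node, so that sibling
    # jobs survive an scancel of one of them.

    [ -n "$SLURM_JOB_USER" ] || exit 0

    # Never touch system accounts.
    uid=$(id -u "$SLURM_JOB_USER" 2>/dev/null) || exit 0
    [ "$uid" -ge 1000 ] || exit 0

    # Count this user's jobs still associated with this node.  The job
    # that is ending may still be listed, so a single entry is fine.
    njobs=$(squeue --noheader --user="$SLURM_JOB_USER" \
                   --nodelist="$(hostname -s)" | wc -l)

    if [ "$njobs" -le 1 ]; then
        pkill -KILL -u "$SLURM_JOB_USER"
    fi
    exit 0

Also worth noting: your log shows an extern step for each job ([533898.extern] and friends), so the jobs are already contained in cgroups; with proctrack/cgroup reaping job processes, an aggressive pkill-style epilog may be unnecessary in the first place.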