Re: [slurm-users] additional jobs killed by scancel.

Steven Dick Wed, 13 May 2020 15:26:32 -0700

Hmm, works for me.  Maybe they added that in more recent versions of slurm.
I'm using version 18+
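
In the meantime, the ExitCode values in the show-job output you posted
already say which signal ended each job. Slurm prints ExitCode as
exit_status:signal, so 0:15 is SIGTERM for the cancelled job and 0:9 is
SIGKILL for the other two, which matches the WTERMSIG 9 lines in the
slurmctld log. Here is a minimal Python sketch for decoding those, assuming
that exit:signal format; decode_exit_code is just a name made up for
illustration, not part of Slurm:

```python
import signal


def decode_exit_code(field):
    """Split a Slurm 'exit:signal' string such as '0:9'.

    Returns (exit_status, signal_name); signal_name is None when the
    signal part is 0 or missing. Hypothetical helper, not a Slurm API.
    """
    exit_part, _, sig_part = field.partition(":")
    exit_status = int(exit_part)
    sig_num = int(sig_part) if sig_part else 0
    if sig_num == 0:
        return exit_status, None
    try:
        name = signal.Signals(sig_num).name
    except ValueError:
        # Signal number this platform has no name for
        name = "signal %d" % sig_num
    return exit_status, name


# ExitCode values copied from the show-job output in this thread
for job, code in [("533898", "0:15"), ("533900", "0:9"), ("533902", "0:9")]:
    print(job, decode_exit_code(code))
```

That only tells you what killed the tasks, not who sent the signal, so the
slurmd and slurmctld logs are still the place to look for the cause.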


On Wed, May 13, 2020 at 5:12 PM Alastair Neil <ajneil.t...@gmail.com> wrote:
>
> invalid field requested: "reason"
>
> On Tue, 12 May 2020 at 16:47, Steven Dick <kg4...@gmail.com> wrote:
>>
>> What do you get from
>>
>> sacct -o jobid,elapsed,reason,exit -j 533900,533902
>>
>> On Tue, May 12, 2020 at 4:12 PM Alastair Neil <ajneil.t...@gmail.com> wrote:
>> >
>> > The log is continuous and has all the messages logged by slurmd on the 
>> > node for all the jobs mentioned. Below are the entries from the slurmctld 
>> > log:
>> >
>> >> [2020-05-10T00:26:03.097] _slurm_rpc_kill_job: REQUEST_KILL_JOB 
>> >> JobId=533898 uid 1224431221
>> >>
>> >> [2020-05-10T00:26:03.098] email msg to sshr...@masonlive.gmu.edu: Slurm 
>> >> Job_id=533898 Name=r18-relu-ent Ended, Run time 04:36:17, CANCELLED, 
>> >> ExitCode 0
>> >> [2020-05-10T00:26:03.098] job_signal: 9 of running JobId=533898 
>> >> successful 0x8004
>> >> [2020-05-10T00:26:05.204] _job_complete: JobId=533902 WTERMSIG 9
>> >> [2020-05-10T00:26:05.204] email msg to sshr...@masonlive.gmu.edu: Slurm 
>> >> Job_id=533902 Name=r18-soft-ent Failed, Run time 04:30:39, FAILED
>> >> [2020-05-10T00:26:05.205] _job_complete: JobId=533902 done
>> >> [2020-05-10T00:26:05.210] _job_complete: JobId=533900 WTERMSIG 9
>> >> [2020-05-10T00:26:05.210] email msg to sshr...@masonlive.gmu.edu: Slurm 
>> >> Job_id=533900 Name=r18-soft Failed, Run time 04:32:51, FAILED
>> >> [2020-05-10T00:26:05.215] _job_complete: JobId=533900 done
>> >
>> >
>> > It is curious that all the jobs were running on the same processor; 
>> > perhaps this is a cgroup-related failure?
>> >
>> > On Tue, 12 May 2020 at 10:10, Steven Dick <kg4...@gmail.com> wrote:
>> >>
>> >> I see one job cancelled and two jobs failed.
>> >> Your slurmd log is incomplete -- it doesn't show the two failed jobs
>> >> exiting/failing, so the real error is not here.
>> >>
>> >> It might also be helpful to look through slurmctld's log starting from
>> >> when the first job was canceled, looking at any messages mentioning
>> >> the node or the two failed jobs.
>> >>
>> >> I've had nodes do strange things on job cancel.  Last one I tracked
>> >> down to the job epilog failing because it was NFS mounted and nfs was
>> >> being slower than slurm liked, so it took the node offline and killed
>> >> everything on it.
>> >>
>> >> On Mon, May 11, 2020 at 12:55 PM Alastair Neil <ajneil.t...@gmail.com> 
>> >> wrote:
>> >> >
>> >> > Hi there,
>> >> >
>> >> > We are using slurm 18.08 and had a weird occurrence over the weekend.  
>> >> > A user canceled one of his jobs using scancel, and two additional jobs 
>> >> > of the user running on the same node were killed concurrently.  The 
>> >> > jobs had no dependency, but they were all allocated 1 GPU. I am curious 
>> >> > to know why this happened and, if this is a known bug, whether there is a 
>> >> > workaround to prevent it from happening. Any suggestions gratefully 
>> >> > received.
>> >> >
>> >> > -Alastair
>> >> >
>> >> > FYI
>> >> > The cancelled job (533898) has this at the end of the .err file:
>> >> >
>> >> >> slurmstepd: error: *** JOB 533898 ON NODE056 CANCELLED AT 
>> >> >> 2020-05-10T00:26:03 ***
>> >> >
>> >> >
>> >> > both of the killed jobs (533900 and 533902)  have this:
>> >> >
>> >> >> slurmstepd: error: get_exit_code task 0 died by signal
>> >> >
>> >> >
>> >> > Here is the slurmd log from the node and the show-job output for each 
>> >> > job:
>> >> >
>> >> >> [2020-05-09T19:49:46.735] _run_prolog: run job script took usec=4
>> >> >> [2020-05-09T19:49:46.735] _run_prolog: prolog with lock for job 533898 
>> >> >> ran for 0 seconds
>> >> >> [2020-05-09T19:49:46.754] ====================
>> >> >> [2020-05-09T19:49:46.754] batch_job:533898 job_mem:10240MB
>> >> >> [2020-05-09T19:49:46.754] JobNode[0] CPU[0] Job alloc
>> >> >> [2020-05-09T19:49:46.755] JobNode[0] CPU[1] Job alloc
>> >> >> [2020-05-09T19:49:46.756] JobNode[0] CPU[2] Job alloc
>> >> >> [2020-05-09T19:49:46.757] JobNode[0] CPU[3] Job alloc
>> >> >> [2020-05-09T19:49:46.758] ====================
>> >> >> [2020-05-09T19:49:46.758] Launching batch job 533898 for UID 1224431221
>> >> >> [2020-05-09T19:53:14.060] _run_prolog: run job script took usec=3
>> >> >> [2020-05-09T19:53:14.060] _run_prolog: prolog with lock for job 533900 
>> >> >> ran for 0 seconds
>> >> >> [2020-05-09T19:53:14.080] ====================
>> >> >> [2020-05-09T19:53:14.080] batch_job:533900 job_mem:10240MB
>> >> >> [2020-05-09T19:53:14.081] JobNode[0] CPU[4] Job alloc
>> >> >> [2020-05-09T19:53:14.082] JobNode[0] CPU[5] Job alloc
>> >> >> [2020-05-09T19:53:14.083] JobNode[0] CPU[6] Job alloc
>> >> >> [2020-05-09T19:53:14.083] JobNode[0] CPU[7] Job alloc
>> >> >> [2020-05-09T19:53:14.084] ====================
>> >> >> [2020-05-09T19:53:14.085] Launching batch job 533900 for UID 1224431221
>> >> >> [2020-05-09T19:55:26.283] _run_prolog: run job script took usec=21
>> >> >> [2020-05-09T19:55:26.284] _run_prolog: prolog with lock for job 533902 
>> >> >> ran for 0 seconds
>> >> >> [2020-05-09T19:55:26.304] ====================
>> >> >> [2020-05-09T19:55:26.304] batch_job:533902 job_mem:10240MB
>> >> >> [2020-05-09T19:55:26.304] JobNode[0] CPU[8] Job alloc
>> >> >> [2020-05-09T19:55:26.305] JobNode[0] CPU[9] Job alloc
>> >> >> [2020-05-09T19:55:26.306] JobNode[0] CPU[10] Job alloc
>> >> >> [2020-05-09T19:55:26.306] JobNode[0] CPU[11] Job alloc
>> >> >> [2020-05-09T19:55:26.307] ====================
>> >> >> [2020-05-09T19:55:26.307] Launching batch job 533902 for UID 1224431221
>> >> >> [2020-05-10T00:26:03.127] [533898.extern] done with job
>> >> >> [2020-05-10T00:26:03.975] [533898.batch] error: *** JOB 533898 ON 
>> >> >> NODE056 CANCELLED AT 2020-05-10T00:26:03 ***
>> >> >> [2020-05-10T00:26:04.425] [533898.batch] sending 
>> >> >> REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 15
>> >> >> [2020-05-10T00:26:04.428] [533898.batch] done with job
>> >> >> [2020-05-10T00:26:05.202] [533900.batch] error: get_exit_code task 0 
>> >> >> died by signal
>> >> >> [2020-05-10T00:26:05.202] [533902.batch] error: get_exit_code task 0 
>> >> >> died by signal
>> >> >> [2020-05-10T00:26:05.202] [533900.batch] sending 
>> >> >> REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 9
>> >> >> [2020-05-10T00:26:05.202] [533902.batch] sending 
>> >> >> REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 9
>> >> >> [2020-05-10T00:26:05.211] [533902.batch] done with job
>> >> >> [2020-05-10T00:26:05.216] [533900.batch] done with job
>> >> >> [2020-05-10T00:26:05.234] [533902.extern] done with job
>> >> >> [2020-05-10T00:26:05.235] [533900.extern] done with job
>> >> >
>> >> >
>> >> >> [root@node056 2020-05-10]# cat 533{898,900,902}/show-job.txt
>> >> >> JobId=533898 JobName=r18-relu-ent
>> >> >>  UserId=sshres2(1224431221) GroupId=users(100) MCS_label=N/A
>> >> >>  Priority=19375 Nice=0 Account=csjkosecka QOS=csqos
>> >> >>  JobState=CANCELLED Reason=None Dependency=(null)
>> >> >>  Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:15
>> >> >>  RunTime=04:36:17 TimeLimit=5-00:00:00 TimeMin=N/A
>> >> >>  SubmitTime=2020-05-09T19:49:45 EligibleTime=2020-05-09T19:49:45
>> >> >>  AccrueTime=2020-05-09T19:49:45
>> >> >>  StartTime=2020-05-09T19:49:46 EndTime=2020-05-10T00:26:03 Deadline=N/A
>> >> >>  PreemptTime=None SuspendTime=None SecsPreSuspend=0
>> >> >>  LastSchedEval=2020-05-09T19:49:46
>> >> >>  Partition=gpuq AllocNode:Sid=ARGO-2:7221
>> >> >>  ReqNodeList=(null) ExcNodeList=(null)
>> >> >>  NodeList=NODE056
>> >> >>  BatchHost=NODE056
>> >> >>  NumNodes=1 NumCPUs=4 NumTasks=0 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
>> >> >>  TRES=cpu=4,mem=10G,node=1,billing=4,gres/gpu=1
>> >> >>  Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>> >> >>  MinCPUsNode=4 MinMemoryNode=10G MinTmpDiskNode=0
>> >> >>  Features=(null) DelayBoot=00:00:00
>> >> >>  OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>> >> >>  Command=/scratch/sshres2/workspace-scratch/cs747-project/command_resnet18_relu_ent.slurm
>> >> >>  WorkDir=/scratch/sshres2/workspace-scratch/cs747-project
>> >> >>  StdErr=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-relu-ent-533898.err
>> >> >>  StdIn=/dev/null
>> >> >>  StdOut=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-relu-ent-533898.out
>> >> >>  Power=
>> >> >>  TresPerNode=gpu:1
>> >> >>
>> >> >> JobId=533900 JobName=r18-soft
>> >> >>  UserId=sshres2(1224431221) GroupId=users(100) MCS_label=N/A
>> >> >>  Priority=19375 Nice=0 Account=csjkosecka QOS=csqos
>> >> >>  JobState=FAILED Reason=JobLaunchFailure Dependency=(null)
>> >> >>  Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:9
>> >> >>  RunTime=04:32:51 TimeLimit=5-00:00:00 TimeMin=N/A
>> >> >>  SubmitTime=2020-05-09T19:53:13 EligibleTime=2020-05-09T19:53:13
>> >> >>  AccrueTime=2020-05-09T19:53:13
>> >> >>  StartTime=2020-05-09T19:53:14 EndTime=2020-05-10T00:26:05 Deadline=N/A
>> >> >>  PreemptTime=None SuspendTime=None SecsPreSuspend=0
>> >> >>  LastSchedEval=2020-05-09T19:53:14
>> >> >>  Partition=gpuq AllocNode:Sid=ARGO-2:7221
>> >> >>  ReqNodeList=(null) ExcNodeList=(null)
>> >> >>  NodeList=NODE056
>> >> >>  BatchHost=NODE056
>> >> >>  NumNodes=1 NumCPUs=4 NumTasks=0 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
>> >> >>  TRES=cpu=4,mem=10G,node=1,billing=4,gres/gpu=1
>> >> >>  Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>> >> >>  MinCPUsNode=4 MinMemoryNode=10G MinTmpDiskNode=0
>> >> >>  Features=(null) DelayBoot=00:00:00
>> >> >>  OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>> >> >>  Command=/scratch/sshres2/workspace-scratch/cs747-project/command_resnet18_soft.slurm
>> >> >>  WorkDir=/scratch/sshres2/workspace-scratch/cs747-project
>> >> >>  StdErr=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-soft-533900.err
>> >> >>  StdIn=/dev/null
>> >> >>  StdOut=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-soft-533900.out
>> >> >>  Power=
>> >> >>  TresPerNode=gpu:1
>> >> >>
>> >> >> JobId=533902 JobName=r18-soft-ent
>> >> >>  UserId=sshres2(1224431221) GroupId=users(100) MCS_label=N/A
>> >> >>  Priority=19375 Nice=0 Account=csjkosecka QOS=csqos
>> >> >>  JobState=FAILED Reason=JobLaunchFailure Dependency=(null)
>> >> >>  Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:9
>> >> >>  RunTime=04:30:39 TimeLimit=5-00:00:00 TimeMin=N/A
>> >> >>  SubmitTime=2020-05-09T19:55:26 EligibleTime=2020-05-09T19:55:26
>> >> >>  AccrueTime=2020-05-09T19:55:26
>> >> >>  StartTime=2020-05-09T19:55:26 EndTime=2020-05-10T00:26:05 Deadline=N/A
>> >> >>  PreemptTime=None SuspendTime=None SecsPreSuspend=0
>> >> >>  LastSchedEval=2020-05-09T19:55:26
>> >> >>  Partition=gpuq AllocNode:Sid=ARGO-2:7221
>> >> >>  ReqNodeList=(null) ExcNodeList=(null)
>> >> >>  NodeList=NODE056
>> >> >>  BatchHost=NODE056
>> >> >>  NumNodes=1 NumCPUs=4 NumTasks=0 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
>> >> >>  TRES=cpu=4,mem=10G,node=1,billing=4,gres/gpu=1
>> >> >>  Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>> >> >>  MinCPUsNode=4 MinMemoryNode=10G MinTmpDiskNode=0
>> >> >>  Features=(null) DelayBoot=00:00:00
>> >> >>  OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>> >> >>  Command=/scratch/sshres2/workspace-scratch/cs747-project/command_resnet18_soft_ent.slurm
>> >> >>  WorkDir=/scratch/sshres2/workspace-scratch/cs747-project
>> >> >>  StdErr=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-soft-ent-533902.err
>> >> >>  StdIn=/dev/null
>> >> >>  StdOut=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-soft-ent-533902.out
>> >> >>  Power=
>> >> >>  TresPerNode=gpu:1
>> >> >
>> >> >
>> >> >
>> >>
>>
