Thank you, Michael, for pitching in to troubleshoot the config file. Now my config file looks like:

ClusterName=linux
ControlMachine=abhi-Latitude-E6430
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
SwitchType=switch/none
MpiDefault=none
ProctrackType=proctrack/pgid
Epilog=/usr/local/slurm/sbin/epilog
Prolog=/usr/local/slurm/sbin/prolog
SlurmdSpoolDir=/var/tmp/slurmd.spool
StateSaveLocation=/usr/local/slurm/slurm.state
TmpFS=/tmp
NodeName=abhi-Lenovo-ideapad-330-15IKB CPUs=4
NodeName=abhi-HP-EliteBook-840-G2 CPUs=4
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
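One way to double-check those NodeName lines, assuming slurmd is installed in the usual place, is slurmd -C, which prints the hardware slurmd actually detects, already in slurm.conf syntax (the values below are placeholders, not taken from this thread):

    # Run on each compute node; the printed line can be pasted into slurm.conf.
    abhi@abhi-Lenovo-ideapad-330-15IKB:~$ slurmd -C
    NodeName=abhi-Lenovo-ideapad-330-15IKB CPUs=4 Boards=1 SocketsPerBoard=... CoresPerSocket=... ThreadsPerCore=... RealMemory=...

If slurm.conf declares more CPUs or RealMemory than slurmd -C reports, the node will register as down.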
abhi@abhi-Lenovo-ideapad-330-15IKB:~$ service slurmd status
● slurmd.service - Slurm node daemon
   Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2020-05-14 04:11:32 IST; 2h 28min ago
     Docs: man:slurmd(8)
  Process: 977 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 1028 (slurmd)
    Tasks: 2
   Memory: 3.9M
   CGroup: /system.slice/slurmd.service
           └─1028 /usr/sbin/slurmd

abhi@abhi-HP-EliteBook-840-G2:~$ service slurmd status
● slurmd.service - Slurm node daemon
   Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2020-05-14 04:18:51 IST; 2h 24min ago
     Docs: man:slurmd(8)
  Process: 1313 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 1372 (slurmd)
    Tasks: 2
   Memory: 3.8M
   CGroup: /system.slice/slurmd.service
           └─1372 /usr/sbin/slurmd

abhi@abhi-Latitude-E6430:~$ service slurmctld status
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2020-05-14 04:11:21 IST; 2h 32min ago
     Docs: man:slurmctld(8)
  Process: 1208 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 1306 (slurmctld)
    Tasks: 7
   Memory: 6.7M
   CGroup: /system.slice/slurmctld.service
           └─1306 /usr/sbin/slurmctld

However, still:

sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE  NODELIST
debug*       up   infinite      1  down*  abhi-Lenovo-ideapad-330-15IKB

My study is inconclusive.

Best Regards,
Abhinandan H. Patil, +919886406214
https://www.AbhinandanHPatil.info

----- Forwarded message -----
From: "slurm-users-requ...@lists.schedmd.com" <slurm-users-requ...@lists.schedmd.com>
To: "slurm-users@lists.schedmd.com" <slurm-users@lists.schedmd.com>
Sent: Thursday, 14 May 2020, 2:39:40 am GMT+5:30
Subject: slurm-users Digest, Vol 31, Issue 50

Today's Topics:

  1. Re: Ubuntu Cluster with Slurm (Renfro, Michael)
  2. Re: sacct returns nothing after reboot (Roger Mason)
  3. Re: Reset TMPDIR for All Jobs (Ellestad, Erik)
  4. Re: additional jobs killed by scancel. (Alastair Neil)

----------------------------------------------------------------------

Message: 1
Date: Wed, 13 May 2020 14:05:21 +0000
From: "Renfro, Michael" <ren...@tntech.edu>
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] Ubuntu Cluster with Slurm
Message-ID: <b4e26014-e420-4506-8a7f-dcdf01e4a...@tntech.edu>
Content-Type: text/plain; charset="utf-8"

I'd compare the RealMemory part of "scontrol show node abhi-HP-EliteBook-840-G2" to the RealMemory part of your slurm.conf:

> Nodes which register to the system with less than the configured resources
> (e.g. too little memory), will be placed in the "DOWN" state to avoid
> scheduling jobs on them.

-- https://slurm.schedmd.com/slurm.conf.html
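Concretely, that comparison and the follow-up reset might look like this (standard scontrol subcommands; the node name is the one from this thread):

    # What the controller recorded for the node, versus slurm.conf:
    scontrol show node abhi-Lenovo-ideapad-330-15IKB | grep -E 'CPUTot|RealMemory|State|Reason'

    # Once slurm.conf matches the hardware and the daemons are restarted,
    # clear the down* state:
    scontrol update NodeName=abhi-Lenovo-ideapad-330-15IKB State=RESUME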
As far as GPUs go, it looks like you have Intel graphics on the Lenovo and a Radeon R7 on the HP? If so, then nothing is CUDA-compatible, but you might be able to make something work with OpenCL. No idea if that would give performance improvements over the CPUs, though.

--
Mike Renfro, PhD / HPC Systems Administrator, Information Technology Services
931 372-3601 / Tennessee Tech University

> On May 13, 2020, at 8:42 AM, Abhinandan Patil <abhinandan_patil_1...@yahoo.com> wrote:
>
> Dear All,
>
> Preamble
> ----------
> I want to form a simple cluster with three laptops:
> abhi-Latitude-E6430            // This serves as the controller
> abhi-Lenovo-ideapad-330-15IKB  // Compute node
> abhi-HP-EliteBook-840-G2       // Compute node
>
> Aim
> -------------
> I want to make use of the CPU+GPU+RAM on all the machines when I execute
> Java or Python programs.
>
> Implementation
> ------------------------
> Now let us look at the slurm.conf on machine abhi-Latitude-E6430:
>
> ClusterName=linux
> ControlMachine=abhi-Latitude-E6430
> SlurmUser=abhi
> SlurmctldPort=6817
> SlurmdPort=6818
> AuthType=auth/munge
> SwitchType=switch/none
> StateSaveLocation=/tmp
> MpiDefault=none
> ProctrackType=proctrack/pgid
> NodeName=abhi-Lenovo-ideapad-330-15IKB RealMemory=12000 CPUs=2
> NodeName=abhi-HP-EliteBook-840-G2 RealMemory=14000 CPUs=2
> PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
>
> The same slurm.conf is copied to all the machines.
>
> Observations
> --------------------------------------
> Now when I do:
>
> abhi@abhi-HP-EliteBook-840-G2:~$ service slurmd status
> ● slurmd.service - Slurm node daemon
>    Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
>    Active: active (running) since Wed 2020-05-13 18:50:01 IST; 1min 49s ago
>      Docs: man:slurmd(8)
>   Process: 98235 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
>  Main PID: 98253 (slurmd)
>     Tasks: 2
>    Memory: 2.2M
>    CGroup: /system.slice/slurmd.service
>            └─98253 /usr/sbin/slurmd
>
> abhi@abhi-Lenovo-ideapad-330-15IKB:~$ service slurmd status
> ● slurmd.service - Slurm node daemon
>    Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
>    Active: active (running) since Wed 2020-05-13 18:50:20 IST; 8s ago
>      Docs: man:slurmd(8)
>   Process: 71709 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
>  Main PID: 71734 (slurmd)
>     Tasks: 2
>    Memory: 2.0M
>    CGroup: /system.slice/slurmd.service
>            └─71734 /usr/sbin/slurmd
>
> abhi@abhi-Latitude-E6430:~$ service slurmctld status
> ● slurmctld.service - Slurm controller daemon
>    Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled; vendor preset: enabled)
>    Active: active (running) since Wed 2020-05-13 18:48:58 IST; 4min 56s ago
>      Docs: man:slurmctld(8)
>   Process: 97114 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
>  Main PID: 97116 (slurmctld)
>     Tasks: 7
>    Memory: 2.6M
>    CGroup: /system.slice/slurmctld.service
>            └─97116 /usr/sbin/slurmctld
>
> However:
>
> abhi@abhi-Latitude-E6430:~$ sinfo
> PARTITION AVAIL  TIMELIMIT  NODES  STATE  NODELIST
> debug*       up   infinite      1  down*  abhi-Lenovo-ideapad-330-15IKB
>
> Advice needed
> ------------------------
> Please let me know why I am seeing only one node.
> Further, how is the total memory calculated? Can Slurm make use of GPU
> processing power as well?
> Please let me know if I have missed something in the configuration or explanation.
>
> Thank you all
>
> Best Regards,
> Abhinandan H. Patil, +919886406214
> https://www.AbhinandanHPatil.info

------------------------------
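On the GPU question above: Slurm only schedules GPUs that are declared as generic resources (GRES). A rough sketch of the moving parts, with placeholder device paths (non-NVIDIA cards may or may not be usable this way, and the application itself still has to speak OpenCL or CUDA):

    # slurm.conf (same copy on all machines):
    GresTypes=gpu
    NodeName=abhi-HP-EliteBook-840-G2 CPUs=2 RealMemory=14000 Gres=gpu:1

    # gres.conf on that node (the device file below is a placeholder):
    Name=gpu File=/dev/dri/card0

    # Jobs must then request the GPU explicitly:
    srun --gres=gpu:1 ./my_opencl_program

Without a Gres= declaration Slurm schedules CPUs and memory only; note also that it allocates resources per node rather than pooling the RAM or GPUs of all machines into one address space.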
Message: 2
Date: Wed, 13 May 2020 12:20:11 -0230
From: Roger Mason <rma...@mun.ca>
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] sacct returns nothing after reboot
Message-ID: <y65sgg399ek....@mun.ca>
Content-Type: text/plain

Hello,

Marcus Boden <mbo...@gwdg.de> writes:

> the default time window starts at 00:00:00 of the current day:
>
>   -S, --starttime
>       Select jobs in any state after the specified time. Default
>       is 00:00:00 of the current day, unless the '-s' or '-j'
>       options are used. If the '-s' option is used, then the
>       default is 'now'. If states are given with the '-s' option
>       then only jobs in this state at this time will be returned.
>       If the '-j' option is used, then the default time is Unix
>       Epoch 0. See the DEFAULT TIME WINDOW for more details.

Thank you! Obviously I did not read far enough down the man page.

Roger

------------------------------

Message: 3
Date: Wed, 13 May 2020 15:18:09 +0000
From: "Ellestad, Erik" <erik.elles...@ucsf.edu>
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] Reset TMPDIR for All Jobs
Message-ID: <by5pr05mb690060b056d48d6b031a2da99a...@by5pr05mb6900.namprd05.prod.outlook.com>
Content-Type: text/plain; charset="utf-8"

Woo! Thanks Marcus, that works.

Though, ahem, SLURM/SchedMD, if you're listening, would it hurt to cover this in the documentation regarding prolog/epilog, and maybe give an example?

https://slurm.schedmd.com/prolog_epilog.html

Just a thought,

Erik
--
Erik Ellestad
Wynton Cluster SysAdmin
UCSF

-----Original Message-----
From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Marcus Wagner
Sent: Tuesday, May 12, 2020 10:08 PM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] Reset TMPDIR for All Jobs

Hi Erik,

the output of the task prolog is sourced/evaluated (not really sure how) in the job environment. Thus you don't export a variable in the task prolog; instead you echo the export, e.g.

echo export TMPDIR=/scratch/$SLURM_JOB_ID

The variable will then be set in the job environment.

Best
Marcus

Am 12.05.2020 um 17:40 schrieb Ellestad, Erik:
> I wanted to set TMPDIR from /tmp to a per-job directory I create in
> local /scratch/$SLURM_JOB_ID (for example).
>
> This bug suggests I should be able to do this in a task prolog:
>
> https://bugs.schedmd.com/show_bug.cgi?id=2664
>
> However, adding the following to the task prolog doesn't seem to affect
> the variables the job script is running with:
>
> unset TMPDIR
> export TMPDIR=/scratch/$SLURM_JOB_ID
>
> It does work if it is done in the job script, rather than the task prolog.
>
> Am I missing something?
>
> Erik
> --
> Erik Ellestad
> Wynton Cluster SysAdmin
> UCSF
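To make that concrete, a minimal TaskProlog sketch (assuming slurm.conf points TaskProlog= at this script, and that the per-job directory is created elsewhere, e.g. by the root-run Prolog):

    #!/bin/bash
    # TaskProlog script: lines written to stdout in the form
    #   export NAME=value
    # are applied by slurmd to the environment of the task being spawned.
    echo "export TMPDIR=/scratch/$SLURM_JOB_ID"

A matching Epilog can then delete /scratch/$SLURM_JOB_ID when the job finishes.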
------------------------------

Message: 4
Date: Wed, 13 May 2020 17:08:55 -0400
From: Alastair Neil <ajneil.t...@gmail.com>
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] additional jobs killed by scancel.
Message-ID: <ca+sarwpqmepkhwlc_ruqsi1szanb8mhk77wcsfqaffytb7f...@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

invalid field requested: "reason"

On Tue, 12 May 2020 at 16:47, Steven Dick <kg4...@gmail.com> wrote:
> What do you get from
>
> sacct -o jobid,elapsed,reason,exit -j 533900,533902
>
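For what it's worth, "reason" is apparently not a recognized sacct field in this 18.08 setup; State and ExitCode are long-standing field names, so a variant along these lines should be accepted (job IDs taken from the thread):

    sacct -o jobid,elapsed,state,exitcode -j 533900,533902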
> On Tue, May 12, 2020 at 4:12 PM Alastair Neil <ajneil.t...@gmail.com> wrote:
> >
> > The log is continuous and has all the messages logged by slurmd on the
> > node for all the jobs mentioned. Below are the entries from the slurmctld log:
> >
> >> [2020-05-10T00:26:03.097] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=533898 uid 1224431221
> >> [2020-05-10T00:26:03.098] email msg to sshr...@masonlive.gmu.edu: Slurm Job_id=533898 Name=r18-relu-ent Ended, Run time 04:36:17, CANCELLED, ExitCode 0
> >> [2020-05-10T00:26:03.098] job_signal: 9 of running JobId=533898 successful 0x8004
> >> [2020-05-10T00:26:05.204] _job_complete: JobId=533902 WTERMSIG 9
> >> [2020-05-10T00:26:05.204] email msg to sshr...@masonlive.gmu.edu: Slurm Job_id=533902 Name=r18-soft-ent Failed, Run time 04:30:39, FAILED
> >> [2020-05-10T00:26:05.205] _job_complete: JobId=533902 done
> >> [2020-05-10T00:26:05.210] _job_complete: JobId=533900 WTERMSIG 9
> >> [2020-05-10T00:26:05.210] email msg to sshr...@masonlive.gmu.edu: Slurm Job_id=533900 Name=r18-soft Failed, Run time 04:32:51, FAILED
> >> [2020-05-10T00:26:05.215] _job_complete: JobId=533900 done
> >
> > It is curious that all the jobs were running on the same processor; perhaps this is a cgroup-related failure?
> >
> > On Tue, 12 May 2020 at 10:10, Steven Dick <kg4...@gmail.com> wrote:
> >>
> >> I see one job cancelled and two jobs failed.
> >> Your slurmd log is incomplete -- it doesn't show the two failed jobs
> >> exiting/failing, so the real error is not here.
> >>
> >> It might also be helpful to look through slurmctld's log starting from
> >> when the first job was canceled, looking at any messages mentioning
> >> the node or the two failed jobs.
> >>
> >> I've had nodes do strange things on job cancel. The last one I tracked
> >> down to the job epilog failing because it was NFS mounted and NFS was
> >> being slower than Slurm liked, so it took the node offline and killed
> >> everything on it.
> >>
> >> On Mon, May 11, 2020 at 12:55 PM Alastair Neil <ajneil.t...@gmail.com> wrote:
> >> >
> >> > Hi there,
> >> >
> >> > We are using Slurm 18.08 and had a weird occurrence over the weekend.
> >> > A user canceled one of his jobs using scancel, and two additional jobs
> >> > of the user running on the same node were killed concurrently. The jobs
> >> > had no dependency, but they were all allocated 1 GPU. I am curious to
> >> > know why this happened, and if this is a known bug, is there a workaround
> >> > to prevent it happening? Any suggestions gratefully received.
> >> >
> >> > -Alastair
> >> >
> >> > FYI
> >> > The cancelled job (533898) has this at the end of the .err file:
> >> >
> >> >> slurmstepd: error: *** JOB 533898 ON NODE056 CANCELLED AT 2020-05-10T00:26:03 ***
> >> >
> >> > Both of the killed jobs (533900 and 533902) have this:
> >> >
> >> >> slurmstepd: error: get_exit_code task 0 died by signal
> >> >
> >> > Here is the slurmd log from the node and the show-job output for each job:
> >> >
> >> >> [2020-05-09T19:49:46.735] _run_prolog: run job script took usec=4
> >> >> [2020-05-09T19:49:46.735] _run_prolog: prolog with lock for job 533898 ran for 0 seconds
> >> >> [2020-05-09T19:49:46.754] ====================
> >> >> [2020-05-09T19:49:46.754] batch_job:533898 job_mem:10240MB
> >> >> [2020-05-09T19:49:46.754] JobNode[0] CPU[0] Job alloc
> >> >> [2020-05-09T19:49:46.755] JobNode[0] CPU[1] Job alloc
> >> >> [2020-05-09T19:49:46.756] JobNode[0] CPU[2] Job alloc
> >> >> [2020-05-09T19:49:46.757] JobNode[0] CPU[3] Job alloc
> >> >> [2020-05-09T19:49:46.758] ====================
> >> >> [2020-05-09T19:49:46.758] Launching batch job 533898 for UID 1224431221
> >> >> [2020-05-09T19:53:14.060] _run_prolog: run job script took usec=3
> >> >> [2020-05-09T19:53:14.060] _run_prolog: prolog with lock for job 533900 ran for 0 seconds
> >> >> [2020-05-09T19:53:14.080] ====================
> >> >> [2020-05-09T19:53:14.080] batch_job:533900 job_mem:10240MB
> >> >> [2020-05-09T19:53:14.081] JobNode[0] CPU[4] Job alloc
> >> >> [2020-05-09T19:53:14.082] JobNode[0] CPU[5] Job alloc
> >> >> [2020-05-09T19:53:14.083] JobNode[0] CPU[6] Job alloc
> >> >> [2020-05-09T19:53:14.083] JobNode[0] CPU[7] Job alloc
> >> >> [2020-05-09T19:53:14.084] ====================
> >> >> [2020-05-09T19:53:14.085] Launching batch job 533900 for UID 1224431221
> >> >> [2020-05-09T19:55:26.283] _run_prolog: run job script took usec=21
> >> >> [2020-05-09T19:55:26.284] _run_prolog: prolog with lock for job 533902 ran for 0 seconds
> >> >> [2020-05-09T19:55:26.304] ====================
> >> >> [2020-05-09T19:55:26.304] batch_job:533902 job_mem:10240MB
> >> >> [2020-05-09T19:55:26.304] JobNode[0] CPU[8] Job alloc
> >> >> [2020-05-09T19:55:26.305] JobNode[0] CPU[9] Job alloc
> >> >> [2020-05-09T19:55:26.306] JobNode[0] CPU[10] Job alloc
> >> >> [2020-05-09T19:55:26.306] JobNode[0] CPU[11] Job alloc
> >> >> [2020-05-09T19:55:26.307] ====================
> >> >> [2020-05-09T19:55:26.307] Launching batch job 533902 for UID 1224431221
> >> >> [2020-05-10T00:26:03.127] [533898.extern] done with job
> >> >> [2020-05-10T00:26:03.975] [533898.batch] error: *** JOB 533898 ON NODE056 CANCELLED AT 2020-05-10T00:26:03 ***
> >> >> [2020-05-10T00:26:04.425] [533898.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 15
> >> >> [2020-05-10T00:26:04.428] [533898.batch] done with job
> >> >> [2020-05-10T00:26:05.202] [533900.batch] error: get_exit_code task 0 died by signal
> >> >> [2020-05-10T00:26:05.202] [533902.batch] error: get_exit_code task 0 died by signal
> >> >> [2020-05-10T00:26:05.202] [533900.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 9
> >> >> [2020-05-10T00:26:05.202] [533902.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 9
> >> >> [2020-05-10T00:26:05.211] [533902.batch] done with job
> >> >> [2020-05-10T00:26:05.216] [533900.batch] done with job
> >> >> [2020-05-10T00:26:05.234] [533902.extern] done with job
> >> >> [2020-05-10T00:26:05.235] [533900.extern] done with job
> >> >
> >> >> [root@node056 2020-05-10]# cat 533{898,900,902}/show-job.txt
> >> >> JobId=533898 JobName=r18-relu-ent
> >> >>    UserId=sshres2(1224431221) GroupId=users(100) MCS_label=N/A
> >> >>    Priority=19375 Nice=0 Account=csjkosecka QOS=csqos
> >> >>    JobState=CANCELLED Reason=None Dependency=(null)
> >> >>    Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:15
> >> >>    RunTime=04:36:17 TimeLimit=5-00:00:00 TimeMin=N/A
> >> >>    SubmitTime=2020-05-09T19:49:45 EligibleTime=2020-05-09T19:49:45
> >> >>    AccrueTime=2020-05-09T19:49:45
> >> >>    StartTime=2020-05-09T19:49:46 EndTime=2020-05-10T00:26:03 Deadline=N/A
> >> >>    PreemptTime=None SuspendTime=None SecsPreSuspend=0
> >> >>    LastSchedEval=2020-05-09T19:49:46
> >> >>    Partition=gpuq AllocNode:Sid=ARGO-2:7221
> >> >>    ReqNodeList=(null) ExcNodeList=(null)
> >> >>    NodeList=NODE056
> >> >>    BatchHost=NODE056
> >> >>    NumNodes=1 NumCPUs=4 NumTasks=0 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
> >> >>    TRES=cpu=4,mem=10G,node=1,billing=4,gres/gpu=1
> >> >>    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
> >> >>    MinCPUsNode=4 MinMemoryNode=10G MinTmpDiskNode=0
> >> >>    Features=(null) DelayBoot=00:00:00
> >> >>    OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
> >> >>    Command=/scratch/sshres2/workspace-scratch/cs747-project/command_resnet18_relu_ent.slurm
> >> >>    WorkDir=/scratch/sshres2/workspace-scratch/cs747-project
> >> >>    StdErr=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-relu-ent-533898.err
> >> >>    StdIn=/dev/null
> >> >>    StdOut=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-relu-ent-533898.out
> >> >>    Power=
> >> >>    TresPerNode=gpu:1
> >> >>
> >> >> JobId=533900 JobName=r18-soft
> >> >>    UserId=sshres2(1224431221) GroupId=users(100) MCS_label=N/A
> >> >>    Priority=19375 Nice=0 Account=csjkosecka QOS=csqos
> >> >>    JobState=FAILED Reason=JobLaunchFailure Dependency=(null)
> >> >>    Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:9
> >> >>    RunTime=04:32:51 TimeLimit=5-00:00:00 TimeMin=N/A
> >> >>    SubmitTime=2020-05-09T19:53:13 EligibleTime=2020-05-09T19:53:13
> >> >>    AccrueTime=2020-05-09T19:53:13
> >> >>    StartTime=2020-05-09T19:53:14 EndTime=2020-05-10T00:26:05 Deadline=N/A
> >> >>    PreemptTime=None SuspendTime=None SecsPreSuspend=0
> >> >>    LastSchedEval=2020-05-09T19:53:14
> >> >>    Partition=gpuq AllocNode:Sid=ARGO-2:7221
> >> >>    ReqNodeList=(null) ExcNodeList=(null)
> >> >>    NodeList=NODE056
> >> >>    BatchHost=NODE056
> >> >>    NumNodes=1 NumCPUs=4 NumTasks=0 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
> >> >>    TRES=cpu=4,mem=10G,node=1,billing=4,gres/gpu=1
> >> >>    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
> >> >>    MinCPUsNode=4 MinMemoryNode=10G MinTmpDiskNode=0
> >> >>    Features=(null) DelayBoot=00:00:00
> >> >>    OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
> >> >>    Command=/scratch/sshres2/workspace-scratch/cs747-project/command_resnet18_soft.slurm
> >> >>    WorkDir=/scratch/sshres2/workspace-scratch/cs747-project
> >> >>    StdErr=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-soft-533900.err
> >> >>    StdIn=/dev/null
> >> >>    StdOut=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-soft-533900.out
> >> >>    Power=
> >> >>    TresPerNode=gpu:1
> >> >>
> >> >> JobId=533902 JobName=r18-soft-ent
> >> >>    UserId=sshres2(1224431221) GroupId=users(100) MCS_label=N/A
> >> >>    Priority=19375 Nice=0 Account=csjkosecka QOS=csqos
> >> >>    JobState=FAILED Reason=JobLaunchFailure Dependency=(null)
> >> >>    Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:9
> >> >>    RunTime=04:30:39 TimeLimit=5-00:00:00 TimeMin=N/A
> >> >>    SubmitTime=2020-05-09T19:55:26 EligibleTime=2020-05-09T19:55:26
> >> >>    AccrueTime=2020-05-09T19:55:26
> >> >>    StartTime=2020-05-09T19:55:26 EndTime=2020-05-10T00:26:05 Deadline=N/A
> >> >>    PreemptTime=None SuspendTime=None SecsPreSuspend=0
> >> >>    LastSchedEval=2020-05-09T19:55:26
> >> >>    Partition=gpuq AllocNode:Sid=ARGO-2:7221
> >> >>    ReqNodeList=(null) ExcNodeList=(null)
> >> >>    NodeList=NODE056
> >> >>    BatchHost=NODE056
> >> >>    NumNodes=1 NumCPUs=4 NumTasks=0 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
> >> >>    TRES=cpu=4,mem=10G,node=1,billing=4,gres/gpu=1
> >> >>    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
> >> >>    MinCPUsNode=4 MinMemoryNode=10G MinTmpDiskNode=0
> >> >>    Features=(null) DelayBoot=00:00:00
> >> >>    OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
> >> >>    Command=/scratch/sshres2/workspace-scratch/cs747-project/command_resnet18_soft_ent.slurm
> >> >>    WorkDir=/scratch/sshres2/workspace-scratch/cs747-project
> >> >>    StdErr=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-soft-ent-533902.err
> >> >>    StdIn=/dev/null
> >> >>    StdOut=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-soft-ent-533902.out
> >> >>    Power=
> >> >>    TresPerNode=gpu:1

End of slurm-users Digest, Vol 31, Issue 50
*******************************************