Yep, from your scontrol show node output:

CfgTRES=cpu=64,mem=2052077M,billing=64
AllocTRES=cpu=1,mem=2052077M

The running job (77) has allocated 1 CPU and all the memory on the node. That’s 
probably due to the partition using the default DefMemPerCPU value [1], which 
is unlimited.
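
If you want a quick test without changing slurm.conf, the job can request memory explicitly at submit time so it doesn't take the whole node (the amount and script name below are only illustrative; --mem-per-cpu works similarly, per allocated CPU):

    sbatch --mem=32G your_job_script.shell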

Since all our nodes are shared, and our workloads vary widely, we set our 
DefMemPerCPU value to something considerably lower than 
mem_in_node/cores_in_node. That way, most jobs will leave some memory
available by default, and other jobs can use that extra memory as long as CPUs 
are available.
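
For example, on a 64-core node with ~2 TB of RAM, mem/cores works out to roughly 32 GB per core, so something along these lines in slurm.conf would cap the default well below that (the 16384 MB value is purely illustrative, not a recommendation):

    # global default, in MB per allocated CPU
    DefMemPerCPU=16384
    # or set it on the partition instead
    PartitionName=mainpart ... DefMemPerCPU=16384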

[1] https://slurm.schedmd.com/slurm.conf.html#OPT_DefMemPerCPU

From: Alison Peterson <apeters...@sdsu.edu>
Date: Thursday, April 4, 2024 at 11:58 AM
To: Renfro, Michael <ren...@tntech.edu>
Subject: Re: [EXT] Re: [slurm-users] SLURM configuration help

Here is the info:
sma@cusco:/data/work/sma-scratch/tohoku_wOcean$ scontrol show node cusco

NodeName=cusco Arch=x86_64 CoresPerSocket=32
   CPUAlloc=1 CPUTot=64 CPULoad=0.02
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:4
   NodeAddr=cusco NodeHostName=cusco Version=19.05.5
   OS=Linux 5.4.0-172-generic #190-Ubuntu SMP Fri Feb 2 23:24:22 UTC 2024
   RealMemory=2052077 AllocMem=2052077 FreeMem=1995947 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=mainpart
   BootTime=2024-03-01T17:06:26 SlurmdStartTime=2024-03-01T17:06:53
   CfgTRES=cpu=64,mem=2052077M,billing=64
   AllocTRES=cpu=1,mem=2052077M
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

sma@cusco:/data/work/sma-scratch/tohoku_wOcean$ squeue

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                78  mainpart CF1090_w      sma PD       0:00      1 (Resources)
                77  mainpart CF0000_w      sma  R       0:26      1 cusco
sma@cusco:/data/work/sma-scratch/tohoku_wOcean$ scontrol show job 78

JobId=78 JobName=CF1090_wOcean500m.shell
   UserId=sma(1008) GroupId=myfault(1001) MCS_label=N/A
   Priority=4294901720 Nice=0 Account=(null) QOS=(null)
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2024-04-04T09:55:34 EligibleTime=2024-04-04T09:55:34
   AccrueTime=2024-04-04T09:55:34
   StartTime=2024-04-04T10:55:28 EndTime=2024-04-04T11:55:28 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-04-04T09:55:58
   Partition=mainpart AllocNode:Sid=newcusco:2450574
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=cusco
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=YES Contiguous=0 Licenses=(null) Network=(null)
   Command=/data/work/sma-scratch/tohoku_wOcean/CF1090_wOcean500m.shell
   WorkDir=/data/work/sma-scratch/tohoku_wOcean
   StdErr=/data/work/sma-scratch/tohoku_wOcean/slurm-78.out
   StdIn=/dev/null
   StdOut=/data/work/sma-scratch/tohoku_wOcean/slurm-78.out
   Power=

On Thu, Apr 4, 2024 at 8:57 AM Renfro, Michael <ren...@tntech.edu> wrote:
What does “scontrol show node cusco” and “scontrol show job PENDING_JOB_ID” 
show?

As an example, one job we currently have that's pending due to Resources has 
requested 90 CPUs and 180 GB of memory (seen in its ReqTRES= value), but the 
node it wants to run on only has 37 CPUs available (seen by comparing the node's 
CfgTRES= and AllocTRES= values).
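
As a made-up illustration of that comparison (the numbers here are hypothetical), a node reporting

   CfgTRES=cpu=128,mem=500000M
   AllocTRES=cpu=91,mem=200000M

has 128 - 91 = 37 CPUs free, so a 90-CPU request keeps pending with Reason=Resources.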

From: Alison Peterson via slurm-users <slurm-users@lists.schedmd.com>
Date: Thursday, April 4, 2024 at 10:43 AM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] SLURM configuration help

I am writing to seek assistance with a critical issue on our single-node system 
managed by Slurm. Our jobs are queued and marked as waiting for resources, but 
they are not starting even though resources appear to be available. I'm new to 
Slurm; my only experience is a class on installing it, so I have no experience 
running or using it.

Issue Summary:

Main Problem: When jobs are submitted, only one runs and the second shows 
(Resources) under NODELIST(REASON). I've checked that our single node has 
enough RAM (2 TB) and CPUs (64) available.

# COMPUTE NODES
NodeName=cusco CPUs=64 Sockets=2 CoresPerSocket=32 ThreadsPerCore=1 RealMemory=2052077 Gres=gpu:1,gpu:1,gpu:1,gpu:1
PartitionName=mainpart Default=YES MinNodes=1 DefaultTime=00:60:00 MaxTime=UNLIMITED AllowAccounts=ALL Nodes=ALL State=UP OverSubscribe=Force


System Details: We have a single-node setup with Slurm as the workload manager. 
The node appears to have sufficient resources for the queued jobs.

Troubleshooting Performed:

Configuration Checks: I have verified all Slurm configurations and the system's 
resource availability, which should not be limiting job execution.

Service Status: The Slurm daemon slurmdbd is active and running without any 
reported issues. System resource monitoring shows no shortages that would 
prevent job initiation.

Any guidance and help will be deeply appreciated!

--
Alison Peterson
IT Research Support Analyst
Information Technology
apeters...@sdsu.edu
O: 619-594-3364
San Diego State University | SDSU.edu
5500 Campanile Drive | San Diego, CA 92182-8080



--
Alison Peterson
IT Research Support Analyst
Information Technology
apeters...@sdsu.edu
O: 619-594-3364
San Diego State University | SDSU.edu
5500 Campanile Drive | San Diego, CA 92182-8080
