Thank you!!!! That was the issue. I'm so happy :-) Sending you many thanks. (A rough sketch of that kind of DefMemPerCPU change is included after the quoted thread below.)
On Thu, Apr 4, 2024 at 10:11 AM Renfro, Michael <ren...@tntech.edu> wrote:

> Yep, from your scontrol show node output:
>
>   CfgTRES=cpu=64,mem=2052077M,billing=64
>   AllocTRES=cpu=1,mem=2052077M
>
> The running job (77) has allocated 1 CPU and all the memory on the node.
> That's probably due to the partition using the default DefMemPerCPU value
> [1], which is unlimited.
>
> Since all our nodes are shared, and our workloads vary widely, we set our
> DefMemPerCPU value to something considerably lower than
> mem_in_node/cores_in_node. That way, most jobs will leave some memory
> available by default, and other jobs can use that extra memory as long as
> CPUs are available.
>
> [1] https://slurm.schedmd.com/slurm.conf.html#OPT_DefMemPerCPU
>
> From: Alison Peterson <apeters...@sdsu.edu>
> Date: Thursday, April 4, 2024 at 11:58 AM
> To: Renfro, Michael <ren...@tntech.edu>
> Subject: Re: [EXT] Re: [slurm-users] SLURM configuration help
>
> Here is the info:
>
> sma@cusco:/data/work/sma-scratch/tohoku_wOcean$ scontrol show node cusco
>
> NodeName=cusco Arch=x86_64 CoresPerSocket=32
>    CPUAlloc=1 CPUTot=64 CPULoad=0.02
>    AvailableFeatures=(null)
>    ActiveFeatures=(null)
>    Gres=gpu:4
>    NodeAddr=cusco NodeHostName=cusco Version=19.05.5
>    OS=Linux 5.4.0-172-generic #190-Ubuntu SMP Fri Feb 2 23:24:22 UTC 2024
>    RealMemory=2052077 AllocMem=2052077 FreeMem=1995947 Sockets=2 Boards=1
>    State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>    Partitions=mainpart
>    BootTime=2024-03-01T17:06:26 SlurmdStartTime=2024-03-01T17:06:53
>    CfgTRES=cpu=64,mem=2052077M,billing=64
>    AllocTRES=cpu=1,mem=2052077M
>    CapWatts=n/a
>    CurrentWatts=0 AveWatts=0
>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
> sma@cusco:/data/work/sma-scratch/tohoku_wOcean$ squeue
>
>   JOBID PARTITION     NAME USER ST  TIME NODES NODELIST(REASON)
>      78  mainpart CF1090_w  sma PD  0:00     1 (Resources)
>      77  mainpart CF0000_w  sma  R  0:26     1 cusco
>
> sma@cusco:/data/work/sma-scratch/tohoku_wOcean$ scontrol show job 78
>
> JobId=78 JobName=CF1090_wOcean500m.shell
>    UserId=sma(1008) GroupId=myfault(1001) MCS_label=N/A
>    Priority=4294901720 Nice=0 Account=(null) QOS=(null)
>    JobState=PENDING Reason=Resources Dependency=(null)
>    Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>    RunTime=00:00:00 TimeLimit=01:00:00 TimeMin=N/A
>    SubmitTime=2024-04-04T09:55:34 EligibleTime=2024-04-04T09:55:34
>    AccrueTime=2024-04-04T09:55:34
>    StartTime=2024-04-04T10:55:28 EndTime=2024-04-04T11:55:28 Deadline=N/A
>    SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-04-04T09:55:58
>    Partition=mainpart AllocNode:Sid=newcusco:2450574
>    ReqNodeList=(null) ExcNodeList=(null)
>    NodeList=(null) SchedNodeList=cusco
>    NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>    TRES=cpu=1,node=1,billing=1
>    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>    MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
>    Features=(null) DelayBoot=00:00:00
>    OverSubscribe=YES Contiguous=0 Licenses=(null) Network=(null)
>    Command=/data/work/sma-scratch/tohoku_wOcean/CF1090_wOcean500m.shell
>    WorkDir=/data/work/sma-scratch/tohoku_wOcean
>    StdErr=/data/work/sma-scratch/tohoku_wOcean/slurm-78.out
>    StdIn=/dev/null
>    StdOut=/data/work/sma-scratch/tohoku_wOcean/slurm-78.out
>    Power=
>
> On Thu, Apr 4, 2024 at 8:57 AM Renfro, Michael <ren...@tntech.edu> wrote:
>
> What does "scontrol show node cusco" and "scontrol show job
> PENDING_JOB_ID" show?
>
> On one job we currently have that's pending due to Resources, that job has
> requested 90 CPUs and 180 GB of memory as seen in its ReqTRES= value, but
> the node it wants to run on only has 37 CPUs available (seen by comparing
> its CfgTRES= and AllocTRES= values).
>
> From: Alison Peterson via slurm-users <slurm-users@lists.schedmd.com>
> Date: Thursday, April 4, 2024 at 10:43 AM
> To: slurm-users@lists.schedmd.com <slurm-users@lists.schedmd.com>
> Subject: [slurm-users] SLURM configuration help
>
> I am writing to seek assistance with a critical issue on our single-node
> system managed by Slurm. Our jobs are queued and marked as awaiting
> resources, but they are not starting even though resources appear to be
> available. I'm new to Slurm; my only experience is a class on installing
> it, so I have no experience running or using it.
>
> Issue summary:
>
> Main problem: when two jobs are submitted, only one runs and the second
> shows NODELIST(REASON) (Resources). I've checked that our single node has
> enough RAM (2 TB) and CPUs (64) available.
>
> # COMPUTE NODES
> NodeName=cusco CPUs=64 Sockets=2 CoresPerSocket=32 ThreadsPerCore=1
> RealMemory=2052077 Gres=gpu:1,gpu:1,gpu:1,gpu:1
> PartitionName=mainpart Default=YES MinNodes=1 DefaultTime=00:60:00
> MaxTime=UNLIMITED AllowAccounts=ALL Nodes=ALL State=UP OverSubscribe=Force
>
> System details: We have a single-node setup with Slurm as the workload
> manager. The node appears to have sufficient resources for the queued jobs.
>
> Troubleshooting performed:
> Configuration checks: I have verified all Slurm configurations and the
> system's resource availability, which should not be limiting job execution.
> Service status: The Slurm daemon slurmdbd is active and running without
> any reported issues. System resource monitoring shows no shortages that
> would prevent job initiation.
>
> Any guidance and help will be deeply appreciated!
>
> --
> Alison Peterson
> IT Research Support Analyst
> Information Technology
> apeters...@sdsu.edu
> O: 619-594-3364
> San Diego State University | SDSU.edu <http://sdsu.edu/>
> 5500 Campanile Drive | San Diego, CA 92182-8080

--
Alison Peterson
IT Research Support Analyst
Information Technology
apeters...@sdsu.edu
O: 619-594-3364
San Diego State University | SDSU.edu <http://sdsu.edu/>
5500 Campanile Drive | San Diego, CA 92182-8080
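For anyone who finds this thread later, here is a minimal slurm.conf sketch of the DefMemPerCPU approach Michael describes above, set either globally or on the partition line. The 16000 MB figure is only an illustrative assumption (well under this node's roughly 32 GB per core, i.e. 2052077 MB / 64); it is not a value taken from this thread, and memory is only enforced this way when it is configured as a consumable resource (for example SelectTypeParameters=CR_Core_Memory). After editing slurm.conf, apply the change with scontrol reconfigure.

  # Global default, in megabytes per allocated CPU, for jobs that request no memory:
  DefMemPerCPU=16000

  # Or as a per-partition setting, added to the existing partition definition:
  PartitionName=mainpart Default=YES DefMemPerCPU=16000 MinNodes=1 DefaultTime=00:60:00 MaxTime=UNLIMITED AllowAccounts=ALL Nodes=ALL State=UP OverSubscribe=Force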
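Independent of the cluster-wide default, individual jobs can also avoid reserving the whole node by requesting only the memory they need in the batch script. A sketch follows; the partition name comes from the thread, while the 8G request and the program name are placeholders:

  #!/bin/bash
  #SBATCH --partition=mainpart
  #SBATCH --ntasks=1
  #SBATCH --cpus-per-task=1
  #SBATCH --mem=8G        # explicit per-node memory request; --mem-per-cpu is the per-CPU alternative
  srun ./my_solver        # placeholder for the actual workload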
--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com