Thank you!!!! That was the issue. I'm so happy :-) Sending you many thanks.

On Thu, Apr 4, 2024 at 10:11 AM Renfro, Michael <ren...@tntech.edu> wrote:

> Yep, from your scontrol show node output:
>
> CfgTRES=cpu=64,mem=2052077M,billing=64
> AllocTRES=cpu=1,mem=2052077M
>
>
>
> The running job (77) has allocated 1 CPU and all the memory on the node.
> That’s probably due to the partition using the default DefMemPerCPU value
> [1], which is unlimited.
>
>
>
> Since all our nodes are shared, and our workloads vary widely, we set our
> DefMemPerCPU value to something considerably lower than
> mem_in_node/cores_in_node. That way, most jobs will leave some memory
> available by default, and other jobs can use that extra memory as long as
> CPUs are available.
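>
> As a rough sketch (the numbers are illustrative, not a recommendation for
> your site): this node has 2052077 MB across 64 cores, roughly 32063 MB per
> core, so something along the lines of
>
>    # illustrative value; pick whatever fits your typical jobs
>    DefMemPerCPU=16000
>
> in slurm.conf (either globally or on the PartitionName=mainpart line) would
> let a 1-CPU job default to about 16 GB instead of being handed the whole
> node, while jobs that need more can still request it with --mem or
> --mem-per-cpu.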
>
>
>
> [1] https://slurm.schedmd.com/slurm.conf.html#OPT_DefMemPerCPU
>
>
>
> *From: *Alison Peterson <apeters...@sdsu.edu>
> *Date: *Thursday, April 4, 2024 at 11:58 AM
> *To: *Renfro, Michael <ren...@tntech.edu>
> *Subject: *Re: [EXT] Re: [slurm-users] SLURM configuration help
>
>
> Here is the info:
>
> *sma@cusco:/data/work/sma-scratch/tohoku_wOcean$ scontrol show node cusco*
>
>
> NodeName=cusco Arch=x86_64 CoresPerSocket=32
>    CPUAlloc=1 CPUTot=64 CPULoad=0.02
>    AvailableFeatures=(null)
>    ActiveFeatures=(null)
>    Gres=gpu:4
>    NodeAddr=cusco NodeHostName=cusco Version=19.05.5
>    OS=Linux 5.4.0-172-generic #190-Ubuntu SMP Fri Feb 2 23:24:22 UTC 2024
>    RealMemory=2052077 AllocMem=2052077 FreeMem=1995947 Sockets=2 Boards=1
>    State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>    Partitions=mainpart
>    BootTime=2024-03-01T17:06:26 SlurmdStartTime=2024-03-01T17:06:53
>    CfgTRES=cpu=64,mem=2052077M,billing=64
>    AllocTRES=cpu=1,mem=2052077M
>    CapWatts=n/a
>    CurrentWatts=0 AveWatts=0
>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
>
>
> *sma@cusco:/data/work/sma-scratch/tohoku_wOcean$ squeue*
>
>
>              JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>                 78  mainpart CF1090_w      sma PD       0:00      1 (Resources)
>                 77  mainpart CF0000_w      sma  R       0:26      1 cusco
>
> *sma@cusco:/data/work/sma-scratch/tohoku_wOcean$ scontrol show job 78*
>
>
> JobId=78 JobName=CF1090_wOcean500m.shell
>    UserId=sma(1008) GroupId=myfault(1001) MCS_label=N/A
>    Priority=4294901720 Nice=0 Account=(null) QOS=(null)
>    JobState=PENDING Reason=Resources Dependency=(null)
>    Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>    RunTime=00:00:00 TimeLimit=01:00:00 TimeMin=N/A
>    SubmitTime=2024-04-04T09:55:34 EligibleTime=2024-04-04T09:55:34
>    AccrueTime=2024-04-04T09:55:34
>    StartTime=2024-04-04T10:55:28 EndTime=2024-04-04T11:55:28 Deadline=N/A
>    SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-04-04T09:55:58
>    Partition=mainpart AllocNode:Sid=newcusco:2450574
>    ReqNodeList=(null) ExcNodeList=(null)
>    NodeList=(null) SchedNodeList=cusco
>    NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>    TRES=cpu=1,node=1,billing=1
>    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>    MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
>    Features=(null) DelayBoot=00:00:00
>    OverSubscribe=YES Contiguous=0 Licenses=(null) Network=(null)
>    Command=/data/work/sma-scratch/tohoku_wOcean/CF1090_wOcean500m.shell
>    WorkDir=/data/work/sma-scratch/tohoku_wOcean
>    StdErr=/data/work/sma-scratch/tohoku_wOcean/slurm-78.out
>    StdIn=/dev/null
>    StdOut=/data/work/sma-scratch/tohoku_wOcean/slurm-78.out
>    Power=
>
>
>
> On Thu, Apr 4, 2024 at 8:57 AM Renfro, Michael <ren...@tntech.edu> wrote:
>
> What does “scontrol show node cusco” and “scontrol show job
> PENDING_JOB_ID” show?
>
>
>
> One job we currently have that's pending due to Resources has requested 90
> CPUs and 180 GB of memory (seen in its ReqTRES= value), but the node it
> wants to run on only has 37 CPUs available (seen by comparing the node's
> CfgTRES= and AllocTRES= values).
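>
> If it helps, the quick way to pull those values out is something like:
>
>    scontrol show node cusco | grep TRES
>    scontrol show job PENDING_JOB_ID | grep -i tres
>
> and then compare what the pending job is requesting against CfgTRES minus
> AllocTRES on the node it's waiting for.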
>
>
>
> *From: *Alison Peterson via slurm-users <slurm-users@lists.schedmd.com>
> *Date: *Thursday, April 4, 2024 at 10:43 AM
> *To: *slurm-users@lists.schedmd.com <slurm-users@lists.schedmd.com>
> *Subject: *[slurm-users] SLURM configuration help
>
>
> I am writing to seek assistance with a critical issue on our single-node
> system managed by Slurm. Our jobs are queued and marked as awaiting
> resources, but they are not starting even though resources appear to be
> available. I'm new to Slurm; my only experience is a class on installing
> it, so I have no experience running or using it.
>
> Issue Summary:
>
> Main Problem: Of the jobs submitted, only one runs; the second shows
> *NODELIST(REASON)* as *(Resources)*. I've checked that our single node has
> enough RAM (2 TB) and CPUs (64) available.
>
>
>
> # COMPUTE NODES
> NodeName=cusco CPUs=64 Sockets=2 CoresPerSocket=32 ThreadsPerCore=1
> RealMemory=2052077 Gres=gpu:1,gpu:1,gpu:1,gpu:1
> PartitionName=mainpart Default=YES MinNodes=1 DefaultTime=00:60:00
> MaxTime=UNLIMITED AllowAccounts=ALL Nodes=ALL State=UP OverSubscribe=Force
>
>
>
>
>
> System Details: We have a single-node setup with Slurm as the workload
> manager. The node appears to have sufficient resources for the queued jobs.
>
> Troubleshooting Performed:
> Configuration Checks: I have verified all Slurm configurations and the
> system's resource availability, which should not be limiting job execution.
> Service Status: The Slurm daemon slurmdbd is active and running without
> any reported issues. System resource monitoring shows no shortages that
> would prevent job initiation.
>
>
>
> Any guidance and help will be deeply appreciated!
>
>
>


-- 
*Alison Peterson*
IT Research Support Analyst
*Information Technology*
apeters...@sdsu.edu
O: 619-594-3364
*San Diego State University | SDSU.edu <http://sdsu.edu/>*
5500 Campanile Drive | San Diego, CA 92182-8080
