I am writing to seek assistance with a critical issue on our single-node
system managed by Slurm. Jobs are queued and marked as awaiting resources,
but they do not start despite resources appearing to be available. I am new
to Slurm: my only experience is a class on installing it, so I have no
experience running or administering it.

Issue Summary:

Main Problem: Of the jobs submitted, only one runs; the second remains
pending with NODELIST(REASON) showing (Resources). I have checked that our
single node has enough RAM (2 TB) and CPUs (64) available.

# COMPUTE NODES
NodeName=cusco CPUs=64 Sockets=2 CoresPerSocket=32 ThreadsPerCore=1
RealMemory=2052077 Gres=gpu:1,gpu:1,gpu:1,gpu:1
PartitionName=mainpart Default=YES MinNodes=1 DefaultTime=00:60:00
MaxTime=UNLIMITED AllowAccounts=ALL Nodes=ALL State=UP OverSubscribe=Force


System Details: We have a single-node setup with Slurm as the workload
manager. The node appears to have sufficient resources for the queued jobs.

Troubleshooting Performed:
Configuration Checks: I have verified the Slurm configuration and the
system's resource availability; neither should be limiting job execution.
Service Status: The Slurm database daemon (slurmdbd) is active and running
without any reported issues. System resource monitoring shows no shortages
that would prevent jobs from starting.
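
For reference, the checks above can be reproduced from the command line. The
sketch below is only illustrative: it assumes the Slurm client tools (squeue,
scontrol) are on PATH, and the job ID 123 is a placeholder, not one of our
actual jobs.

```shell
#!/bin/sh
# Diagnostic sketch (assumes Slurm client tools are installed; job ID 123
# is a placeholder).
if command -v squeue >/dev/null 2>&1; then
    # List pending jobs with the scheduler's stated reason (e.g. Resources).
    squeue --states=PENDING -o "%.10i %.12P %.10T %.20r"
    # Full detail for one job, including the resources it requested.
    scontrol show job 123
    # Node-level view: allocated vs. configured CPUs and memory.
    scontrol show node cusco
else
    echo "Slurm client tools not found on this host"
fi
```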

Any guidance and help will be deeply appreciated!

-- 
*Alison Peterson*
IT Research Support Analyst
*Information Technology*
apeters...@sdsu.edu
O: 619-594-3364
*San Diego State University | SDSU.edu <http://sdsu.edu/>*
5500 Campanile Drive | San Diego, CA 92182-8080
-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com