Hi Patrick, Hi Ray,

Happy Friday! Thank you both for your quick replies. This is what I found out.
With Patrick's one-liner it works fine:

NodeName=radonc[01-04] CPUs=32 RealMemory=64402 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2

With Ray's suggestion I get an error message for each node. Here is the message from just one node:

sacct: error: NodeNames=radonc01 CPUs=32 doesn't match Sockets*CoresPerSocket*ThreadsPerCore (16), resetting CPUs

The interesting thing is that if you follow the Sockets*CoresPerSocket*ThreadsPerCore formula, 2 x 8 x 2 = 32, yet the message above says (16) - strange, no?

Also, as Ray suggested, NodeAddr=10.112.0.5,10.112.0.6,10.112.0.14,10.112.0.16 with commas between the IPs works fine.

So for now I will stay with Patrick's one-liner. Although this solution did not give any error messages, I am still worried that SLURM still thinks Sockets*CoresPerSocket*ThreadsPerCore is (16).

FYI: the /etc/hosts file on each machine (master and execute nodes) looks like this:

10.112.0.25   radoncmaster.stanford.EDU   radoncmaster
10.112.0.5    radonc01.stanford.EDU       radonc01
10.112.0.6    radonc02.stanford.EDU       radonc02
10.112.0.14   radonc03.stanford.EDU       radonc03
10.112.0.16   radonc04.stanford.EDU       radonc04

Now, when I run sacct it says SLURM accounting storage is disabled, which I am OK with since I have only two post-docs at the moment.

How can I test my cluster with a sample job and make sure it uses all the CPUs and RAM? (I put a couple of rough sketches of what I have in mind at the bottom of this mail, below the quoted messages.)

Thank you for your help and patience with me.

Best,
Eric

_____________________________________________________________________________________________________

Eric F. Alemany
System Administrator for Research

Division of Radiation & Cancer Biology
Department of Radiation Oncology

Stanford University School of Medicine
Stanford, California 94305

Tel: 1-650-498-7969  No Texting
Fax: 1-650-723-7382


On May 4, 2018, at 6:14 AM, Patrick Goetz <pgo...@math.utexas.edu> wrote:

I concur with this. Make sure your nodes are in the /etc/hosts file on the SMS. Also, if you name them by base + numerical sequence, you can configure them with a single line in Slurm (using the example below):

NodeName=radonc[01-04] CPUs=32 RealMemory=64402 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2


On 05/04/2018 12:05 AM, Raymond Wan wrote:

Hi Eric,

On Fri, May 4, 2018 at 6:04 AM, Eric F. Alemany <ealem...@stanford.edu> wrote:

# COMPUTE NODES
NodeName=radonc[01-04] NodeAddr=10.112.0.5 10.112.0.6 10.112.0.14 10.112.0.16 CPUs=32 RealMemory=64402 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
PartitionName=debug Nodes=radonc[01-04] Default=YES MaxTime=INFINITE State=UP

I don't know what the problem is, but my *guess*, based on my own configuration file, is that we have one node per line under "NodeName". We also don't have NodeAddr, but maybe that's OK. This means the IP addresses of the nodes in our cluster are hard-coded in /etc/hosts. Also, State is not given.
So, if I formatted yours to look like ours, it would be something like:

NodeName=radonc01 CPUs=32 RealMemory=64402 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2
NodeName=radonc02 CPUs=32 RealMemory=64402 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2
NodeName=radonc03 CPUs=32 RealMemory=64402 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2
NodeName=radonc04 CPUs=32 RealMemory=64402 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2

PartitionName=debug Nodes=radonc[01-04] Default=YES MaxTime=INFINITE State=UP

Maybe the problem is with NodeAddr, because you might have to separate the values with a comma instead of a space? With spaces, it might have problems parsing? That's my guess...

Ray
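
P.S. About my worry that SLURM still thinks Sockets*CoresPerSocket*ThreadsPerCore is 16: from the man pages it looks like I can compare what slurmd detects on a node against what slurmctld has been told. This is only a sketch (radonc01 is just an example node), so please correct me if these are not the right commands:

# On a compute node: print the node configuration slurmd detects from the hardware
slurmd -C

# On the master: show what slurmctld currently has for that node
scontrol show node radonc01

# One line per node with CPUs, sockets, cores, threads and memory
sinfo -N -o "%N %c %X %Y %Z %m"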
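
P.P.S. And for the test job, this is the sort of minimal batch script I have in mind to spread work across all four nodes and keep every CPU busy for about a minute. The job name, the script name and the sha256sum busy-loop are only placeholders, not anything official:

#!/bin/bash
#SBATCH --job-name=cpu-test
#SBATCH --partition=debug
#SBATCH --nodes=4                # all four radonc nodes
#SBATCH --ntasks-per-node=32     # one task per CPU as configured
#SBATCH --mem=0                  # ask for all the memory on each node (if memory is tracked)
#SBATCH --time=00:10:00

# Show how the tasks were spread over the nodes
srun hostname | sort | uniq -c

# Keep every allocated CPU busy for about a minute
# (|| true so the step does not report a failure when timeout stops it)
srun bash -c 'timeout 60 sha256sum /dev/zero > /dev/null || true'

I would save that as something like test_all_cpus.sbatch, submit it with "sbatch test_all_cpus.sbatch", and watch squeue on the master and top/htop on the nodes while it runs. sacct would show per-job usage too, but only once accounting storage is enabled.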