Re: [slurm-users] Single Node cluster. How to manage oversubscribing

2023-03-02 Thread Analabha Roy
On Wed, 1 Mar 2023 at 07:51, Doug Meyer wrote: > Hi, I forgot one thing you didn't mention. When you change the node descriptors and partitions you also have to restart slurmctld. scontrol reconfigure works for the nodes, but the main daemon has to be told to reread the config. Until

Re: [slurm-users] Single Node cluster. How to manage oversubscribing

2023-02-28 Thread Doug Meyer
Hi, I forgot one thing you didn't mention. When you change the node descriptors and partitions you also have to restart slurmctld. scontrol reconfigure works for the nodes, but the main daemon has to be told to reread the config. Until you restart the daemon it will be referencing the config from
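A minimal sketch of that sequence, assuming a systemd-managed install where the controller runs as the slurmctld service (service and unit names may differ on other setups):

    # propagate edits to slurm.conf node/partition lines to the daemons
    sudo scontrol reconfigure
    # the controller itself still needs a restart to reread the config
    sudo systemctl restart slurmctld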

Re: [slurm-users] Single Node cluster. How to manage oversubscribing

2023-02-26 Thread Analabha Roy
Hey, Thanks for sticking with this. On Sun, 26 Feb 2023 at 23:43, Doug Meyer wrote: > Hi, Suggest removing "boards=1". The docs say to include it, but in previous discussions with SchedMD we were advised to remove it. I just did, then ran scontrol reconfigure. > When you are runni

Re: [slurm-users] Single Node cluster. How to manage oversubscribing

2023-02-26 Thread Doug Meyer
Hi, Suggest removing "boards=1". The docs say to include it, but in previous discussions with SchedMD we were advised to remove it. When you are running, execute "scontrol show node <nodename>" and look at the lines ConfigTRES and AllocTRES. The former is what the maître d' believes is available; the latter
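For illustration, a hedged sketch of that check on a hypothetical node named node001 (the values are made up; recent Slurm prints the configured field as CfgTRES):

    $ scontrol show node node001
    ...
    CfgTRES=cpu=64,mem=257000M,billing=64
    AllocTRES=cpu=32,mem=64000M
    ...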

Re: [slurm-users] Single Node cluster. How to manage oversubscribing

2023-02-26 Thread Analabha Roy
Hi Doug, Again, many thanks for your detailed response. Based on my understanding of your previous note, I did the following: I set the NodeName with CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 and the partitions with OverSubscribe=FORCE:2, then I put further restrictio
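A sketch of what those slurm.conf lines might look like; the node name, memory, and partition name below are placeholders, not the poster's actual values:

    NodeName=node001 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=250000
    PartitionName=main Nodes=node001 Default=YES State=UP OverSubscribe=FORCE:2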

Re: [slurm-users] Single Node cluster. How to manage oversubscribing

2023-02-25 Thread Doug Meyer
Hi, You got me, I didn't know that "oversubscribe=FORCE:2" is an option. I'll need to explore that. I missed the question about srun. srun is the preferred method, I believe. I am not associated with drafting the submit scripts but can ask my peer. You do need to stipulate the number of cores you wa
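A hedged example of stipulating the core count explicitly on the srun line (the program name and counts are illustrative):

    # one task with 8 CPUs bound to it; adjust to what the job actually needs
    srun -n 1 -c 8 ./my_program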

Re: [slurm-users] Single Node cluster. How to manage oversubscribing

2023-02-25 Thread Analabha Roy
Hi, Thanks for your considered response. A couple of questions linger... On Sat, 25 Feb 2023 at 21:46, Doug Meyer wrote: > Hi, Declaring cores=64 will absolutely work, but if you start running MPI you'll want a more detailed config description. The easy way to read it is "128 = 2 sockets *

Re: [slurm-users] Single Node cluster. How to manage oversubscribing

2023-02-25 Thread Doug Meyer
Hi, Declaring cores=64 will absolutely work, but if you start running MPI you'll want a more detailed config description. The easy way to read it is "128 = 2 sockets * 32 cores per socket * 2 threads per core". NodeName=hpc[306-308] CPUs=128 Sockets=2 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=512
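Spelled out, the reading Doug gives is simply the product of the topology fields; the same arithmetic applies to any node line:

    CPUs = Sockets x CoresPerSocket x ThreadsPerCore = 2 x 32 x 2 = 128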

Re: [slurm-users] Single Node cluster. How to manage oversubscribing

2023-02-23 Thread Analabha Roy
Howdy, and thanks for the warm welcome, On Fri, 24 Feb 2023 at 07:31, Doug Meyer wrote: > Hi, Did you configure your node definition with the output of slurmd -C? Ignore boards. Don't know if it is still true, but several years ago declaring boards made things difficult. $ slurmd -C

Re: [slurm-users] Single Node cluster. How to manage oversubscribing

2023-02-23 Thread Doug Meyer
Hi, Did you configure your node definition with the output of slurmd -C? Ignore boards. Don't know if it is still true, but several years ago declaring boards made things difficult. Also, if you have hyperthreaded AMD or Intel processors your partition declaration should be oversubscribe:2. Start w
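For illustration, hedged sample output of slurmd -C on a hypothetical dual-socket, hyperthreaded node; the real values must come from the machine itself:

    $ slurmd -C
    NodeName=node001 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257000
    UpTime=12-03:45:07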

[slurm-users] Single Node cluster. How to manage oversubscribing

2023-02-23 Thread Analabha Roy
Hi folks, I have a single-node "cluster" running Ubuntu 20.04 LTS with the distribution packages for Slurm (slurm-wlm 19.05.5). Slurm only ran one job on the node at a time with the default configuration, leaving all other jobs pending. This happened even if that one job only requested a few c