Hi Doug,

Again, many thanks for your detailed response. Based on my understanding of your previous note, I did the following:
I set the node definition with CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2, set the partitions with OverSubscribe=FORCE:2, and then put further restrictions in place with the default QOS: MaxTRESPerNode:cpu=32, with MaxJobsPU and MaxSubmit both set to 2. That way, no single user can legally request more than 2 x 32 cores. (A rough sketch of these settings in config form is pasted inline further down.)

As one user, I launched two jobs with sbatch -n 32 each. They started running immediately, taking up all 64 cores. Then I logged in as another user and launched the same job with sbatch -n 2. To my dismay, it started to run! Shouldn't slurm have figured out that all 64 cores were occupied and left the -n 2 job pending?

AR

On Sun, 26 Feb 2023 at 02:18, Doug Meyer <dameye...@gmail.com> wrote:

> Hi,
>
> You got me, I didn't know that "oversubscribe=FORCE:2" is an option. I'll need to explore that.
>
> I missed the question about srun. srun is the preferred one, I believe. I am not associated with drafting the submit scripts but can ask my peer. You do need to stipulate the number of cores you want. Your "sbatch -n 1" should be changed to the number of MPI ranks you desire.
>
> As good as slurm is, many come to assume it does far more than it does. I explain slurm as a maître d' in a very exclusive restaurant, aware of every table and the resources they afford. When a reservation is placed (a job is submitted), the request is reviewed against the resources and against when the other diners/jobs are expected to finish. If a guest requests resources that are not available in the restaurant, the reservation is denied. If a guest arrives and does not need all the resources, the place settings requested but unused are held in the reservation until the job finishes. Slurm manages requests against an inventory. Without enforcement, a job that requests 1 core but uses 12 will run. If your 64-core system accepts 64 single-core reservations, then slurm, believing only 64 cores are needed, will start 64 jobs, and the wait staff (the OS) is left to deal with 768 tasks running on 64 cores. It becomes a sad comedy, as the system will probably run out of RAM, triggering the OOM killer, or just run horribly slowly. Never assume slurm is going to prevent bad actors once they begin running, unless you have configured it to do so.
>
> We run a very lax environment. We set a standard of 6 GB per job unless the sbatch declares otherwise, plus a default max runtime. Without an estimated runtime to work with, the backfill scheduler is crippled. In an environment mixing single-thread and MPI jobs of various sizes, it is critical that jobs are honest about their requirements, giving slurm the information it needs to assign resources correctly.
>
> Doug
>
> On Sat, Feb 25, 2023 at 12:04 PM Analabha Roy <hariseldo...@gmail.com> wrote:
>
>> Hi,
>>
>> Thanks for your considered response. A couple of questions linger...
>>
>> On Sat, 25 Feb 2023 at 21:46, Doug Meyer <dameye...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Declaring cores=64 will absolutely work, but if you start running MPI you'll want a more detailed config description. The easy way to read it is "128 = 2 sockets * 32 cores per socket * 2 threads per core".
>>>
>>> NodeName=hpc[306-308] CPUs=128 Sockets=2 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=512000 TmpDisk=100
>>>
>>> But if you just want to work with logical cores, "CPUs=128" will work.
>>>
>>> If you go with the more detailed description, then you need to declare oversubscription (hyperthreading) in the partition declaration.
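
To make the settings I described at the top concrete, here is roughly what they look like in config form (paraphrased from memory rather than copied verbatim; the partition name is a placeholder, and I'm assuming the default QOS is the usual "normal"):

NodeName=shavak-DIT400TR-55L CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=95311 Gres=gpu:1
PartitionName=main Nodes=shavak-DIT400TR-55L OverSubscribe=FORCE:2   # "main" is a placeholder name
# QOS limits applied with sacctmgr, approximately:
sacctmgr modify qos where name=normal set MaxTRESPerNode=cpu=32 MaxJobsPerUser=2 MaxSubmitJobsPerUser=2

The intent was that, with 64 logical CPUs and FORCE:2, no single user could hold more than two jobs of 32 CPUs each.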
>>
>> Yeah, I'll try that.
>>
>>> By default slurm will not let two different jobs share the logical cores comprising a physical core. For example, if Sue has an array of 1-1000, her array tasks could each take a logical core on a physical core, but if Jamal is also running, they would not be able to share the physical core (as I understand it).
>>>
>>> PartitionName=a Nodes=[301-308] Default=No OverSubscribe=YES:2 MaxTime=Infinite State=Up AllowAccounts=cowboys
>>>
>>> In the sbatch/srun, the user needs to add a declaration "oversubscribe=yes", telling slurm the job can run on both of the logical cores available.
>>
>> How about setting oversubscribe=FORCE:2? That way, users need not add a setting in their scripts.
>>
>>> In the days of Knights Landing, each core could handle four logical cores, but I don't believe there are any current AMD or Intel processors supporting more than two logical cores (hyperthreads per core). The conversation about hyperthreads is difficult, as the Intel terminology is "logical cores" for hyperthreading and "cores" for physical cores, but the tendency is to call the logical cores threads or hyperthreaded cores. This can be very confusing for consumers of the resources.
>>>
>>> In any case, if you create an array job of 1-100 sleep jobs, my simplest logical test job, then you can use scontrol show node <nodename> to see the node's resource configuration as well as consumption. squeue -w <nodename> -i 10 will iterate every ten seconds to show you the node chomping through the job.
>>>
>>> Hope this helps. Once you are comfortable, I would urge you to use the NodeName/Partition descriptor format above and encourage your users to declare oversubscription in their jobs. It is a little more work up front but far easier than correcting scripts later.
>>>
>>> Doug
>>>
>>> On Thu, Feb 23, 2023 at 9:41 PM Analabha Roy <hariseldo...@gmail.com> wrote:
>>>
>>>> Howdy, and thanks for the warm welcome,
>>>>
>>>> On Fri, 24 Feb 2023 at 07:31, Doug Meyer <dameye...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Did you configure your node definition with the outputs of slurmd -C? Ignore boards. I don't know if it is still true, but several years ago declaring boards made things difficult.
>>>>
>>>> $ slurmd -C
>>>> NodeName=shavak-DIT400TR-55L CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=95311 UpTime=0-00:47:51
>>>> $ grep NodeName /etc/slurm-llnl/slurm.conf
>>>> NodeName=shavak-DIT400TR-55L CPUs=64 RealMemory=95311 Gres=gpu:1
>>>>
>>>> There is a difference. I, too, discarded the Boards and Sockets in slurm.conf. Is that the problem?
>>>>
>>>>> Also, if you have hyperthreaded AMD or Intel processors, your partition declaration should include oversubscribe:2.
>>>>
>>>> Yes, I do. It's actually 16 x 2 cores with hyperthreading, but the BIOS is set to show them as 64 cores.
>>>>
>>>>> Start with a very simple job with a script containing sleep 100 or something else without any runtime issues.
>>>>
>>>> I ran this MPI hello world thing
>>>> <https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/usr/local/share/examples/mpi_runs_inf/mpi_count.c>
>>>> with this sbatch script.
>>>> <https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/usr/local/share/examples/mpi_runs_inf/mpi_count_normal.sbatch>
>>>> Should be the same thing as your suggestion, basically.
>>>> Should I switch to 'srun' in the batch file?
>>>>
>>>> AR
>>>>
>>>>> When I started with slurm, I built the sbatch one small step at a time: nodes, cores, memory, partition, mail, etc.
>>>>>
>>>>> It sounds like your config is very close, but your problem may be in the submit script.
>>>>>
>>>>> Best of luck and welcome to slurm. It is very powerful, with a huge community.
>>>>>
>>>>> Doug
>>>>>
>>>>> On Thu, Feb 23, 2023 at 6:58 AM Analabha Roy <hariseldo...@gmail.com> wrote:
>>>>>
>>>>>> Hi folks,
>>>>>>
>>>>>> I have a single-node "cluster" running Ubuntu 20.04 LTS with the distribution packages for slurm (slurm-wlm 19.05.5). With the default configuration, slurm only ran one job on the node at a time, leaving all other jobs pending. This happened even if that one job only requested a few cores (the node has 64 cores, and slurm.conf is configured accordingly).
>>>>>>
>>>>>> In slurm.conf, SelectType is set to select/cons_res and SelectTypeParameters to CR_Core. NodeName is set with CPUs=64. The path to the file is referenced below.
>>>>>>
>>>>>> So I set OverSubscribe=FORCE in the partition config and restarted the daemons.
>>>>>>
>>>>>> Multiple jobs now run concurrently, but when Slurm is oversubscribed, it is *truly* *oversubscribed*. That is to say, it runs so many jobs that there are more processes running than cores/threads. How should I configure slurm so that it runs multiple jobs at once per node, but ensures that it doesn't run more processes than there are cores? Is there some TRES magic for this that I can't seem to figure out?
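
For anyone skimming this quoted history: the relevant pieces of that original configuration, paraphrased from the linked slurm.conf rather than quoted verbatim (the partition name below is a placeholder), were roughly:

SelectType=select/cons_res
SelectTypeParameters=CR_Core
NodeName=shavak-DIT400TR-55L CPUs=64 RealMemory=95311 Gres=gpu:1
PartitionName=main Nodes=shavak-DIT400TR-55L OverSubscribe=FORCE   # "main" is a placeholder name

As described at the top of this mail, the NodeName line has since been changed to the fuller Boards/SocketsPerBoard/CoresPerSocket/ThreadsPerCore form, and FORCE has become FORCE:2 plus the QOS limits.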
>>>>>>
>>>>>> My slurm.conf is here on github:
>>>>>> https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/etc/slurm-llnl/slurm.conf
>>>>>> The only gres I've set is for the GPU:
>>>>>> https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/etc/slurm-llnl/gres.conf
>>>>>>
>>>>>> Thanks for your attention,
>>>>>> Regards,
>>>>>> AR

--
Analabha Roy
Assistant Professor
Department of Physics <http://www.buruniv.ac.in/academics/department/physics>
The University of Burdwan <http://www.buruniv.ac.in/>
Golapbag Campus, Barddhaman 713104
West Bengal, India
Emails: dan...@utexas.edu, a...@phys.buruniv.ac.in, hariseldo...@gmail.com
Webpage: http://www.ph.utexas.edu/~daneel/