Howdy, and thanks for the warm welcome,

On Fri, 24 Feb 2023 at 07:31, Doug Meyer <dameye...@gmail.com> wrote:
> Hi,
>
> Did you configure your node definition with the outputs of slurmd -C?
> Ignore boards. Don't know if it is still true but several years ago
> declaring boards made things difficult.
>

$ slurmd -C
NodeName=shavak-DIT400TR-55L CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=95311
UpTime=0-00:47:51

$ grep NodeName /etc/slurm-llnl/slurm.conf
NodeName=shavak-DIT400TR-55L CPUs=64 RealMemory=95311 Gres=gpu:1

There is a difference. I, too, discarded the Boards and sockets in slurm.conf. Is that the problem? (I've put a possible corrected node line further down.)

> Also, if you have hyperthreaded AMD or Intel processors your partition
> declaration should be oversubscribe:2
>

Yes, I do. It's actually 16 x 2 cores with hyperthreading, but the BIOS is set to show them as 64 cores.

> Start with a very simple job with a script containing sleep 100 or
> something else without any runtime issues.
>

I ran this MPI hello world thing
<https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/usr/local/share/examples/mpi_runs_inf/mpi_count.c>
with this sbatch script
<https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/usr/local/share/examples/mpi_runs_inf/mpi_count_normal.sbatch>.
It should be the same thing as your suggestion, basically. Should I switch to 'srun' in the batch file?
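In case it clarifies what I mean, here is a minimal sketch of the kind of simple test job you describe, with the MPI binary launched through srun. The job name, partition name, and binary path are placeholders, and I haven't run this exact file:

#!/bin/bash
#SBATCH --job-name=mpi_count_test   # placeholder name
#SBATCH --partition=debug           # placeholder: whatever partition is in slurm.conf
#SBATCH --nodes=1                   # single-node "cluster"
#SBATCH --ntasks=8                  # 8 MPI ranks, well under the 64 CPUs
#SBATCH --cpus-per-task=1           # one core per rank, so CR_Core can pack jobs
#SBATCH --time=00:10:00             # short walltime for a test run

# srun launches the ranks under Slurm's control, so the job should stay
# inside its allocated cores (assumes the MPI library was built with
# Slurm/PMI support; otherwise mpirun inside the allocation also works).
srun ./mpi_count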
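Also, going back to the NodeName difference above: here is roughly how I would rebuild the node and partition lines from the slurmd -C output (minus Boards) plus your oversubscribe:2 suggestion. This is just a sketch on my end, not what is currently in my slurm.conf; the partition name and the Default/MaxTime/State options are placeholders:

# Node definition rebuilt from `slurmd -C`, with Boards dropped:
NodeName=shavak-DIT400TR-55L CPUs=64 Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=95311 Gres=gpu:1

# Already set this way, per my first mail: schedule by core, not by whole node.
SelectType=select/cons_res
SelectTypeParameters=CR_Core

# Placeholder partition line with OverSubscribe=FORCE:2 as you suggested
# for hyperthreaded CPUs:
PartitionName=debug Nodes=shavak-DIT400TR-55L Default=YES MaxTime=INFINITE State=UP OverSubscribe=FORCE:2

Does that look closer to what you had in mind?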
AR

> When I started with slurm I built the sbatch one small step at a time.
> Nodes, cores, memory, partition, mail, etc.
>
> It sounds like your config is very close but your problem may be in the
> submit script.
>
> Best of luck and welcome to slurm. It is very powerful with a huge
> community.
>
> Doug
>
> On Thu, Feb 23, 2023 at 6:58 AM Analabha Roy <hariseldo...@gmail.com>
> wrote:
>
>> Hi folks,
>>
>> I have a single-node "cluster" running Ubuntu 20.04 LTS with the
>> distribution packages for slurm (slurm-wlm 19.05.5).
>> Slurm only ran one job in the node at a time with the default
>> configuration, leaving all other jobs pending.
>> This happened even if that one job only requested a few cores (the
>> node has 64 cores, and slurm.conf is configured accordingly).
>>
>> In slurm.conf, SelectType is set to select/cons_res, and
>> SelectTypeParameters to CR_Core. NodeName is set with CPUs=64. The path
>> to the file is referenced below.
>>
>> So I set OverSubscribe=FORCE in the partition config and restarted the
>> daemons.
>>
>> Multiple jobs now run concurrently, but when Slurm is oversubscribed,
>> it is *truly* *oversubscribed*. That is to say, it runs so many jobs
>> that there are more processes running than cores/threads.
>> How should I config slurm so that it runs multiple jobs at once per node,
>> but ensures that it doesn't run more processes than there are cores? Is
>> there some TRES magic for this that I can't seem to figure out?
>>
>> My slurm.conf is here on github:
>> https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/etc/slurm-llnl/slurm.conf
>> The only gres I've set is for the GPU:
>> https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/etc/slurm-llnl/gres.conf
>>
>> Thanks for your attention,
>> Regards,
>> AR
>> --
>> Analabha Roy
>> Assistant Professor
>> Department of Physics <http://www.buruniv.ac.in/academics/department/physics>
>> The University of Burdwan <http://www.buruniv.ac.in/>
>> Golapbag Campus, Barddhaman 713104
>> West Bengal, India
>> Emails: dan...@utexas.edu, a...@phys.buruniv.ac.in, hariseldo...@gmail.com
>> Webpage: http://www.ph.utexas.edu/~daneel/

--
Analabha Roy
Assistant Professor
Department of Physics <http://www.buruniv.ac.in/academics/department/physics>
The University of Burdwan <http://www.buruniv.ac.in/>
Golapbag Campus, Barddhaman 713104
West Bengal, India
Emails: dan...@utexas.edu, a...@phys.buruniv.ac.in, hariseldo...@gmail.com
Webpage: http://www.ph.utexas.edu/~daneel/