My goal is to set up a single-server Slurm cluster (using only one computer) that can run multiple jobs in parallel.
On my node `nproc` returns 4, so I believe I can run 4 jobs in parallel if each job uses a single core. To do this, I run the controller and the worker daemons on the same node. When I submit four jobs at the same time, only one of them runs and the other three stay pending with the message `queued and waiting for resources`. I am using `Ubuntu 20.04.3 LTS`.

I have observed that this approach works on tag versions `<=19`:

```
$ git clone https://github.com/SchedMD/slurm ~/slurm && cd ~/slurm
$ git checkout e2e21cb571ce88a6dd52989ec6fe30da8c4ef15f  # slurm-19-05-8-1
$ ./configure --enable-debug --enable-front-end --enable-multiple-slurmd
$ sudo make && sudo make install
```

but does not work on higher versions like `slurm 20.02.1` or its `master` branch.

------

```
❯ sinfo
Sat Nov 06 14:17:04 2021
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
home1        1    debug*  idle    1 1:1:1      1        0      1   (null) none
home2        1    debug*  idle    1 1:1:1      1        0      1   (null) none
home3        1    debug*  idle    1 1:1:1      1        0      1   (null) none
home4        1    debug*  idle    1 1:1:1      1        0      1   (null) none

$ srun -N1 sleep 10   # runs
$ srun -N1 sleep 10   # queued and waiting for resources
$ srun -N1 sleep 10   # queued and waiting for resources
$ srun -N1 sleep 10   # queued and waiting for resources
```

This is where I get lost: since this is [emulate mode][1], the jobs should be able to run in parallel.

The way I build from the source code:

```bash
git clone https://github.com/SchedMD/slurm ~/slurm && cd ~/slurm
./configure --enable-debug --enable-multiple-slurmd
make
sudo make install
```

--------

```
$ hostname -s
home
$ nproc
4
```

##### Compute node setup:

```
NodeName=home[1-4] NodeHostName=home NodeAddr=127.0.0.1 CPUs=1 ThreadsPerCore=1 Port=17001
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP Shared=FORCE:1
```

I have also tried `NodeHostName=localhost`.

`slurm.conf` file:

```bash
ControlMachine=home  # $(hostname -s)
ControlAddr=127.0.0.1
ClusterName=cluster
SlurmUser=alper
MailProg=/home/user/slurm_mail_prog.sh
MinJobAge=172800  # 48 h

SlurmdSpoolDir=/var/spool/slurmd
SlurmdLogFile=/var/log/slurm/slurmd.%n.log
SlurmdPidFile=/var/run/slurmd.%n.pid

AuthType=auth/munge
CryptoType=crypto/munge
MpiDefault=none
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPort=6820
SlurmctldPort=6821
StateSaveLocation=/tmp/slurmstate
SwitchType=switch/none
TaskPlugin=task/none
InactiveLimit=0
Waittime=0
SchedulerType=sched/backfill
SelectType=select/linear
PriorityDecayHalfLife=0
PriorityUsageResetPeriod=NONE

AccountingStorageEnforce=limits
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreFlags=YES
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none

NodeName=home[1-2] NodeHostName=home NodeAddr=127.0.0.1 CPUs=2 ThreadsPerCore=1 Port=17001
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP Shared=FORCE:1
```

`slurmdbd.conf`:

```bash
AuthType=auth/munge
AuthInfo=/var/run/munge/munge.socket.2
DbdAddr=localhost
DbdHost=localhost
SlurmUser=alper
DebugLevel=4
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
StorageType=accounting_storage/mysql
StorageUser=alper
StoragePass=12345
```

The way I run Slurm:

```
sudo /usr/local/sbin/slurmd
sudo /usr/local/sbin/slurmdbd &
sudo /usr/local/sbin/slurmctld -cDvvvvvv
```
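In case it is relevant, my understanding of the multiple-slurmd FAQ is that one `slurmd` instance should be started per emulated node name, roughly like this (a sketch using the `home[1-4]` names from my config; I am not certain this invocation is complete):

```bash
# one slurmd per emulated node; the -N name must match a NodeName in slurm.conf
sudo /usr/local/sbin/slurmd -N home1
sudo /usr/local/sbin/slurmd -N home2
sudo /usr/local/sbin/slurmd -N home3
sudo /usr/local/sbin/slurmd -N home4
```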
---------

Related:

- [Minimum number of computers for a SLURM cluster](https://stackoverflow.com/a/27788311/2402577)
- [Running multiple worker daemons SLURM](https://stackoverflow.com/a/40707189/2402577)
- https://stackoverflow.com/a/47009930/2402577

[1]: https://slurm.schedmd.com/faq.html#multi_slurmd
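To see why the jobs stay pending and what resources the controller reports for each emulated node, something like the following should work (exact columns may differ across Slurm versions):

```bash
# show job id, partition, state, and the pending reason for queued jobs
squeue -o "%.8i %.9P %.8T %.12r"

# show what slurmctld thinks one emulated node has (CPUTot, CPUAlloc, State, ...)
scontrol show node home1
```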