Hi Guillaume,
In that example you wouldn't need the 'srun' approach to run more than
one task, I think.
I'm not 100% sure, but it sounds like you're currently assigning whole
nodes to jobs rather than cores (i.e. you have
'SelectType=select/linear' and no OverSubscribe) and find that to be
wasteful - is that correct?
If so, I'd say the more obvious solution would be to change the
SelectType to either select/cons_res or select/cons_tres, so that cores
(not nodes) are allocated to jobs.
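Something like this in slurm.conf, as a minimal sketch (and as far as I
know, changing SelectType needs a restart of the Slurm daemons):

SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory

With that, a 12-core job only occupies 12 cores, so Slurm can pack
three such jobs onto one of your 40-core nodes.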
Tina
On 15/06/2022 13:20, Guillaume De Nayer wrote:
Dear all,
I'm new to this list. I am responsible for several small clusters at
our chair.
I set up Slurm 21.08.8-2 on a small cluster (CentOS 7) with 8 nodes:
NodeName=node0[1-8] CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=20
ThreadsPerCore=1
One colleague has to run 20,000 jobs on this machine. Every job starts
his program with mpirun on 12 cores. The standard Slurm behavior means
that the node running such a job is blocked entirely (28 of its 40
cores sit idle). As the cluster has only 8 nodes, only 8 jobs can run
in parallel.
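Roughly, each of these jobs looks like the following sketch (the exact
#SBATCH options are a guess here, and 'my_app' is a placeholder for the
real program):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --partition=short
# one 12-rank MPI run per job; the node's other 28 cores stay idle
mpirun -np 12 ./my_app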
In order to solve this problem, I'm trying to start some subtasks with
srun inside a batch job (without mpirun for now):
#!/bin/bash
#SBATCH --job-name=test_multi_prog_srun
#SBATCH --nodes=1
#SBATCH --partition=short
#SBATCH --time=02:00:00
#SBATCH --exclusive
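# start two one-core job steps in the background; with --exact they
# should be able to run at the same time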
srun -vvv --exact -n1 -c1 sleep 20 > srun1.log 2>&1 &
srun -vvv --exact -n1 -c1 sleep 30 > srun2.log 2>&1 &
wait
However, only one task runs; the second waits for the first to
complete before starting.
Can someone explain what I'm doing wrong?
Thx in advance,
Regards,
Guillaume
# slurm.conf file
MpiDefault=none
ProctrackType=proctrack/linuxproc
ReturnToService=1
SlurmUser=root
SwitchType=switch/none
TaskPlugin=task/none
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
AccountingStorageEnforce=limits
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreFlags=job_comment
JobAcctGatherFrequency=30
SlurmctldDebug=error
SlurmdDebug=error
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdLogFile=/var/log/slurmd.log
NodeName=node0[1-8] CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=20
ThreadsPerCore=1 State=UNKNOWN
PartitionName=short Nodes=node[01-08] Default=NO MaxTime=0-02:00:00
State=UP DefaultTime=00:00:00 MinNodes=1 PriorityTier=100
--
Tina Friedrich, Advanced Research Computing Snr HPC Systems Administrator
Research Computing and Support Services
IT Services, University of Oxford
http://www.arc.ox.ac.uk http://www.it.ox.ac.uk