Hi Emre,

MAX_TASKS_PER_NODE is set to 512. Does this mean I cannot run more than 512 jobs in parallel on one node? Or can I change MAX_TASKS_PER_NODE to a higher value and recompile Slurm?
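If recompiling is the way to go, this is roughly what I had in mind (just a sketch on my side; I am assuming a build from the Slurm source tree into the same /opt/slurm prefix we already use, and the exact header location may differ between versions):

    # find the compile-time constant the srun man page refers to
    grep -rn "MAX_TASKS_PER_NODE" slurm/

    # edit the definition before building, e.g. (new value picked arbitrarily):
    #   #define MAX_TASKS_PER_NODE 512   ->   #define MAX_TASKS_PER_NODE 1024

    # then rebuild and reinstall into our existing prefix
    ./configure --prefix=/opt/slurm
    make -j
    make install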
Regards,
Karl

On 14/09/2021 21:47, Emre Brookes wrote:
> *-O*, *--overcommit*
>     Overcommit resources. When applied to job allocation, only one CPU
>     is allocated to the job per node and options used to specify the
>     number of tasks per node, socket, core, etc. are ignored. When
>     applied to job step allocations (the *srun* command when executed
>     within an existing job allocation), this option can be used to
>     launch more than one task per CPU. Normally, *srun* will not
>     allocate more than one process per CPU. By specifying *--overcommit*
>     you are explicitly allowing more than one process per CPU. However
>     no more than *MAX_TASKS_PER_NODE* tasks are permitted to execute per
>     node. NOTE: *MAX_TASKS_PER_NODE* is defined in the file /slurm.h/
>     and is not a variable, it is set at Slurm build time.
>
> I have used this successfully to run more jobs than cpus/cores avail.
>
> -e.
>
> Karl Lovink wrote:
>> Hello,
>>
>> I am in the process of setting up our SLURM environment. We want to use
>> SLURM during our DDoS exercises for dispatching DDoS attack scripts. We
>> need a lot of parallel running jobs on a total of 3 nodes. I can't get it
>> to run more than 128 jobs simultaneously. There are 128 cpu's in the
>> compute nodes.
>>
>> How can I ensure that I can run more jobs in parallel than there are
>> CPUs in the compute node?
>>
>> Thanks
>> Karl
>>
>> My srun script is:
>> srun --exclusive --nodes 3 --ntasks 384 /ddos/demo/showproc.sh
>>
>> And my slurm.conf file:
>> ClusterName=ddos-cluster
>> ControlMachine=slurm
>> SlurmUser=ddos
>> SlurmctldPort=6817
>> SlurmdPort=6818
>> AuthType=auth/munge
>> StateSaveLocation=/opt/slurm/spool/ctld
>> SlurmdSpoolDir=/opt/slurm/spool/d
>> SwitchType=switch/none
>> MpiDefault=none
>> SlurmctldPidFile=/opt/slurm/run/.pid
>> SlurmdPidFile=/opt/slurm/run/slurmd.pid
>> ProctrackType=proctrack/pgid
>> PluginDir=/opt/slurm/lib/slurm
>> ReturnToService=2
>> TaskPlugin=task/none
>> SlurmctldTimeout=300
>> SlurmdTimeout=300
>> InactiveLimit=0
>> MinJobAge=300
>> KillWait=30
>> Waittime=0
>> SchedulerType=sched/backfill
>>
>> SelectType=select/cons_tres
>> SelectTypeParameters=CR_Core
>>
>> SlurmctldDebug=3
>> SlurmctldLogFile=/opt/slurm/log/slurmctld.log
>> SlurmdDebug=3
>> SlurmdLogFile=/opt/slurm/log/slurmd.log
>> JobCompType=jobcomp/none
>> JobAcctGatherType=jobacct_gather/none
>> AccountingStorageTRES=gres/gpu
>> DebugFlags=CPU_Bind,gres
>> AccountingStorageType=accounting_storage/slurmdbd
>> AccountingStorageHost=localhost
>> AccountingStoragePass=/var/run/munge/munge.socket.2
>> AccountingStorageUser=slurm
>> SlurmctldParameters=enable_configurable
>> GresTypes=gpu
>> DefMemPerNode=256000
>> NodeName=aivd CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=4 RealMemory=261562 State=UNKNOWN
>> NodeName=mivd CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=4 RealMemory=261562 State=UNKNOWN
>> NodeName=fiod CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=4 RealMemory=261562 State=UNKNOWN
>> PartitionName=ddos Nodes=ALL Default=YES MaxTime=INFINITE State=UP
>> PartitionName=adhoc Nodes=ALL Default=YES MaxTime=INFINITE State=UP
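If I read the quoted --overcommit description correctly, adding it to my original command should let the 384 tasks start even though each node only has 128 CPUs. This is what I plan to try next (untested on my side yet, and I am not sure whether --exclusive is still useful together with it, so I have dropped that flag for now):

    srun --overcommit --nodes 3 --ntasks 384 /ddos/demo/showproc.sh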