Hi there, SLURM community,
I swear I've done this before, but now it's failing on a new cluster I'm
deploying. We have 6 compute nodes with 64 CPUs each (384 CPUs total). When I
run `srun -n 500 hostname`, the job stays queued because there aren't 500
CPUs available.
Wasn't there an option that lets this run anyway, so that the first 384
tasks execute and the remaining ones execute as resources free up?
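
To make the desired behavior concrete, here is a minimal sketch of the same effect done by hand: the 500 tasks are split into waves that each fit within the 384 available CPUs. The helper name `plan_waves` is hypothetical and only prints the `srun` commands rather than running them; it is an illustration of the goal, not a Slurm feature.

```shell
#!/bin/sh
# plan_waves TOTAL CAPACITY
# Hypothetical helper (not part of Slurm): prints one srun command per
# wave, never requesting more than CAPACITY tasks at a time.
plan_waves() {
    total=$1
    capacity=$2
    remaining=$total
    while [ "$remaining" -gt 0 ]; do
        if [ "$remaining" -gt "$capacity" ]; then
            n=$capacity
        else
            n=$remaining
        fi
        # Dry run: print the command instead of executing it.
        echo "srun -n $n hostname"
        remaining=$((remaining - n))
    done
}

# 6 nodes x 64 CPUs = 384 CPUs; 500 tasks requested.
plan_waves 500 $((6 * 64))
```

Run sequentially, the printed commands would give a first wave of 384 tasks followed by a second wave of 116. What I'm hoping for is a single `srun` invocation that does this scheduling on its own.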
Here's my conf:

# Slurm Cgroup Configs used on controllers and workers
slurm_cgroup_config:
  CgroupAutomount: yes
  ConstrainCores: yes
  ConstrainRAMSpace: yes
  ConstrainSwapSpace: yes
  ConstrainDevices: yes

# Slurm conf file settings
slurm_config:
  AccountingStorageType: "accounting_storage/slurmdbd"
  AccountingStorageEnforce: "limits"
  AuthAltTypes: "auth/jwt"
  ClusterName: "cluster"
  AccountingStorageHost: "{{ hostvars[groups['controller'][0]].ansible_hostname }}"
  DefMemPerCPU: 1024
  InactiveLimit: 120
  JobAcctGatherType: "jobacct_gather/cgroup"
  JobCompType: "jobcomp/none"
  MailProg: "/usr/bin/mail"
  MaxArraySize: 40000
  MaxJobCount: 100000
  MinJobAge: 3600
  ProctrackType: "proctrack/cgroup"
  ReturnToService: 2
  SelectType: "select/cons_tres"
  SelectTypeParameters: "CR_Core_Memory"
  SlurmctldTimeout: 30
  SlurmctldLogFile: "/var/log/slurm/slurmctld.log"
  SlurmdLogFile: "/var/log/slurm/slurmd.log"
  SlurmdSpoolDir: "/var/spool/slurm/d"
  SlurmUser: "{{ slurm_user.name }}"
  SrunPortRange: "60000-61000"
  StateSaveLocation: "/var/spool/slurm/ctld"
  TaskPlugin: "task/affinity,task/cgroup"
  UnkillableStepTimeout: 120

--
Thanks,
Daniel Healy