Following up on this in case anyone can provide some insight.
On Thu, May 16, 2024 at 8:32 AM Dan Healy <[email protected]> wrote:
> Hi there, SLURM community,
>
> I swear I've done this before, but now it's failing on a new cluster I'm
> deploying. We have 6 compute nodes with 64 CPUs each (384 CPUs total). When
> I run `srun -n 500 hostname`, the job just sits in the queue, since there
> aren't 500 CPUs available.
>
> Wasn't there an option that lets such a job run anyway, with the first 384
> tasks executing immediately and the remainder executing as resources free
> up?
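>
> If no such srun flag exists, my fallback guess (untested on this cluster,
> so just a sketch) would be a job array, since array tasks are scheduled
> independently and should give exactly that rolling behavior:
>
>     # hypothetical array equivalent of `srun -n 500 hostname`:
>     # ~384 one-CPU tasks start at once, the rest pend until CPUs free up
>     sbatch --array=0-499 --wrap="hostname"
>
> But I'd still prefer a single srun if there's an option for it.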
>
> Here's my conf:
>
> # Slurm Cgroup Configs used on controllers and workers
> slurm_cgroup_config:
>   CgroupAutomount: yes
>   ConstrainCores: yes
>   ConstrainRAMSpace: yes
>   ConstrainSwapSpace: yes
>   ConstrainDevices: yes
>
> # Slurm conf file settings
> slurm_config:
>   AccountingStorageType: "accounting_storage/slurmdbd"
>   AccountingStorageEnforce: "limits"
>   AuthAltTypes: "auth/jwt"
>   ClusterName: "cluster"
>   AccountingStorageHost: "{{ hostvars[groups['controller'][0]].ansible_hostname }}"
>   DefMemPerCPU: 1024
>   InactiveLimit: 120
>   JobAcctGatherType: "jobacct_gather/cgroup"
>   JobCompType: "jobcomp/none"
>   MailProg: "/usr/bin/mail"
>   MaxArraySize: 40000
>   MaxJobCount: 100000
>   MinJobAge: 3600
>   ProctrackType: "proctrack/cgroup"
>   ReturnToService: 2
>   SelectType: "select/cons_tres"
>   SelectTypeParameters: "CR_Core_Memory"
>   SlurmctldTimeout: 30
>   SlurmctldLogFile: "/var/log/slurm/slurmctld.log"
>   SlurmdLogFile: "/var/log/slurm/slurmd.log"
>   SlurmdSpoolDir: "/var/spool/slurm/d"
>   SlurmUser: "{{ slurm_user.name }}"
>   SrunPortRange: "60000-61000"
>   StateSaveLocation: "/var/spool/slurm/ctld"
>   TaskPlugin: "task/affinity,task/cgroup"
>   UnkillableStepTimeout: 120
>
>
> --
> Thanks,
>
> Daniel Healy
>
--
Thanks,
Daniel Healy