IIUC you can't do that.
You either allow overcommit or you split your job into multiple smaller
jobs that fit.
The resources you're requesting must all be available at the same time:
requesting 500 CPUs tells Slurm the job cannot run with only 384. If each
task needs just a CPU or two and you simply want them to run in parallel
as resources allow, use a job array (see the sketch below).
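Roughly, the two approaches would look something like this (an untested
sketch; adjust the partition, account and the real command for your site,
hostname here just stands in for the actual work):

    # Option 1: overcommit - allow more than one task per CPU
    srun -n 500 --overcommit hostname

    # Option 2: a job array of 500 one-task jobs, at most 384 running at once
    sbatch --array=1-500%384 --ntasks=1 --wrap="hostname"

The array form gives the behaviour you describe: whatever fits starts
immediately and the rest are launched as CPUs free up. Your MaxArraySize
of 40000 is already more than large enough for a 500-element array.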
Diego
On 30/05/2024 11:41, Dan Healy via slurm-users wrote:
Following up on this in case anyone can provide some insight, please.
On Thu, May 16, 2024 at 8:32 AM Dan Healy <daniel.t.he...@gmail.com> wrote:
Hi there, SLURM community,
I swear I've done this before, but now it's failing on a new cluster
I'm deploying. We have 6 compute nodes with 64 CPUs each (384 CPUs
total). When I run `srun -n 500 hostname`, the job just gets queued
since there aren't 500 CPUs available.
Wasn't there an option that allows this to run so that the first 384
tasks execute and the remaining ones execute when resources free up?
Here's my conf:
# Slurm Cgroup Configs used on controllers and workers
slurm_cgroup_config:
  CgroupAutomount: yes
  ConstrainCores: yes
  ConstrainRAMSpace: yes
  ConstrainSwapSpace: yes
  ConstrainDevices: yes

# Slurm conf file settings
slurm_config:
  AccountingStorageType: "accounting_storage/slurmdbd"
  AccountingStorageEnforce: "limits"
  AuthAltTypes: "auth/jwt"
  ClusterName: "cluster"
  AccountingStorageHost: "{{ hostvars[groups['controller'][0]].ansible_hostname }}"
  DefMemPerCPU: 1024
  InactiveLimit: 120
  JobAcctGatherType: "jobacct_gather/cgroup"
  JobCompType: "jobcomp/none"
  MailProg: "/usr/bin/mail"
  MaxArraySize: 40000
  MaxJobCount: 100000
  MinJobAge: 3600
  ProctrackType: "proctrack/cgroup"
  ReturnToService: 2
  SelectType: "select/cons_tres"
  SelectTypeParameters: "CR_Core_Memory"
  SlurmctldTimeout: 30
  SlurmctldLogFile: "/var/log/slurm/slurmctld.log"
  SlurmdLogFile: "/var/log/slurm/slurmd.log"
  SlurmdSpoolDir: "/var/spool/slurm/d"
  SlurmUser: "{{ slurm_user.name }}"
  SrunPortRange: "60000-61000"
  StateSaveLocation: "/var/spool/slurm/ctld"
  TaskPlugin: "task/affinity,task/cgroup"
  UnkillableStepTimeout: 120
--
Thanks,
Daniel Healy
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786