On a single Rocky 8 workstation with one GPU, where we wanted interactive ssh logins to get a small portion of its resources (shell, compiling, simple data manipulation, console desktop, etc.) and the rest to go to SLURM, we did this:

- Set it to use cgroupv2
  * modify /etc/default/grub to add systemd.unified_cgroup_hierarchy=1
    to GRUB_CMDLINE_LINUX.  Regenerate the grub config with grub2-mkconfig
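
    For example, on a BIOS-boot system (the grub.cfg path differs under
    EFI, e.g. /boot/efi/EFI/rocky/grub.cfg):

      # in /etc/default/grub, append to the existing GRUB_CMDLINE_LINUX:
      GRUB_CMDLINE_LINUX="... systemd.unified_cgroup_hierarchy=1"

      # regenerate grub.cfg
      grub2-mkconfig -o /boot/grub2/grub.cfg
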
  * create file /usr/etc/cgroup_cpuset_init with the lines

#!/bin/bash
echo "+cpuset" >> /sys/fs/cgroup/cgroup.subtree_control
echo "+cpuset" >> /sys/fs/cgroup/system.slice/cgroup.subtree_control

  * Modify/create /etc/systemd/system/slurmd.service.d/override.conf
    so it has:

[Service]
ExecStartPre=-/usr/etc/cgroup_cpuset_init
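
    After creating or editing the drop-in, reload systemd so it picks up
    the override (the reboot at the end also takes care of this):

systemctl daemon-reload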

- Figure out exactly which cores to keep for "free user" use and which
  to give to SLURM.  Also use GPU sharding in SLURM so the GPU can be shared.

  * install hwloc (which provides the hwloc-ls tool)
  * run 'hwloc-ls' to translate physical cores 0-9 to logical cores.
    For me, P 0-9 was Logical 0,2,4,6,8,10,12,14,16,18
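
    As a rough cross-check without hwloc, lscpu's extended output also
    shows the logical-CPU-to-core mapping (assuming its CORE numbering
    matches the cores you want to reserve):

    # logical CPU numbers belonging to cores 0-9
    lscpu -e=CPU,CORE | awk 'NR>1 && $2<10 {print $1}'
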
  * in /etc/slurm.conf the NodeName definition has

    CPUs=128 Boards=1 SocketsPerBoard=1 CoresPerSocket=64 ThreadsPerCore=2 \
    RealMemory=257267 MemSpecLimit=20480 \
    CpuSpecList=0,2,4,6,8,10,12,14,16,18 \
    TmpDisk=6000000 Gres=gpu:nvidia_a2:1,shard:nvidia_a2:32

    reserving those 10 logical cores and 20GB of RAM for "free user" use
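
    One assumption worth stating: in my reading of the slurm.conf docs,
    CpuSpecList only actually keeps job tasks off those CPUs when
    cgroup-based task confinement is enabled.  A sketch of the settings
    involved (check against your own slurm.conf/cgroup.conf rather than
    copying blindly):

    # slurm.conf
    ProctrackType=proctrack/cgroup
    TaskPlugin=task/cgroup,task/affinity

    # cgroup.conf
    ConstrainCores=yes
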

  * gres.conf has the lines:

AutoDetect=nvml
Name=shard Count=32

  * Need to add gres/shard to GresTypes= in slurm.conf too.  Job
    submissions then use the option --gres=shard:N where N is less than 32
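
    For example (the job command is just a placeholder):

    # in slurm.conf, alongside whatever GresTypes entries already exist
    GresTypes=gpu,shard

    # request 8 of the 32 shards on the A2
    srun --gres=shard:8 ./my_gpu_job
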

- Set up systemd to restrict "free users" to cores 0-9 and the 20GB of RAM

  * Run: systemctl set-property user.slice MemoryHigh=20480M
  * Run, for every individual user on the system,

    systemctl set-property user-$uid.slice AllowedCPUs=0-9

    where $uid is that user's user ID.  We do this in a script that
    also runs sacctmgr to add them to the SLURM system (a sketch of
    such a script follows below).

    I could not just set AllowedCPUs on user.slice itself, which is
    what I first tried, because that restricted the root user too and
    caused weird behavior with a lot of system tools.  So far the
    root/daemon processes work fine within the 20GB limit though, so
    that MemoryHigh=20480M setting is one and done.
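
    A minimal sketch of such a per-user script, assuming the SLURM
    account name is a placeholder for your own site setup:

#!/bin/bash
# add a local user to SLURM accounting and pin their login sessions
# to the reserved "free user" cores (logical CPUs 0-9)
user="$1"
uid=$(id -u "$user") || exit 1

# hypothetical account name -- replace with your own
sacctmgr -i add user name="$user" account=workstation

# set-property writes a persistent drop-in, so this survives reboots
systemctl set-property "user-${uid}.slice" AllowedCPUs=0-9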

Then reboot.

-- Paul Raines (http://help.nmr.mgh.harvard.edu)


