Thank you for that - I'm restricting things via limits.conf on the login nodes at the moment, but have been considering using cgroups instead for a while. So this is very useful :)

If we're sharing details, our setup currently is:

A 2x2 setup of not-quite-federations: prod and dev, each with a 'capability' and a 'throughput' cluster.

The 'capability' system is homogeneous, and all nodes have low latency interconnect; it prefers big jobs. The 'throughput' system is more heterogeneous (it's where all the GPUs live, as well as some CPU only nodes); it prefers small jobs.

(dev system has the same cluster config, but not the same resources, obviously - it's for configuration testing, mostly, and for testing upgrades)

Both 'federations' have one database server (a server running mariadb and slurmdbd) - so there's a 'prod' and a 'dev' database server.

All clusters have the same partitions & users & projects etc.

Each cluster has a dedicated host running slurmctld.

SLURM is installed locally - I build RPMs.
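
(For anyone reading along: building the RPMs is more or less the standard Slurm recipe, something like the below - the exact tarball name depends on the version, of course.)

rpmbuild -ta slurm-<version>.tar.bz2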

2 login nodes per prod cluster (only one overall for dev).

Apart from the actual worker nodes, all of these are VMs. I have no secondary slurmctlds or anything, as I'm more or less relying on the VMware cluster to handle that. (And we've done live migrations on the VMware end without problems.) The login nodes are doubled up because they get rebooted regularly for security updates, so that people can always log in.

Login nodes are not (!) on the same OS release as cluster nodes - they run the latest for security reasons. Software building etc happens on 'interactive' nodes (...a partition that oversubscribes by default).

Nodes mount application shares by hardware architecture (skylake, broadwell, ...), using autofs variables to pick up the correct share, so applications have the same path on all nodes but what's mounted is the build for the local architecture. (Using EasyBuild & lmod for applications.)
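
(In case it's useful, a rough sketch of how that can look - the share paths, file names and the CPUARCH variable are made up, and the sysconfig location is the RHEL-ish one; the variable gets set per node:)

# /etc/sysconfig/autofs (or /etc/default/autofs), per node:
OPTIONS="-DCPUARCH=skylake"

# auto.master entry:
/apps   /etc/auto.apps

# /etc/auto.apps - ${CPUARCH} expands to the value defined above:
*   nfs-server:/export/apps/${CPUARCH}/&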

Updates always need to be database first, of course (don't they always); however, I can't quite confirm binaries not working after upgrading the database - we ran with a 20.11 database server and 20.02 everything else for a fair while (the change in MPI behaviour had us downgrade everything apart from the database), so I've only ever seen '-M' (and all accounting) not work during DB restarts.
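
(For anyone new to this, 'database first' means roughly the sequence below - the DB name is the default, the dump path just an example:)

# stop and back up the accounting database first
systemctl stop slurmdbd
mysqldump slurm_acct_db > /root/slurm_acct_db.sql
# upgrade the slurmdbd package, then
systemctl start slurmdbd
# then upgrade & restart slurmctld per cluster, then slurmd on the nodes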

They were meant to actually be federated, but it confused our users, so I broke the federation again (but left the rest of the setup in place).
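
(For reference, the federation itself is just a sacctmgr object, so it can be created and torn down without touching anything else - roughly like this, with made-up names:)

sacctmgr add federation myfed clusters=clusterA,clusterB
sacctmgr delete federation myfed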

Tina

On 01/11/2021 10:35, Yair Yarom wrote:

CPU limiting using ulimit is pretty straightforward with pam_limits and /etc/security/limits.conf. On some of the login nodes we have a CPU time limit of 10 minutes, so heavy processes will fail.
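
(A minimal sketch of that - the group name is just an example, and the 10 is minutes of CPU time; pam_limits needs to be enabled in the relevant PAM service:)

# /etc/security/limits.conf
@users    hard    cpu    10

# e.g. in /etc/pam.d/sshd
session    required    pam_limits.so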

The memory was a bit more complicated (i.e. not pretty). We wanted a user not to be able to use more than e.g. 1G for all processes combined. Using systemd we added the file /etc/systemd/system/user-.slice.d/20-memory.conf, which contains:
[Slice]
MemoryLimit=1024M
MemoryAccounting=true

But we also wanted to restrict swap usage, and we're still on cgroup v1, so systemd didn't help there. The ugly part is a pam_exec call to a script that updates the memsw limit of the cgroup for the above slice. The script does more things, but the swap section is more or less:

if [ "x$PAM_TYPE" = 'xopen_session' ]; then
     _id=`id -u $PAM_USER`
     if [ -z "$_id" ]; then
         exit 1
     fi
    if [[ -e /sys/fs/cgroup/memory/user.slice/user-${_id}.slice/memory.memsw.limit_in_bytes ]]; then
         swap=$((1126 * 1024 * 1024))
        echo $swap > /sys/fs/cgroup/memory/user.slice/user-${_id}.slice/memory.memsw.limit_in_bytes
     fi
fi
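
(The script gets hooked in via pam_exec in the relevant PAM service - roughly the line below; the script path is made up:)

session    optional    pam_exec.so /usr/local/sbin/user-limits.sh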


On Sun, Oct 31, 2021 at 6:36 PM Brian Andrus <toomuc...@gmail.com> wrote:

    That is interesting to me.

    How do you use ulimit and systemd to limit user usage on the login
    nodes? This sounds like something very useful.

    Brian Andrus

    On 10/31/2021 1:08 AM, Yair Yarom wrote:
    Hi,

    If it helps, this is our setup:
    6 clusters (actually a bit more)
    1 mysql + slurmdbd on the same host
    6 primary slurmctld on 3 hosts (need to make sure each has a
    distinct SlurmctldPort)
    6 secondary slurmctld on an arbitrary node on the clusters themselves.
    1 login node per cluster (this is a very small VM, and the users
    are limited in both CPU time (with ulimit) and memory (with systemd))
    The slurm.confs are shared with everyone over NFS, as
    /path/to/nfs/<cluster name>/slurm.conf, with a symlink in /etc to
    the relevant cluster's file on each node.

    The -M generally works, we can submit/query jobs from a login node
    of one cluster to another. But there's a caveat to notice when
    upgrading. slurmdbd must be upgraded first, but usually we have a
    not so small gap between upgrading the different clusters. This
    causes the -M to stop working because binaries of one version
    won't work on the other (I don't remember in which direction).
    We solved this by using an lmod module per cluster, which sets
    both the SLURM_CONF environment variable and the PATH to the
    correct slurm binaries (which we install in
    /usr/local/slurm/<version>/ so that they co-exist). So when the -M
    won't work, users can use:
    module load slurm/clusterA
    squeue
    module load slurm/clusterB
    squeue
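
    (A rough sketch of such a modulefile, in Lmod's Lua format - the
    paths and the version are placeholders:)

    -- slurm/clusterA.lua
    setenv("SLURM_CONF", "/path/to/nfs/clusterA/slurm.conf")
    prepend_path("PATH", "/usr/local/slurm/<version>/bin")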

    BR,







    On Thu, Oct 28, 2021 at 7:39 PM navin srivastava
    <navin.alt...@gmail.com> wrote:

        Thank you Tina.
        It will really help

        Regards
        Navin

        On Thu, Oct 28, 2021, 22:01 Tina Friedrich
        <tina.friedr...@it.ox.ac.uk> wrote:

            Hello,

            I have the database on a separate server (it runs the
            database and the
            database only). The login nodes run nothing SLURM related,
            they simply
            have the binaries installed & a SLURM config.

            I've never looked into having multiple databases & using
            AccountingStorageExternalHost (in fact I'd forgotten you
            could do that),
            so I can't comment on that (maybe someone else can); I
            think that works,
            yes, but as I said never tested that (didn't see much
            point in running
            multiple databases if one would do the job).

            I actually have specific login nodes for both of my
            clusters, to make it
            easier for users (especially those with not much
            experience using the
            HPC environment); so I have one login node connecting to
            cluster 1 and
            one connecting to cluster 2.

            I think the relevant config entries (if I'm not mistaken)
            on the login nodes are probably these. The differences
            between the slurm config files (that haven't got to do with
            topology & nodes & scheduler tuning) are:

            ClusterName=cluster1
            ControlMachine=cluster1-slurm
            ControlAddr=IP_OF_SLURM_CONTROLLER

            ClusterName=cluster2
            ControlMachine=cluster2-slurm
            ControlAddr=IP_OF_SLURM_CONTROLLER

            (where IP_OF_SLURM_CONTROLLER is the IP address of host
            cluster1-slurm,
            same for cluster2)

            And then they have common entries for the
            AccountingStorageHost:

            AccountingStorageHost=slurm-db-prod
            AccountingStorageBackupHost=slurm-db-prod
            AccountingStoragePort=7030
            AccountingStorageType=accounting_storage/slurmdbd

            (slurm-db-prod is simply the hostname of the SLURM
            database server)
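
            (So from either login node you can do things like the
            following - the cluster names and the job script are just
            examples:)

            squeue -M cluster1,cluster2
            sbatch -M cluster2 job.sh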

            Does that help?

            Tina

            On 28/10/2021 14:59, navin srivastava wrote:
            > Thank you Tina.
            >
            > so if I understood correctly, the database is global to
            both clusters and
            > running on the login node?
            > or is the database running on one of the master nodes and
            shared with
            > the other master node?
            >
            > but as far as I have read, the slurm database can also
            be separate on
            > both masters, just using the parameter
            > AccountingStorageExternalHost so that both databases are
            aware of each
            > other.
            >
            > Also, on the login node, which slurmctld does the
            slurm.conf file point to?
            > Is it possible to share a sample slurm.conf file of the
            login node?
            >
            > Regards
            > Navin.
            >
            > On Thu, Oct 28, 2021 at 7:06 PM Tina Friedrich
            > <tina.friedr...@it.ox.ac.uk> wrote:
            >
            >     Hi Navin,
            >
            >     well, I have two clusters & login nodes that allow
            access to both. That
            >     do? I don't think a third would make any difference
            in setup.
            >
            >     They need to share a database. As long as they share
            a database, the
            >     clusters have 'knowledge' of each other.
            >
            >     So if you set up one database server (running
            slurmdbd), and then a
            >     SLURM controller for each cluster (running
            slurmctld) using that one
            >     central database, the '-M' option should work.
            >
            >     Tina
            >
            >     On 28/10/2021 10:54, navin srivastava wrote:
            >      > Hi ,
            >      >
            >      > I am looking for a stepwise guide to set up a
            multi-cluster
            >     implementation.
            >      > We wanted to set up 3 clusters and one login node
            to run jobs
            >     using
            >      > the -M cluster option.
            >      > Does anybody have such a setup, and can you share
            some insight into how it
            >      > works and whether it is really a stable solution?
            >      >
            >      >
            >      > Regards
            >      > Navin.
            >
            >     --
            >     Tina Friedrich, Advanced Research Computing Snr HPC
            Systems
            >     Administrator
            >
            >     Research Computing and Support Services
            >     IT Services, University of Oxford
            > http://www.arc.ox.ac.uk
            > http://www.it.ox.ac.uk
            >

            --
            Tina Friedrich, Advanced Research Computing Snr HPC
            Systems Administrator

            Research Computing and Support Services
            IT Services, University of Oxford
            http://www.arc.ox.ac.uk
            http://www.it.ox.ac.uk





--

   /|        |
   \/        |Yair Yarom | System Group (DevOps)
   []        |The Rachel and Selim Benin School
   []  /\     |of Computer Science and Engineering
   []//\\/   |The Hebrew University of Jerusalem
   [//   \\   |T +972-2-5494522 | F +972-2-5494522
   //     \   |ir...@cs.huji.ac.il
  //         |


--
Tina Friedrich, Advanced Research Computing Snr HPC Systems Administrator

Research Computing and Support Services
IT Services, University of Oxford
http://www.arc.ox.ac.uk http://www.it.ox.ac.uk
