Rafael,

Most HPC centers have scheduled downtime on a regular basis. Typically it's one day a month, but I know that at Argonne National Lab, a DOE Leadership Computing Facility that houses some of the largest supercomputers in the world for use by a large number of scientists, they take their systems offline every Monday for maintenance.

Having regularly scheduled maintenance outages is pretty much necessary for any large environment. Otherwise, the users never let you take the clusters offline for maintenance. Once the system is offline for a few hours, a task like upgrading Slurm is pretty easy.

When I worked in a smaller environment, I didn't have regularly scheduled outages, but due to the small size of the environment it was easy for me to ask/tell the users I needed to take the cluster offline with a few days' notice, without any complaints from the users. In larger environments, you'll always get pushback, which is why creating a policy of regularly scheduled maintenance outages is necessary.

Prentice

On 3/22/19 7:07 AM, Frava wrote:
Hi all,

I think it's not that easy to keep SLURM up to date in a cluster of more than 3k nodes with a lot of users. I mean, that cluster is only a little more than 2 years old, and my submission today got JOBID 12711473; the queue has 9769 jobs (squeue | wc -l). In two years there were only two maintenance outages that impacted the users, and each one was announced a few months in advance. They told me that they do plan to update SLURM, but not until late 2019, because they have other things to do before that. Also, I'm the only one asking for heterogeneous jobs...
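(An aside on counting jobs that way: `squeue | wc -l` also counts squeue's header line, so the actual number of jobs is one less than the reported count; `squeue -h` (or `--noheader`) suppresses the header. A quick sketch of the off-by-one, using printf to stand in for squeue output:)

```shell
# Simulated squeue output: one header line plus two job lines.
# (printf stands in for the real `squeue` here; on a live cluster
#  you would compare `squeue | wc -l` vs `squeue -h | wc -l`.)
printf 'JOBID PARTITION NAME\n101 debug a\n102 debug b\n' | wc -l   # 3: header counted
printf '101 debug a\n102 debug b\n' | wc -l                         # 2: jobs only
```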

Rafael.

On Thu, Mar 21, 2019 at 10:19 PM, Prentice Bisbal <pbis...@pppl.gov <mailto:pbis...@pppl.gov>> wrote:

    On 3/21/19 4:40 PM, Reuti wrote:

    >> Am 21.03.2019 um 16:26 schrieb Prentice Bisbal
    <pbis...@pppl.gov <mailto:pbis...@pppl.gov>>:
    >>
    >>
    >> On 3/20/19 1:58 PM, Christopher Samuel wrote:
    >>> On 3/20/19 4:20 AM, Frava wrote:
    >>>
    >>>> Hi Chris, thank you for the reply.
    >>>> The team that manages that cluster is not very fond of
    upgrading SLURM, which I understand.
    >> As a system admin who manages clusters myself, I don't
    understand this. Our job is to provide and maintain resources for
    our users. Part of that maintenance is to provide updates for
    security, performance, and functionality (new features) reasons.
    HPC has always been a leading-edge kind of field, so I feel this
    is even more important for HPC admins.
    >>
    >> Yes, there can be issues caused by updates, but those can be
    mitigated with proper planning: have a plan to do the actual upgrade, have a
    plan to test for issues, and have a plan to revert to an earlier
    version if issues are discovered. This is work, but it's really
    not all that much work, and this is exactly the work we are being
    paid to do as cluster admins.
    > Besides the work on the side of the admins, also the users are
    involved: exchanging libraries also means to run the test suites
    of their applications again.
    >
    > -- Reuti

    That implies the users actually wrote test suites. ;-)


