People specify which partition they need, or if they want to be eligible for several they list them like this:

#SBATCH -p general,shared,serial_requeue

The scheduler will then start the job in whichever of those partitions it can run in first.  Naturally there is a risk that you will end up running in a more expensive partition.

Our time limit is only applied to our public partitions; our owned partitions (of which we have roughly 80) have no time limit, so groups running on their dedicated resources pay no penalty.  We've been working on getting rid of owned partitions and moving to school/department-based partitions, where all the purchased resources for different PIs go into the same bucket and the PIs compete against each other rather than against the wider community.  We've found that this works pretty well, as most PIs only use their purchased resources sporadically.  Thus there are usually idle cores lying around that we backfill with our serial queues; since those jobs are requeueable, owners get immediate access back to that idle space when they need it.  We are also toying with a high-priority partition open to people with high fairshare so that they can get an immediate response, as those with high fairshare tend to be bursty users.
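
As a rough sketch of the mechanics (node names, tiers, and limits below are illustrative, not our actual configuration), the owned-versus-backfill arrangement in slurm.conf looks something like:

PreemptType=preempt/partition_prio
# Public partition: time-limited, not preemptable
PartitionName=general Nodes=node[0001-0200] MaxTime=7-00:00:00 PriorityTier=2 PreemptMode=OFF
# Owned partition: no time limit, highest tier on the owner's nodes
PartitionName=lab_owned Nodes=node[0201-0232] MaxTime=UNLIMITED PriorityTier=10 PreemptMode=OFF
# Backfill partition spanning everything: its jobs get requeued when owners need the cores
PartitionName=serial_requeue Nodes=node[0001-0232] MaxTime=7-00:00:00 PriorityTier=1 PreemptMode=REQUEUE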

Our current half-life is set to a month and we keep 6 months of data in our database. I'd actually like to get rid of the half-life and just go to a 3-month moving window to allow people to bank their fairshare, but we haven't done that yet because people have been having a hard enough time understanding our current system. That's not due to its complexity; it's more that most people just flat out aren't cognizant of their usage and think the resource is functionally infinite.
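
In config terms that's roughly (values paraphrased, not copied from our actual files):

PriorityDecayHalfLife=30-0       # slurm.conf: one-month half-life, days-hours format
PurgeUsageAfter=6months          # slurmdbd.conf: keep about six months of usage records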

-Paul Edmon-

On 6/19/19 5:16 PM, Fulcomer, Samuel wrote:
Hi Paul,

Thanks. Your setup is interesting. I see that you have your processor types segregated into their own partitions (with the exception of the requeue partition), and that's how you get at the weighting mechanism. Do you have your users explicitly specify multiple partitions in their batch commands/scripts in order to take advantage of this, or do you use a plugin for it?

It sounds like you don't impose any hard limit on simultaneous resource use, and allow everything to fairshare out with the help of the 7 day TimeLimit. We haven't been imposing any TimeLimit on our condo users, which would be an issue for us with your config. For our exploratory and priority users, we impose an effective time limit with GrpTRESRunMins=cpu (and gres/gpu= for the GPU usage). In addition, since we have so many priority users, we don't explicitly set a rawshare value for them (they all execute under the "default" account). We set rawshare for the condo accounts as cores-purchased/total-cores*1000.
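
As a rough illustration (the account and QoS names and all numbers here are invented, e.g. a condo that bought 256 of 8192 total cores), that maps onto sacctmgr along these lines:

# rawshare = 256 / 8192 * 1000 ≈ 31 for the condo account
sacctmgr modify account where name=condo_labx set fairshare=31
# effective time limit on a priority user's QoS via running cpu-minutes
sacctmgr modify qos where name=priority_jdoe set GrpTRESRunMins=cpu=1000000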

What's your fairshare decay setting (don't remember the proper name at the moment)?

Regards,
Sam



On Wed, Jun 19, 2019 at 3:44 PM Paul Edmon <ped...@cfa.harvard.edu> wrote:

    We do a similar thing here at Harvard:

    https://www.rc.fas.harvard.edu/fairshare/

    We simply weight all the partitions based on their core type and
    then we allocate Shares for each account based on what they have
    purchased.  We don't use QoS at all, so we just rely purely on
    fairshare weighting for resource usage.  It has worked pretty well
    for our purposes.
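
    As a sketch of that weighting (partition names, weights, and share
    values invented), the per-core-type billing can be expressed with
    TRESBillingWeights on each partition, and the purchased Shares set
    on each account:

        PartitionName=shared_broadwell Nodes=bw[001-064] TRESBillingWeights="CPU=1.0"
        PartitionName=shared_cascade   Nodes=cc[001-032] TRESBillingWeights="CPU=1.5"
        sacctmgr modify account where name=lab_a set fairshare=200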

    -Paul Edmon-

    On 6/19/19 3:30 PM, Fulcomer, Samuel wrote:

    (...and yes, the name is inspired by a certain OEM's software
    licensing schemes...)

    At Brown we run a ~400 node cluster containing nodes of multiple
    architectures (Sandy/Ivy, Haswell/Broadwell, and Sky/Cascade)
    purchased in some cases by University funds and in others by
    investigator funding (~50:50).  They all appear in the default
    SLURM partition. We have 3 classes of SLURM users:

     1. Exploratory - no-charge access to up to 16 cores.
     2. Priority - $750/quarter for access to up to 192 cores (and
        with a GrpTRESRunMins=cpu limit). Each user has their own QoS.
     3. Condo - an investigator group who paid for nodes added to the
        cluster. The group has its own QoS and SLURM Account. The QoS
        allows use of the number of cores purchased and has a much
        higher priority than the QoS' of the "priority" users (see the
        sketch after this list).
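
    A rough sketch of such QoS' in sacctmgr terms (names and limits
    invented):

        # priority user: 192-core cap plus a running cpu-minutes cap
        sacctmgr add qos priority_jdoe Priority=10 GrpTRES=cpu=192 GrpTRESRunMins=cpu=1000000
        # condo group: capped at its purchased core count, much higher priority
        sacctmgr add qos condo_labx Priority=100 GrpTRES=cpu=256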

    The first problem with this scheme is that condo users who have
    purchased the older hardware now have access to the newest
    without penalty. In addition, we're encountering resistance to
    the idea of turning off their hardware and terminating their
    condos (despite MOUs stating a 5yr life). The pushback is the
    stated belief that the hardware should run until it dies.

    What I propose is a new TRES called a Processor Performance Unit
    (PPU) that would be specified on the Node line in slurm.conf, and
    used such that the usage charged against a GrpTRES=ppu=N limit
    would be the number of allocated cores multiplied by their
    associated PPU values.

    We could then assign a base PPU to the oldest hardware, say "1"
    for Sandy/Ivy, and increase it for later architectures based on
    their performance improvement. We'd set the condo QoS to
    GrpTRES=ppu=N*X+M*Y+..., where N is the number of cores of the
    oldest architecture the group purchased and X is that
    architecture's configured PPU per core, with further terms for any
    newer nodes/cores the investigator has purchased since.
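
    Purely as an illustration of the arithmetic (ppu is not an existing
    TRES, and all numbers here are invented): a condo with 256 Sandy
    cores at PPU 1.0 plus 128 Cascade cores at PPU 1.6 would look
    something like:

        NodeName=sandy[001-016]   CPUs=16 PPU=1.0   # hypothetical proposed syntax
        NodeName=cascade[001-004] CPUs=32 PPU=1.6
        # condo limit: 256*1.0 + 128*1.6 = 460.8, so roughly
        sacctmgr modify qos where name=condo_labx set GrpTRES=ppu=461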

    The result is that the investigator group gets to run on an
    approximation of the performance that they've purchased, rather
    than on the raw purchased core count.

    Thoughts?

