We don't do anything.  In our environment it is the user's responsibility to optimize their code appropriately.  Since we have a great variety of hardware, any modules we build (we have several thousand of them) are all built generically.  If people want processor-specific optimizations then they have to build their own stack.

-Paul Edmon-

On 6/20/19 11:07 AM, Fulcomer, Samuel wrote:
...ah, got it. I was confused by "PI/Lab nodes" in your partition list.

Our QoS/account pair for each investigator condo is our approximate equivalent of what you're doing with owned partitions.

Since we have everything in one partition we segregate processor types via topology.conf. We break up topology.conf further to keep MPI jobs on the same switch.

On another topic, how do you address code optimization for processor type? We've been mostly linking with MKL and relying on its multi-code-path support.

Regards,
Sam

On Thu, Jun 20, 2019 at 10:20 AM Paul Edmon <ped...@cfa.harvard.edu> wrote:

    People will specify which partition they need or if they want
    multiple they use this:

    #SBATCH -p general,shared,serial_requeue

    The scheduler will then just select whichever partition they can
    run in first.  Naturally there is a risk that you will end up
    running in a more expensive partition.

    Our time limit is only applied to our public partitions; our owned
    partitions (of which we have roughly 80) have no time limit.  So
    if they run on their dedicated resources they have no penalty. 
    We've been working on getting rid of owned partitions and moving
    to a school/department based partition, where all the purchased
    resources for different PIs go into the same bucket where they
    compete against themselves and not the wider community.  We've
    found that this ends up working pretty well as most PIs only use
    their purchased resources sporadically.  Thus there are usually
    idle cores lying around that we backfill with our serial queues. 
    Since those are requeueable we can get immediate response to
    access that idle space.  We are also toying with a high priority
    partition that is open to people with high fairshare so that they
    can get immediate response as those with high fairshare tend to be
    bursty users.

    Our current halflife is set to a month and we keep 6 months of
    data in our database.  I'd actually like to get rid of the
    halflife and just go to a 3 month moving window to allow people to
    bank their fairshare, but we haven't done that yet as people have
    been having a hard enough time understanding our current system. 
    It's not due to its complexity but more that most people just flat
    out aren't cognizant of their usage and think the resource is
    functionally infinite.
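
To make the difference between a half-life and a moving window concrete, here is a rough Python sketch. It is a simplified model and not Slurm's actual implementation: the function names, the 30-day and 90-day values standing in for "a month" and "3 months", and the usage history are all assumptions for illustration.

```python
from typing import Iterable, Tuple

# Each record is (age_in_days, raw_usage); the usage units are arbitrary here.
UsageRecord = Tuple[float, float]


def decayed_usage(records: Iterable[UsageRecord], half_life_days: float = 30.0) -> float:
    """Weight each record by 0.5 ** (age / half_life), so usage halves every half-life."""
    return sum(usage * 0.5 ** (age / half_life_days) for age, usage in records)


def windowed_usage(records: Iterable[UsageRecord], window_days: float = 90.0) -> float:
    """Count usage inside the window at full weight and ignore anything older."""
    return sum(usage for age, usage in records if age <= window_days)


# Hypothetical history: 100 units used today and 30, 60, and 120 days ago.
history = [(0.0, 100.0), (30.0, 100.0), (60.0, 100.0), (120.0, 100.0)]

print(decayed_usage(history))   # 100 + 50 + 25 + 6.25
print(windowed_usage(history))  # the 120-day-old record falls outside the window
```

With a half-life, old usage fades gradually but never quite disappears; with a window, usage counts in full until it ages out, which is what would let people "bank" fairshare by staying idle.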

    -Paul Edmon-

    On 6/19/19 5:16 PM, Fulcomer, Samuel wrote:
    Hi Paul,

    Thanks. Your setup is interesting. I see that you have your
    processor types segregated in their own partitions (with the
    exception of the requeue partition), and that's how you get at
    the weighting mechanism. Do you have your users explicitly
    specify multiple partitions in the batch commands/scripts in
    order to take advantage of this, or do you use a plugin for it?

    It sounds like you don't impose any hard limit on simultaneous
    resource use, and allow everything to fairshare out with the help
    of the 7 day TimeLimit. We haven't been imposing any TimeLimit on
    our condo users, which would be an issue for us with your config.
    For our exploratory and priority users, we impose an effective
    time limit with GrpTRESRunMins=cpu (and gres/gpu= for the GPU
    usage). In addition, since we have so many priority users, we
    don't explicitly set a rawshare value for them (they all execute
    under the "default" account). We set rawshare for the condo
    accounts as cores-purchased/total-cores*1000.
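
As a worked example of the rawshare formula above, a minimal Python sketch follows. The function name, the rounding to a whole number, and the cluster and condo sizes are assumptions for illustration only; the thread does not give actual core counts.

```python
def condo_rawshare(cores_purchased: int, total_cores: int, scale: int = 1000) -> int:
    """Return cores_purchased / total_cores * scale, rounded to a whole share value."""
    if total_cores <= 0:
        raise ValueError("total_cores must be positive")
    if not 0 <= cores_purchased <= total_cores:
        raise ValueError("cores_purchased must be between 0 and total_cores")
    # Multiply before dividing to keep floating-point error small.
    return round(cores_purchased * scale / total_cores)


# Hypothetical numbers: a 10,000-core cluster and a condo that bought 256 cores.
print(condo_rawshare(256, 10_000))  # 25.6 rounds to 26
```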

    What's your fairshare decay setting (don't remember the proper
    name at the moment)?

    Regards,
    Sam



    On Wed, Jun 19, 2019 at 3:44 PM Paul Edmon
    <ped...@cfa.harvard.edu> wrote:

        We do a similar thing here at Harvard:

        https://www.rc.fas.harvard.edu/fairshare/

        We simply weight all the partitions based on their core type
        and then we allocate Shares for each account based on what
        they have purchased.  We don't use QoS at all, so we just
        rely purely on fairshare weighting for resource usage.  It
        has worked pretty well for our purposes.
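
The idea of weighting partitions by core type can be sketched roughly as follows in Python. This is only an illustration of the principle: the partition names and weight values are hypothetical, and the charge formula (cores times hours times weight) is an assumption rather than the exact accounting used at any site.

```python
# Hypothetical per-core weights by partition; real values are site-specific.
PARTITION_WEIGHT = {"older_cpu": 1.0, "newer_cpu": 2.0}


def charged_usage(partition: str, cores: int, hours: float) -> float:
    """Usage charged against an account's fairshare: core-hours scaled by partition weight."""
    return cores * hours * PARTITION_WEIGHT[partition]


# The same 32-core, 10-hour job costs twice as much on the faster partition.
print(charged_usage("older_cpu", 32, 10.0))  # 320.0
print(charged_usage("newer_cpu", 32, 10.0))  # 640.0
```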

        -Paul Edmon-

        On 6/19/19 3:30 PM, Fulcomer, Samuel wrote:

        (...and yes, the name is inspired by a certain OEM's
        software licensing schemes...)

        At Brown we run a ~400 node cluster containing nodes of
        multiple architectures (Sandy/Ivy, Haswell/Broadwell, and
        Sky/Cascade) purchased in some cases by University funds and
        in others by investigator funding (~50:50).  They all appear
        in the default SLURM partition. We have 3 classes of SLURM
        users:

         1. Exploratory - no-charge access to up to 16 cores
         2. Priority - $750/quarter for access to up to 192 cores
            (and with a GrpTRESRunMins=cpu limit). Each user has
            their own QoS
         3. Condo - an investigator group who paid for nodes added
            to the cluster. The group has its own QoS and SLURM
            Account. The QoS allows use of the number of cores
            purchased and has a much higher priority than the QoSes
            of the "priority" users.

        The first problem with this scheme is that condo users who
        have purchased the older hardware now have access to the
        newest without penalty. In addition, we're encountering
        resistance to the idea of turning off their hardware and
        terminating their condos (despite MOUs stating a 5yr life).
        The pushback is the stated belief that the hardware should
        run until it dies.

        What I propose is a new TRES called a Processor Performance
        Unit (PPU) that would be specified on the Node line in
        slurm.conf, and used such that GrpTRES=ppu=N was calculated
        as the number of allocated cores multiplied by their
        associated PPU numbers.

        We could then assign a base PPU to the oldest hardware, say,
        "1" for Sandy/Ivy and increase for later architectures based
        on performance improvement. We'd set the condo QoS to
        GrpTRES=ppu=N*X+M*Y,..., where N is the number of cores of
        the oldest architecture multiplied by the configured
        PPU/core, X, and repeat for any newer nodes/cores the
        investigator has purchased since.
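
To illustrate the proposed arithmetic, here is a small Python sketch of how a condo's GrpTRES=ppu value could be derived and checked against a running allocation. The PPU TRES does not exist in Slurm today, and the per-core PPU values (other than "1" for Sandy/Ivy), the architecture keys, and the purchase sizes are all hypothetical.

```python
from typing import Mapping

# Hypothetical PPU per core. Only the Sandy/Ivy baseline of 1 comes from the
# proposal; the other values are placeholders for measured performance ratios.
PPU_PER_CORE = {"sandy_ivy": 1.0, "haswell_broadwell": 1.5, "sky_cascade": 2.0}


def ppu_total(cores_by_arch: Mapping[str, int]) -> float:
    """Sum cores * PPU/core over every architecture (the N*X + M*Y + ... form)."""
    return sum(cores * PPU_PER_CORE[arch] for arch, cores in cores_by_arch.items())


def fits_limit(allocated: Mapping[str, int], purchased: Mapping[str, int]) -> bool:
    """True if the PPUs of the allocated cores stay within the purchased PPU limit."""
    return ppu_total(allocated) <= ppu_total(purchased)


# Hypothetical condo: 64 Sandy/Ivy cores bought first, 32 Sky/Cascade cores later.
purchased = {"sandy_ivy": 64, "sky_cascade": 32}
print(ppu_total(purchased))                       # 64*1.0 + 32*2.0 = 128.0

# Running entirely on the newest nodes: 64 cores fit, 80 cores do not.
print(fits_limit({"sky_cascade": 64}, purchased))  # True
print(fits_limit({"sky_cascade": 80}, purchased))  # False
```

Under these made-up weights, a group that bought only old hardware could still run on the newest nodes, but on proportionally fewer cores.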

        The result is that the investigator group gets to run on an
        approximation of the performance that they've purchased,
        rather than on the raw purchased core count.

        Thoughts?

