Re: [slurm-users] How to automatically release jobs that failed with "launch failed requeued held"

2019-01-22 Thread Doug Meyer
scontrol release job n. Not sure if the system can be set to automatically release jobs, but I would not want them to, as a faulty system will go into a start, fail, start loop. Doug On Tue, Jan 22, 2019 at 10:45 AM Roger Moye wrote: > This morning we had several jobs fail with “launch fa
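
A minimal shell sketch of the release step, assuming the reason string squeue prints matches the one in the subject (on some versions it may appear with underscores instead of spaces), and assuming GNU cut/xargs:

    # find jobs held after the prolog failure and release them
    squeue -h -o "%i|%r" | grep -i "launch.failed.requeued.held" | cut -d"|" -f1 | xargs -r scontrol release

    # or release a single job once the prolog is fixed
    scontrol release <jobid>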

Re: [slurm-users] Configuration recommendations for heterogeneous cluster

2019-01-22 Thread Cyrus Proctor
Hi Prentice, Have you considered Slurm features and constraints at all? You define features (arbitrary strings in your slurm.conf) describing what your hardware provides ("amd", "ib", "FAST", "whatever"). A user then lists constraints using the usual and/or notation ( --constraint=amd&ib ).
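
A minimal sketch of that setup, with hypothetical node names and feature strings rather than anything from Prentice's configuration:

    # slurm.conf: tag nodes with arbitrary feature strings
    NodeName=amd[01-16]   Feature=amd,ib     ...
    NodeName=intel[01-08] Feature=intel,fast ...

    # at submission time, request nodes by feature
    sbatch --constraint="amd&ib" job.sh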

Re: [slurm-users] Configuration recommendations for heterogeneous cluster

2019-01-22 Thread Prentice Bisbal
I left out a *very* critical detail: One of the reasons I'm looking at revamping my Slurm configuration is that my users have requested the capability to submit long-running, low-priority interruptible jobs that can be killed and requeued when shorter-running, higher-priority jobs need to use
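
One common way to express that requirement, sketched with placeholder partition names and values (nothing here is from Prentice's site), is partition-priority preemption with requeue:

    # slurm.conf (sketch): the high-priority partition preempts and requeues the low-priority one
    PreemptType=preempt/partition_prio
    PreemptMode=REQUEUE
    PartitionName=general   PriorityTier=10 Nodes=... Default=YES
    PartitionName=scavenger PriorityTier=1  Nodes=... PreemptMode=REQUEUE

Preempted jobs are only requeued if they are requeueable (JobRequeue in slurm.conf, or sbatch --requeue).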

Re: [slurm-users] Topology configuration questions:

2019-01-22 Thread Ryan Novosielski
Prentice (and others) — if the NodeWeight/topology plugin interaction bothers you, feel free to tack onto bug 6384. https://bugs.schedmd.com/show_bug.cgi?id=6384 > On Jan 22, 2019, at 1:15 PM, Prentice Bisbal wrote: > > Killian, > > Thanks for the input. Unfortunately, all of this information

[slurm-users] Configuration recommendations for heterogeneous cluster

2019-01-22 Thread Prentice Bisbal
Slurm Users, I would like your input on the best way to configure Slurm for a heterogeneous cluster I am responsible for. This e-mail will probably be a bit long in order to include all the necessary details of my environment, so thanks in advance to those of you who read all of it! The cluster I supp

Re: [slurm-users] Topology configuration questions:

2019-01-22 Thread Prentice Bisbal
Killian, Thanks for the input. Unfortunately, all of this information from you, Ryan, and others is really ruining my plans, since it makes it look like fixing a problem with my cluster will not be as easy as I'd hoped. One of the issues with my "Frankencluster" is that I'd like

Re: [slurm-users] Topology configuration questions:

2019-01-22 Thread Prentice Bisbal
Ryan, Thanks for looking into this. I hadn't had a chance to revisit the documentation since posing my question. Thanks for doing that for me. Prentice Bisbal Lead Software Engineer Princeton Plasma Physics Laboratory http://www.pppl.gov On 1/18/19 2:58 PM, Ryan Novosielski wrote: The docume

[slurm-users] How to automatically release jobs that failed with "launch failed requeued held"

2019-01-22 Thread Roger Moye
This morning we had several jobs fail with "launch failed requeued held" state. We traced this to a failed prolog. We fixed the problem, but the jobs remained in this state. Is there a way to configure Slurm so that it will automatically release the job from the held state so that it can run

Re: [slurm-users] Apparent scontrol reboot bug

2019-01-22 Thread Bas van der Vlies
Thanks for the update. We're going to try to build a new package and test it. On 22/01/2019 15:30, Douglas Jacobsen wrote: There were several related commits last week: https://github.com/SchedMD/slurm/commits/slurm-18.08 On Tue, Jan 22, 2019 at 06:28 Douglas Jacobsen > w

Re: [slurm-users] Anyone built PMIX 3.1.1 against Slurm 18.08.4?

2019-01-22 Thread Greg Wickham
I had some time this afternoon, so I dug into the source code (of slurm and pmix) and found the issue. In the file slurm-18.08.4/src/plugins/mpi/pmix/pmixp_client.c, line 147 (first instance): PMIX_VAL_SET(&kvp->value, flag, 0); “PMIX_VAL_SET” is a macro from /usr/include/pmix_common.h (version

Re: [slurm-users] Apparent scontrol reboot bug

2019-01-22 Thread Douglas Jacobsen
There were several related commits last week: https://github.com/SchedMD/slurm/commits/slurm-18.08 On Tue, Jan 22, 2019 at 06:28 Douglas Jacobsen wrote: > Hello, > > Yes it's a bug in the way the reboot rpcs are handled. A fix was recently > committed which we have yet to test, but 18.08.5 is

Re: [slurm-users] Apparent scontrol reboot bug

2019-01-22 Thread Douglas Jacobsen
Hello, Yes, it's a bug in the way the reboot RPCs are handled. A fix was recently committed which we have yet to test, but 18.08.5 is meant to repair this (among other things). Doug On Tue, Jan 22, 2019 at 02:46 Martijn Kruiten wrote: > Hi, > > We encounter a strange issue on our system (Slurm

[slurm-users] Slurm and License management with LMTools

2019-01-22 Thread Snedden, Ali
My institution uses LMTools on a license server. I noticed in the man page and in the License Guide (https://slurm.schedmd.com/licenses.html) there is only mention of using the FlexNet Publisher (FlexLM) license server or the Reprise License Manager (RLM). Question: How do I get Slurm to interface
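
For what it's worth, Slurm's local licenses are plain counters in slurm.conf and are not tied to any particular license manager, so the same mechanism applies whether LMTools, FlexLM, or RLM sits behind them; Slurm will not query the license server itself. A sketch with a hypothetical license name and seat count:

    # slurm.conf: declare a pool of 20 seats
    Licenses=mysoft:20

    # job submission: consume one seat
    sbatch -L mysoft:1 job.sh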

Re: [slurm-users] Anyone built PMIX 3.1.1 against Slurm 18.08.4?

2019-01-22 Thread Michael Di Domenico
I've seen the same error, so I don't think it's you. But I don't know what the cause is either; I didn't have time to look into it, so I backed up to pmix 2.2.1, which seems to work fine. On Tue, Jan 22, 2019 at 12:56 AM Greg Wickham wrote: > > > Hi All, > > I’m trying to build pmix 3.1.1 against slur
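
For reference, a build sketch for pointing Slurm at a specific PMIx install (the prefix path here is just an example):

    # build Slurm against an existing PMIx prefix
    ./configure --with-pmix=/opt/pmix/2.2.1
    make && make install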

[slurm-users] Change Licenses of running jobs

2019-01-22 Thread Henkel
Hi all, does anybody know if it is (or should be) possible to change the number of licenses consumed by a running job via scontrol update job licenses=: It works for root and an admin account. But if a user wants to release some licenses of his own job, he gets Job is not pending nor runni
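
Spelled out, the command in question looks like the sketch below (job ID, license name, and count are placeholders); as noted, it currently succeeds only for root or an administrator account:

    # run as root / an admin account
    scontrol update jobid=123456 licenses=mysoft:1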

[slurm-users] Apparent scontrol reboot bug

2019-01-22 Thread Martijn Kruiten
Hi, We encounter a strange issue on our system (Slurm 18.08.3), and I'm curious whether any of you recognize this behavior. In the following example we try to reboot 32 nodes, of which 31 are idle: root# scontrol reboot ASAP nextstate=resume reason=image r8n[1-32] root# sinfo -o "%100
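
For context, a sketch of the workflow being described (the node list and reason come from the report; the sinfo format string is just an example for watching the result):

    # ask Slurm to reboot the nodes as soon as they become idle, then return them to service
    scontrol reboot ASAP nextstate=resume reason=image r8n[1-32]

    # watch node state and reason while the reboots work through
    sinfo -N -n r8n[1-32] -o "%N %T %E"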