[slurm-dev] QOS grp_used_submit_jobs value wrong until slurmctld restarted

2015-03-27 Thread Trey Dockendorf
We recently ran into an issue where a user was submitting job arrays of 0-500 one after another and would receive a message that the QOS limit had been reached, even though "squeue --qos hepx --noheader | wc -l" would only print ~3000. The GrpSubmitJobs value for that QOS is 5000. I looked at the code of
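
A quick way to compare the configured limit with what is actually queued, sketched here assuming the QOS is named hepx as in the post (exact sacctmgr/squeue options per the man pages for the installed release):

  # configured per-QOS submit limit
  sacctmgr show qos name=hepx format=Name,GrpSubmitJobs
  # jobs currently charged against the QOS; -r/--array expands pending
  # job arrays into one line per element, which matters for array submissions
  squeue -r --qos=hepx --noheader | wc -l

If the second number stays well below GrpSubmitJobs while submissions are rejected, the controller's internal grp_used_submit_jobs counter is the suspect, as the subject suggests.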

[slurm-dev] Re: Current resource limit status

2015-03-27 Thread Jared Casper
We aren't using QOS values right now, just basic node limits on accounts, but I'm guessing they shouldn't be hard to adapt. I'd certainly appreciate it if you can send them along. Thanks! Jared On Fri, Mar 27, 2015 at 6:00 AM, Bill Wichser wrote: > > Jared, > > > I have a few script to show Q

[slurm-dev] Re: Change in behavior of "--export" option

2015-03-27 Thread Jesse Stroik
Shawn, We observed the same behavior with our upgrade. I believe --export was not an option for srun prior to 14.11.0-pre4. From the NEWS file included with SLURM: = -- Added srun --export option to set/export specific environment variables. = According to the man page, t
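
For reference, a minimal sketch of the forms the option takes in the 14.11-era man pages (MYVAR and a.out are placeholders; exact semantics for srun may differ slightly from sbatch, so check the man page of the installed release):

  # export only the named variables in addition to Slurm's own SLURM_* variables
  srun --export=MYVAR=42,PATH ./a.out
  # export the full caller environment (the historical default behavior)
  srun --export=ALL ./a.out
  # export nothing from the caller's environment
  srun --export=NONE ./a.out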

[slurm-dev] Re: slurmctld thread number blowups leading to deadlock in 14.11.4

2015-03-27 Thread Stuart Rankin
Thanks very much for these suggestions - I've set a value for max_rpc_cnt and we should see soon if this helps. Cheers Stuart On 27/03/15 14:09, Paul Edmon wrote: > > So we have had the same problem, usually due to the scheduler receiving tons > of requests. Usually > this is fixed by havi

[slurm-dev] Re: slurmctld thread number blowups leading to deadlock in 14.11.4

2015-03-27 Thread Paul Edmon
So we have had the same problem, usually due to the scheduler receiving tons of requests. Usually this is fixed by having the scheduler slow itself down using the defer or max_rpc_cnt options. We in particular use max_rpc_cnt=16. I actually did a test yesterday where I removed this and
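
Both knobs live in SchedulerParameters; a minimal slurm.conf sketch using the value mentioned above (defer is independent of max_rpc_cnt and can be omitted):

  # slurm.conf
  SchedulerParameters=defer,max_rpc_cnt=16

  # push the change to the running controller
  scontrol reconfigure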

[slurm-dev] Execution nice level

2015-03-27 Thread Brian B
Greetings, Our cluster has both non-Slurm-controlled interactive jobs and Slurm-controlled jobs being run on it. In general we would like to prioritize the non-Slurm-controlled interactive jobs by having Slurm jobs niced to a level higher than the default. Is this possible? Regards, Brian s
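
One approach sometimes used for this (not confirmed in this thread) is to have Slurm propagate the nice value of the submitting process to the job's tasks via PropagatePrioProcess, then submit under nice; a sketch, assuming that behavior is acceptable site-wide and with job.sh as a placeholder script:

  # slurm.conf: propagate the submitting process's nice value to spawned tasks
  # (see the PropagatePrioProcess description in the slurm.conf man page)
  PropagatePrioProcess=1

  # submit so that the job's tasks inherit nice 10
  nice -n 10 sbatch job.sh

Note this depends on the nice level of the submitting environment rather than enforcing a cluster-wide default, so it may only partially address the question.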

[slurm-dev] Re: Current resource limit status

2015-03-27 Thread Bill Wichser
Jared, I have a few scripts to show QOS values and what is running under each. Users can use this to see how many resources are left. Something like this: # qos Name Priority GrpNodes GrpCPUs MaxCPUsPU MaxJobsPU MaxNodesPU MaxSubmit -- -- -
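
Output like the above can be approximated directly with sacctmgr and squeue; a rough sketch (format field names are a guess at the columns shown and may need adjusting per the sacctmgr man page; <qosname> is a placeholder):

  # per-QOS limits, roughly matching the columns above
  sacctmgr show qos format=Name,Priority,GrpNodes,GrpCPUs,MaxJobsPerUser,MaxSubmitJobs
  # what is currently running under a given QOS: user, CPU count, node count
  squeue --qos=<qosname> -t RUNNING --noheader -o "%u %C %D"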

[slurm-dev] Re: slurmctld thread number blowups leading to deadlock in 14.11.4

2015-03-27 Thread Mehdi Denou
Hi, Using gdb you can retrieve which thread owns the locks on the slurmctld internal structures (and blocks all the others). Then it will be easier to understand what is happening. On 27/03/2015 12:24, Stuart Rankin wrote: > Hi, > > I am running slurm 14.11.4 on an 800 node RHEL6.6 general-purpo
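
A minimal way to do that on the controller node (more useful if slurmctld was built with debugging symbols):

  # attach to the running controller
  gdb -p $(pidof slurmctld)
  # dump every thread's stack
  (gdb) thread apply all bt
  # look for the many threads blocked in pthread_rwlock_*/pthread_mutex_* calls
  # and for the one thread doing real work while holding the contested lock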

[slurm-dev] Re: Slurm and MUNGE security

2015-03-27 Thread Mehdi Denou
The VPN will only guarantee that no one can sniff the traffic between nodes. It will not help you if one node is compromised: the attacker can use the VPN to communicate with the rest of the cluster. On 27/03/2015 12:22, Simon Michnowicz wrote: > Re: [slurm-dev] Re: Slurm and MUNGE security >

[slurm-dev] slurmctld thread number blowups leading to deadlock in 14.11.4

2015-03-27 Thread Stuart Rankin
Hi, I am running slurm 14.11.4 on an 800-node RHEL6.6 general-purpose university cluster. Since upgrading from 14.03.3 we have been seeing the following problem and I'd appreciate any advice (maybe it's a bug, but maybe I'm missing something obvious). Occasionally the number of slurmctld threads
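
Two quick checks often used to watch the thread count climb, both assuming standard tools on the controller node:

  # Slurm's own view: the controller reports its server thread count
  sdiag | grep -i 'thread count'
  # the OS view: number of threads in the slurmctld process
  ps -o nlwp= -p $(pidof slurmctld)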

[slurm-dev] Re: Slurm and MUNGE security

2015-03-27 Thread Simon Michnowicz
Mehdi, thanks for the response. Even though I am not sure how a MUNGE key could be compromised (if you became root on a box you could equally take the ssh keys), prudence would dictate that SLURM traffic go via a VPN, so that one bad node does not affect others? Regards, Simon On 27 March 2015 at 2

[slurm-dev] Re: Slurm and MUNGE security

2015-03-27 Thread Mehdi Denou
Hi Simon, As far as I know, munge allows communications to be authenticated, but they are not encrypted. If the key is compromised, you may be able to send RPCs to slurm daemons pretending you are the slurm controller (and that the user requesting the job is root). So yes, in theory you should be able to exec

[slurm-dev] SLURM (14.03.0) - Issue with overriding JobRequeue=0 from job

2015-03-27 Thread Roshan Mathew
Hi all, Has anybody seen a similar issue? I want to keep no-requeue as the default, but allow users to override it from the job script or command line during submission. After setting JobRequeue=0 in slurm.conf, jobs are not getting requeued - neither using sbatch --requeue jobscript.slurm or us
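
For reference, the pieces involved as described in the slurm.conf and sbatch man pages (jobscript.slurm is from the post; <jobid> is a placeholder):

  # slurm.conf: make "no requeue" the cluster-wide default
  JobRequeue=0
  # per-job override at submission time
  sbatch --requeue jobscript.slurm
  # check what the controller recorded for the submitted job
  scontrol show job <jobid> | grep -o 'Requeue=[01]'

If the scontrol output still shows Requeue=0 after submitting with --requeue, the override is not being honored, which appears to be the behavior being reported here.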