Re: [slurm-users] how can users start their worker daemons using srun?

2018-08-31 Thread Chris Samuel
On Saturday, 1 September 2018 2:33:39 AM AEST Priedhorsky, Reid wrote:
> That is, it exceeds both the CPU count (1) and memory (1KiB) that I told
> Slurm it would use. This is what I want. Is allowing such exceedance a
> common configuration? I don’t want to rely on quirks of our site.
I think yo
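As a minimal sketch of the scenario under discussion (the daemon name and exact command line are illustrative assumptions, not taken from the thread), the request might look like:

    # Ask Slurm for 1 CPU and a 1 KiB memory limit, then launch a worker
    # daemon whose real usage exceeds both figures.
    srun -N1 -n1 -c1 --mem=1K ./my_worker_daemon   # ./my_worker_daemon is hypothetical

Whether the step is then allowed to exceed those figures depends on site configuration (for example, whether memory limits are enforced via cgroups), which is exactly the question raised above.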

Re: [slurm-users] [External] Re: serious bug about CUDA_VISIBLE_DEVICES in the slurm 17.11.7

2018-08-31 Thread Chris Samuel
On Friday, 31 August 2018 1:48:33 AM AEST Chaofeng Zhang wrote:
> This result should be CUDA_VISIBLE_DEVICES=NoDevFiles, and it really is
> NoDevFiles in 17.02. So this must be a bug in 17.11.7.
Looking at git it looks like this code got refactored out of the GPU GRES plugin and into some common
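For context, a simple way to observe the value being discussed (this exact command is an assumption, not quoted from the thread) is to print the variable from a job step that was granted no GPUs:

    # On 17.02 the thread reports CUDA_VISIBLE_DEVICES=NoDevFiles here;
    # on 17.11.7 a different (reportedly buggy) value is seen.
    srun -N1 -n1 printenv CUDA_VISIBLE_DEVICES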

Re: [slurm-users] Recovering from network failures in Slurm (without killing or restarting active jobs)

2018-08-31 Thread Paul Edmon
So there are different options you can set for ReturnToService in slurm.conf which can affect how the node is handled on reconnect. You can also increase the timeouts for the daemons.

-Paul Edmon-

On 8/31/2018 5:06 PM, Renfro, Michael wrote:
> Hey, folks. I’ve got a Slurm 17.02 cluster (RPMs
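A hedged example of the knobs mentioned above, set in slurm.conf (the specific values are illustrative, not a recommendation from the thread):

    # Let nodes marked DOWN due to lost contact return to service
    # automatically once slurmd re-registers with a valid configuration.
    ReturnToService=2
    # Give the daemons longer to ride out a network outage before the
    # controller declares a node down.
    SlurmdTimeout=600
    SlurmctldTimeout=300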

[slurm-users] Recovering from network failures in Slurm (without killing or restarting active jobs)

2018-08-31 Thread Renfro, Michael
Hey, folks. I’ve got a Slurm 17.02 cluster (RPMs provided by Bright Computing, if it matters) with both gigabit Ethernet and InfiniBand interfaces. Twice in the last year, I’ve had a failure inside the stacked Ethernet switches that caused Slurm to lose track of node and job state. Jobs kept r
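As a sketch of the sort of recovery such a failure usually calls for (these are generic Slurm administration commands, assumed rather than quoted from this message; node001 is a placeholder):

    # List nodes Slurm has marked down or drained, with the recorded reason.
    sinfo -R
    # Return a node to service without touching the jobs running on it.
    scontrol update NodeName=node001 State=RESUME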

Re: [slurm-users] how can users start their worker daemons using srun?

2018-08-31 Thread Priedhorsky, Reid
> On Aug 28, 2018, at 6:13 PM, Christopher Samuel wrote:
>
> On 29/08/18 09:10, Priedhorsky, Reid wrote:
>
>> This is surprising to me, as my interpretation is that the first run
>> should allocate only one CPU, leaving 35 for the second srun, which
>> also only needs one CPU and need not wait.
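A minimal sketch of the two-step pattern being debated (the allocation size follows the "leaving 35" figure above; the per-step --exclusive flag and the ./worker name are assumptions drawn from common practice on Slurm of this era, not from the quoted text):

    #!/bin/bash
    #SBATCH -N1 -n36
    # Two single-CPU steps launched in the background; the expectation in
    # the thread is that the second need not wait for the first.
    srun --exclusive -n1 -c1 ./worker &
    srun --exclusive -n1 -c1 ./worker &
    wait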

[slurm-users] SelectTypeParameters=CR_Core_Memory

2018-08-31 Thread Antonis Potirakis
Suppose a Slurm configuration with
- Number of Nodes=1, number of cores=20, memory=128GB
- SelectType=select/cons_res
- SelectTypeParameters=CR_Core_Memory
1) Is OverSubscribe=NO the default value generally?
2) Does OverSubscribe=NO mean that at each core only one job can be allocated but at the same
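For reference, the configuration described above would look roughly like this in slurm.conf (the node name, partition name, and exact memory figure are placeholders):

    SelectType=select/cons_res
    SelectTypeParameters=CR_Core_Memory
    NodeName=node01 CPUs=20 RealMemory=128000
    PartitionName=main Nodes=node01 OverSubscribe=NO Default=YES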