On Saturday, 1 September 2018 2:33:39 AM AEST Priedhorsky, Reid wrote:
> That is, it exceeds both the CPU count (1) and memory (1KiB) that I told
> Slurm it would use. This is what I want. Is allowing such exceedance a
> common configuration? I don’t want to rely on quirks of our site.
I think yo
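Whether usage beyond the request is actually enforced normally comes down to
the task/cgroup settings rather than the scheduler itself. A rough sketch of
the relevant knobs (values are illustrative only, not taken from any
particular site):

    # slurm.conf
    TaskPlugin=task/cgroup

    # cgroup.conf
    ConstrainCores=yes       # confine each step to the cores it was allocated
    ConstrainRAMSpace=yes    # enforce the memory limit that was requested

With those set to "no" (or with task/cgroup not loaded at all), a step can
typically use more CPU and memory than it asked for, which would match the
behaviour you describe.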
On Friday, 31 August 2018 1:48:33 AM AEST Chaofeng Zhang wrote:
> This result should be CUDA_VISIBLE_DEVICES=NoDevFiles, and it really is
> NoDevFiles in 17.02. So this must be a bug in 17.11.7.
Looking at git, it looks like this code got refactored out of the GPU GRES plugin
and into some common
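A quick way to compare versions is to print the variable from inside a step
that requests no GPUs, e.g.:

    srun -N1 -n1 bash -c 'echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-(unset)}"'

On 17.02 that reports NoDevFiles as you describe; whatever 17.11.7 prints
instead would be consistent with the refactoring having changed what the
plugin exports.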
So there are different options you can set for ReturnToService in
slurm.conf which can affect how the node is handled when it reconnects. You
can also increase the timeouts for the daemons.
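For example, in slurm.conf (values are only illustrative, adjust for your
site):

    ReturnToService=2     # a DOWN node returns to service as soon as slurmd
                          # registers with a valid config (1 = only if it went
                          # DOWN for being non-responsive)
    SlurmdTimeout=600     # how long the controller tolerates an unreachable
                          # slurmd before marking the node DOWN
    MessageTimeout=30     # extra headroom for individual RPCs on a flaky network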
-Paul Edmon-
On 8/31/2018 5:06 PM, Renfro, Michael wrote:
Hey, folks. I’ve got a Slurm 17.02 cluster (RPMs provided by Bright Computing,
if it matters) with both gigabit Ethernet and Infiniband interfaces. Twice in
the last year, I’ve had a failure inside the stacked Ethernet switches that’s
caused Slurm to lose track of node and job state. Jobs kept running
> On Aug 28, 2018, at 6:13 PM, Christopher Samuel wrote:
>
> On 29/08/18 09:10, Priedhorsky, Reid wrote:
>
>> This is surprising to me, as my interpretation is that the first run
>> should allocate only one CPU, leaving 35 for the second srun, which
>> also only needs one CPU and need not wait.
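The scenario under discussion looks roughly like this (a sketch only; the
program names and step options are placeholders, not the actual script):

    #!/bin/bash
    #SBATCH -N1 -n36
    # two one-CPU steps that ought to be able to run side by side
    srun -n1 -c1 ./task_a &
    srun -n1 -c1 ./task_b &
    wait

If the second step ends up waiting anyway, it is usually because the first
step was handed all of the allocation's CPUs or memory by default; giving
each step explicit per-step limits (e.g. -c1 together with --mem-per-cpu) is
the usual workaround.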
Suppose a Slurm configuration with (sketched in slurm.conf form after this list):
- Number of nodes = 1, number of cores = 20, memory = 128 GB
- SelectType=select/cons_res
- SelectTypeParameters=CR_Core_Memory
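In slurm.conf terms that setup would look something like this (node and
partition names are made up):

    NodeName=node01 CPUs=20 RealMemory=131072 State=UNKNOWN
    PartitionName=main Nodes=node01 Default=YES OverSubscribe=NO State=UP
    SelectType=select/cons_res
    SelectTypeParameters=CR_Core_Memory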
1) Is OverSubscribe=NO the default value generally?
2) Does OverSubscribe=NO mean that only one job can be allocated to each core,
but at the same