Hi Beatrice,
we are also still on 18.08.7, but we have a similar problem here with
the billing, which is much too high (cf. "[slurm-users] exclusive or
not exclusive, that is the question"). Slurm > 18.08.7 exacerbates
the problem, as those jobs don't even get scheduled :/
Best
Marcus
On 1/10/20 4:58 PM, Beatrice Charton wrote:
Hi,
Happy New Year ;-)
I just updated Slurm to 18.08.9: same behaviour. Jobs still stay PD forever
instead of being refused :-(
Am I the only one in this situation ?
Sincerely,
Béatrice
On 16 Dec 2019 at 09:49, Beatrice Charton <beatrice.char...@criann.fr> wrote:
Hi Marcus and Bjørn-Helge
Thank you for your answers.
We don’t use Slurm billing; we use system accounting billing.
I also confirm that with --exclusive there is a difference between ReqCPUS and
AllocCPUS. Previously, --mem-per-cpu behaved more like a --mem-per-task than a
--mem-per-cpu: it was associated with ReqCPUS. It looks like it is now associated
with AllocCPUS.
If it’s not a side effect, why are the jobs not rejected instead of being accepted
and left pending forever?
The behaviour is the same in 19.05.2 but corrected again in 19.05.3, so the problem
seems to be known in v19 but not fixed in v18.
Sincerely,
Béatrice
On 12 Dec 2019 at 12:10, Marcus Wagner <wag...@itc.rwth-aachen.de> wrote:
Hi Beatrice and Bjørn-Helge,
I can confirm that it works with 18.08.7. We additionally use TRESBillingWeights
together with PriorityFlags=MAX_TRES. For example:
TRESBillingWeights="CPU=1.0,Mem=0.1875G,gres/gpu=12.0"
We use the billing factor for our external accounting, so that the nodes are
accounted for fairly. But we do see a similar effect due to --exclusive.
In Beatrice's case, the billing weights would be:
TRESBillingWeights="CPU=1.0,Mem=0.21875G"
So a 10-cpu job with 1 GB per cpu would be billed 10.
A 1-cpu job with 10 GB would be billed 2 (0.21875*10, floor).
An exclusive 10-cpu job with 1 GB per cpu would be billed 28 (all 28 cores go
to the job).
An exclusive 1-cpu job with 30 GB per cpu (Beatrice's example) would be billed
28 (cores) * 30 (GB) * 0.21875 => 183.75 => 183 core-equivalents.
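For illustration only, here is a small Python sketch of that MAX_TRES arithmetic
(not Slurm code; the node size and weights are the ones above, and it assumes that
--exclusive effectively allocates all 28 cores with --mem-per-cpu applied to each):

# Rough sketch of MAX_TRES billing with the weights above.
# Not Slurm source code; it just reproduces the worked examples.
import math

CPU_WEIGHT = 1.0             # CPU=1.0
MEM_WEIGHT_PER_GB = 0.21875  # Mem=0.21875G (28 cores / 128 GB)
CORES_PER_NODE = 28

def billing(cpus, mem_per_cpu_gb, exclusive=False):
    if exclusive:
        cpus = CORES_PER_NODE      # whole node is allocated to the job
    mem_gb = cpus * mem_per_cpu_gb
    # MAX_TRES: the most "expensive" TRES on the node determines the billing
    return math.floor(max(cpus * CPU_WEIGHT, mem_gb * MEM_WEIGHT_PER_GB))

print(billing(10, 1))                  # 10
print(billing(1, 10))                  # 2
print(billing(10, 1, exclusive=True))  # 28
print(billing(1, 30, exclusive=True))  # 183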
Best
Marcus
On 12/12/19 9:47 AM, Bjørn-Helge Mevik wrote:
Beatrice Charton <beatrice.char...@criann.fr> writes:
Hi,
We have a strange behaviour of Slurm after updating from 18.08.7 to
18.08.8, for jobs using --exclusive and --mem-per-cpu.
Our nodes have 128GB of memory, 28 cores.
$ srun --mem-per-cpu=30000 -n 1 --exclusive hostname
=> works in 18.08.7
=> doesn’t work in 18.08.8
I'm actually surprised it _worked_ in 18.08.7. At one time, long before
v18.08, the behaviour was changed when using --exclusive: in order to
account the job for all cpus on the node, the number of cpus asked for
with --ntasks is simply multiplied by "#cpus-on-node / --ntasks" (so in
your case: 28). Unfortunately, that also means that the memory the job
requires per node becomes that multiplied cpu count times --mem-per-cpu
(in your case 28 * 30000 MiB ~= 820 GiB, far more than the node has).
For this reason, we tend to ban --exclusive on our clusters
(or at least warn about it).
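As a back-of-the-envelope check (plain Python, not Slurm code; the node figures
come from your mail, treating the 128 GB as GiB for simplicity):

# Why the --exclusive request can never be satisfied on a 28-core, 128 GB node.
CORES_PER_NODE = 28
NODE_MEM_MIB = 128 * 1024            # ~128 GiB per node

mem_per_cpu_mib = 30000              # srun --mem-per-cpu=30000
required_mib = CORES_PER_NODE * mem_per_cpu_mib   # 840000 MiB

print(round(required_mib / 1024, 1), "GiB required per node")   # ~820.3
print("fits on the node:", required_mib <= NODE_MEM_MIB)        # False -> job stays PD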
I haven't looked at the code for a long time, so I don't know whether
this is still the current behaviour, but every time I've tested, I've
seen the same problem. I believe I've tested on 19.05 (but I might be
misremembering).
--
Béatrice CHARTON | CRIANN
beatrice.char...@criann.fr | 745, avenue de l'Université
Tel : +33 (0)2 32 91 42 91 | 76800 Saint Etienne du Rouvray
--- Support : supp...@criann.fr ---
--
Marcus Wagner, Dipl.-Inf.
IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wag...@itc.rwth-aachen.de
www.itc.rwth-aachen.de