Hi all,
We configured a partition with
OverSubscribe=YES:4
expecting that partition to start at most 4 jobs. But we see that 5 jobs
get started on a node.
We also use
--mem=34G
and since most nodes have 192G, 5 jobs would fit, but we still want only 4
jobs to start. Setting a higher mem value is
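
(For reference, a minimal sketch of this kind of setup; the partition name, node list and job script below are placeholders, not the actual configuration from this thread:)

# slurm.conf
PartitionName=batch Nodes=node[01-20] OverSubscribe=YES:4 Default=YES MaxTime=INFINITE State=UP

# submission
sbatch --mem=34G job.sh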
Many thanks Rodrigo and Daniel,
Indeed, I misunderstood that part of Slurm, so thanks for clarifying this
aspect; now it makes a lot of sense.
Regarding the approach, I went with the cgroup.conf route, as suggested
by both of you.
I will start doing some synthetic tests to make sure the job gets killed
on
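
(For anyone following along, a minimal sketch of the cgroup-based memory enforcement being referred to; the exact settings depend on the Slurm version and are assumptions here, not the poster's actual files:)

# slurm.conf
TaskPlugin=task/cgroup

# cgroup.conf
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes

With ConstrainRAMSpace=yes the kernel confines each job to the memory it requested with --mem, so a job that exceeds its request is typically OOM-killed inside its own cgroup rather than overrunning the node.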
Thanks for all your help, Kevin.
I really did miss the OverSubscribe option in the docs :-(
But now CPU job scheduling is working, and I have a clearer picture of the
GPU job scheduling problem to dig into further :-)
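
(For reference, CPU-side gang scheduling generally hinges on settings along these lines; the values here are illustrative, not the exact slurm.conf from this thread:)

# slurm.conf
# SchedulerTimeSlice is the length, in seconds, of each gang time slice
PreemptMode=SUSPEND,GANG
SchedulerTimeSlice=30
PartitionName=asimov01 Nodes=asimov OverSubscribe=FORCE Default=YES MaxTime=INFINITE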
On Fri, 13 Jan 2023 at 13:01, Kevin Broch wrote:
> Sorry to hear that. Hopefully o
Sorry to hear that. Hopefully others in the group have some
ideas/explanations. I haven't had to deal with GPU resources in Slurm.
On Fri, Jan 13, 2023 at 4:51 AM Helder Daniel wrote:
> Oh, ok.
> I guess I was expecting that the GPU job was suspended copying GPU memory
> to RAM memory.
>
> I tr
Oh, ok.
I guess I was expecting that the GPU job would be suspended by copying its
GPU memory to RAM.
I also tried REQUEUE,GANG and CANCEL,GANG.
None of these options seems able to preempt GPU jobs.
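
(For context, a sketch of how those modes are usually set; this is an assumption about the setup, not the actual config. One caveat worth double-checking: REQUEUE only applies to batch jobs that are requeueable, e.g. submitted with sbatch --requeue or with JobRequeue=1 set globally:)

# slurm.conf
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE,GANG   # or CANCEL,GANG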
On Fri, 13 Jan 2023 at 12:30, Kevin Broch wrote:
> My guess, is that this isn't possible with
PS: I checked the resources while running the 3 GPU jobs, which were
launched with:
sbatch --gpus-per-task=2 --cpus-per-task=1 cnn-multi.sh
The server has 64 cores (32 x 2 with hyperthreading):
cat /proc/cpuinfo | grep processor | tail -n1
processor : 63
128 GB main memory:
hdaniel@asimov:~/Wor
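
(Roughly equivalent quick checks for cores, memory and GPU usage while the jobs run; assuming the usual tools are installed on the node:)

nproc        # logical CPUs
free -h      # total and used RAM
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv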
My guess is that this isn't possible with GANG,SUSPEND. GPU memory isn't
managed by Slurm, so the idea of suspending a job's GPU memory so that
another job can use it simply isn't possible.
On Fri, Jan 13, 2023 at 4:08 AM Helder Daniel wrote:
> Hi Kevin
>
> I did a "scontrol show partition".
> Oversub
Hi Kevin
I did a "scontrol show partition".
Oversubscribe was not enabled.
I enabled it in slurm.conf with:
(...)
GresTypes=gpu
NodeName=asimov Gres=gpu:4 Sockets=1 CoresPerSocket=32 ThreadsPerCore=2
State=UNKNOWN
PartitionName=asimov01 *OverSubscribe=FORCE* Nodes=asimov Default=YES
MaxTime=INFINITE
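
(After editing slurm.conf, something along these lines picks up the change and confirms the partition flag; a sketch, assuming slurmctld is reachable from the node:)

scontrol reconfigure
scontrol show partition asimov01 | grep -o 'OverSubscribe=[^ ]*'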
The problem might be that OverSubscribe is not enabled? Without it, I don't
believe the time-slicing can be gang scheduled.
Can you do a "scontrol show partition" to verify that it is?
On Thu, Jan 12, 2023 at 6:24 PM Helder Daniel wrote:
> Hi,
>
> I am trying to enable gang scheduling on a server with