Thanks for all your help, Kevin.
I really did miss the OverSubscribe option in the docs :-(
But now CPU job scheduling is working, and I have a picture of the problem
with GPU job scheduling to dig into further :-)
On Fri, 13 Jan 2023 at 13:01, Kevin Broch wrote:
Sorry to hear that. Hopefully others in the group have some
ideas/explanations. I haven't had to deal with GPU resources in Slurm.
On Fri, Jan 13, 2023 at 4:51 AM Helder Daniel wrote:
Oh, ok.
I guess I was expecting that the GPU job was suspended copying GPU memory
to RAM memory.
I also tried REQUEUE,GANG and CANCEL,GANG.
None of these options seems able to preempt GPU jobs.
On Fri, 13 Jan 2023 at 12:30, Kevin Broch wrote:
> My guess, is that this isn't possible with
PS: I checked the resources while running the 3 GPU jobs, which were
launched with:
sbatch --gpus-per-task=2 --cpus-per-task=1 cnn-multi.sh
The server has 64 logical cores (32 physical x 2 with hyperthreading):
cat /proc/cpuinfo | grep processor | tail -n1
processor : 63
128 GB main memory:
hdaniel@asimov:~/Wor
My guess is that this isn't possible with GANG,SUSPEND. GPU memory isn't
managed by Slurm, so the idea of suspending a job's GPU memory so that
another job can use it simply isn't possible.
On Fri, Jan 13, 2023 at 4:08 AM Helder Daniel wrote:
Hi Kevin
I did a "scontrol show partition".
Oversubscribe was not enabled.
I enabled it in slurm.conf with:
(...)
GresTypes=gpu
NodeName=asimov Gres=gpu:4 Sockets=1 CoresPerSocket=32 ThreadsPerCore=2 State=UNKNOWN
PartitionName=asimov01 *OverSubscribe=FORCE* Nodes=asimov Default=YES
MaxTime=INFINI
The problem might be that OverSubscribe is not enabled. Without it, I don't
believe the time-slicing can be gang scheduled.
Can you do a "scontrol show partition" to verify that it is?
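A minimal sketch of that check, for anyone who wants to script it. It assumes the usual space-separated Key=Value layout of "scontrol show partition" output; the function names and the abbreviated SAMPLE text are hypothetical and are not output captured from this cluster.

```python
import subprocess


def oversubscribe_settings(scontrol_output: str) -> dict[str, str]:
    """Map each partition name to its OverSubscribe value.

    Assumes the space-separated Key=Value layout printed by
    `scontrol show partition`.
    """
    settings: dict[str, str] = {}
    current = None
    for token in scontrol_output.split():
        key, sep, value = token.partition("=")
        if not sep:
            continue  # not a Key=Value token
        if key == "PartitionName":
            current = value
        elif key == "OverSubscribe" and current is not None:
            settings[current] = value
    return settings


def is_forced(value: str) -> bool:
    """True for FORCE or FORCE:<count>, the setting used later in this thread."""
    return value.upper().startswith("FORCE")


def live_settings() -> dict[str, str]:
    """Query a real cluster; needs scontrol on PATH (not exercised here)."""
    result = subprocess.run(
        ["scontrol", "show", "partition"],
        capture_output=True, text=True, check=True,
    )
    return oversubscribe_settings(result.stdout)


# Hypothetical, abbreviated sample used only to illustrate the parsing.
SAMPLE = """\
PartitionName=asimov01
   Nodes=asimov
   RootOnly=NO ReqResv=NO OverSubscribe=NO
"""

if __name__ == "__main__":
    for name, value in oversubscribe_settings(SAMPLE).items():
        status = "forced" if is_forced(value) else "not forced"
        print(f"{name}: OverSubscribe={value} ({status})")
```

On a real system, call live_settings() instead of parsing SAMPLE.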
On Thu, Jan 12, 2023 at 6:24 PM Helder Daniel wrote:
Hi,
I am trying to enable gang scheduling on a server with a CPU with 32 cores
and 4 GPUs.
However, with gang scheduling enabled, neither the CPU jobs nor the GPU jobs
are preempted after the time slice, which is set to 30 seconds.
Below is a snapshot of squeue. There are 3 jobs each needing 32 cores. The
first