Hi all,
We configured a partition with
OverSubscribe=YES:4
expecting that partition to start at most 4 jobs. But we see that 5 jobs
get started on a node.
We also use
--mem=34G
and since most nodes have 192G, 5 jobs would fit, but we still want only 4
jobs to start. Setting a higher mem value is
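
(For reference, a minimal sketch of this kind of setup; the partition name, node list and job script below are placeholders, not the actual configuration from this thread:)

# slurm.conf
PartitionName=batch Nodes=node[01-20] OverSubscribe=YES:4 Default=YES MaxTime=INFINITE State=UP

# submission
sbatch --mem=34G job.sh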
Many thanks Rodrigo and Daniel,
Indeed, I misunderstood that part of Slurm, so thanks for clarifying this
aspect; now it makes a lot of sense.
Regarding the approach, I went with the cgroup.conf route, as suggested
by both of you.
I will start doing some synthetic tests to make sure the job gets killed
on
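
(For anyone following along, a minimal sketch of the cgroup-based memory enforcement being referred to; the exact settings depend on the Slurm version and are assumptions here, not the poster's actual files:)

# slurm.conf
TaskPlugin=task/cgroup

# cgroup.conf
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes

With ConstrainRAMSpace=yes the kernel confines each job to the memory it requested with --mem, so a job that exceeds its request is typically OOM-killed inside its own cgroup rather than overrunning the node.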
Thanks for all your help, Kevin.
I really did miss the OverSubscribe option in the docs :-(
But now CPU job scheduling is working, and I have a clearer picture of the
GPU job scheduling problem to dig into further :-)
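
(For reference, CPU-side gang scheduling generally hinges on settings along these lines; the values here are illustrative, not the exact slurm.conf from this thread:)

# slurm.conf
# SchedulerTimeSlice is the length, in seconds, of each gang time slice
PreemptMode=SUSPEND,GANG
SchedulerTimeSlice=30
PartitionName=asimov01 Nodes=asimov OverSubscribe=FORCE Default=YES MaxTime=INFINITE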
On Fri, 13 Jan 2023 at 13:01, Kevin Broch wrote:
> Sorry to hear that. Hopefully o
Sorry to hear that. Hopefully others in the group have some
ideas/explanations. I haven't had to deal with GPU resources in Slurm.
On Fri, Jan 13, 2023 at 4:51 AM Helder Daniel wrote:
> Oh, ok.
> I guess I was expecting that the GPU job was suspended copying GPU memory
> to RAM memory.
>
> I tr
Oh, ok.
I guess I was expecting that the GPU job would be suspended by copying its
GPU memory to RAM.
I also tried REQUEUE,GANG and CANCEL,GANG.
None of these options seems able to preempt GPU jobs.
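
(For context, a sketch of how those modes are usually set; this is an assumption about the setup, not the actual config. One caveat worth double-checking: REQUEUE only applies to batch jobs that are requeueable, e.g. submitted with sbatch --requeue or with JobRequeue=1 set globally:)

# slurm.conf
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE,GANG   # or CANCEL,GANG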
On Fri, 13 Jan 2023 at 12:30, Kevin Broch wrote:
> My guess, is that this isn't possible with
PS: I checked the resources while running the 3 GPU jobs, which were
launched with:
sbatch --gpus-per-task=2 --cpus-per-task=1 cnn-multi.sh
The server has 64 cores (32 x 2 with hyperthreading):
cat /proc/cpuinfo | grep processor | tail -n1
processor : 63
128 GB main memory:
hdaniel@asimov:~/Wor
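
(Roughly equivalent quick checks for cores, memory and GPU usage while the jobs run; assuming the usual tools are installed on the node:)

nproc        # logical CPUs
free -h      # total and used RAM
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv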
My guess is that this isn't possible with GANG,SUSPEND. GPU memory isn't
managed by Slurm, so the idea of suspending a job's GPU memory so that
another job can use it simply isn't possible.
On Fri, Jan 13, 2023 at 4:08 AM Helder Daniel wrote:
> Hi Kevin
>
> I did a "scontrol show partition".
> Oversub
Hi Kevin
I did a "scontrol show partition".
Oversubscribe was not enabled.
I enabled it in slurm.conf with:
(...)
GresTypes=gpu
NodeName=asimov Gres=gpu:4 Sockets=1 CoresPerSocket=32 ThreadsPerCore=2
State=UNKNOWN
PartitionName=asimov01 *OverSubscribe=FORCE* Nodes=asimov Default=YES
MaxTime=INFINITE
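
(After editing slurm.conf, something along these lines picks up the change and confirms the partition flag; a sketch, assuming slurmctld is reachable from the node:)

scontrol reconfigure
scontrol show partition asimov01 | grep -o 'OverSubscribe=[^ ]*'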
The problem might be that OverSubscribe is not enabled? Without it, I don't
believe the time-slicing can be gang scheduled.
Can you do a "scontrol show partition" to verify that it is?
On Thu, Jan 12, 2023 at 6:24 PM Helder Daniel wrote:
> Hi,
>
> I am trying to enable gang scheduling on a server with