When I set up similar queues and wanted my GPU jobs to use at most 8 cores per GPU, I set Cores=0-7 and Cores=8-15 for the two GPU devices in gres.conf. Have you tried reducing your ranges to Cores=0 and Cores=20?
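For reference, a minimal sketch of that change, reusing the device files and GPU type from your gres.conf below (whether it helps depends on your binding settings, so treat it as something to experiment with rather than a known fix):

    Name=gpu Type=v100 File=/dev/nvidia0 Cores=0
    Name=gpu Type=v100 File=/dev/nvidia1 Cores=20

The idea is that each GPU is then associated with a single core on its own socket instead of claiming the socket's entire core range.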
> On Feb 27, 2020, at 9:51 PM, Pavel Vashchenkov <vas...@itam.nsc.ru> wrote:
>
> Hello,
>
> I have a hybrid cluster with 2 GPUs and two 20-core CPUs on each node.
>
> I created two partitions:
> - "cpu" for CPU-only jobs, which are allowed to allocate up to 38 cores per node
> - "gpu" for GPU jobs, which are allowed to allocate up to 2 GPUs and 2 CPU cores
>
> Respective sections in slurm.conf:
>
> # NODES
> NodeName=node[01-06] Sockets=2 CoresPerSocket=20 ThreadsPerCore=1 Gres=gpu:2(S:0-1) RealMemory=257433
>
> # PARTITIONS
> PartitionName=cpu Default=YES Nodes=node[01-06] MaxNodes=6 MinNodes=0 DefaultTime=04:00:00 MaxTime=14-00:00:00 MaxCPUsPerNode=38
> PartitionName=gpu Nodes=node[01-06] MaxNodes=6 MinNodes=0 DefaultTime=04:00:00 MaxTime=14-00:00:00 MaxCPUsPerNode=2
>
> and in gres.conf:
> Name=gpu Type=v100 File=/dev/nvidia0 Cores=0-19
> Name=gpu Type=v100 File=/dev/nvidia1 Cores=20-39
>
> However, it does not seem to be working properly. If I first submit a GPU job using all the resources available in the "gpu" partition, and then a CPU job allocating the rest of the CPU cores (i.e. 38 cores per node) in the "cpu" partition, it works perfectly fine: both jobs start running. But if I reverse the submission order and start the CPU job before the GPU job, the "cpu" job starts running while the "gpu" job stays in the queue in PENDING state with reason "Resources".
>
> My first guess was that the "cpu" job allocates cores assigned to the respective GPUs in gres.conf and prevents the GPU devices from running. However, that seems not to be the case, because requesting 37 cores per node instead of 38 solves the problem.
>
> Another thought was that it has something to do with specialized core reservation, but I tried changing the CoreSpecCount option without success.
>
> So, any ideas on how to fix this behavior and where I should look?
>
> Thanks!
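One way to check whether core bindings are actually the culprit (a generic Slurm diagnostic, not something from the original mail): submit the 38-core CPU job, then run

    scontrol -d show job <jobid>

where <jobid> is a placeholder for the running CPU job's ID. The detailed output lists the exact CPU_IDs allocated on each node, which you can compare against the Cores= ranges in gres.conf to see which cores remain free for the pending GPU job.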