Re: [slurm-users] SLURM heterogeneous jobs, a little help needed plz

2019-03-19 Thread Chris Samuel
On Tuesday, 19 March 2019 2:03:27 PM PDT Frava wrote:
> I'm struggling to get a heterogeneous job to run...
> The Slurm version installed on the cluster is 17.11.12
Your Slurm is too old for this to work; you'll need to upgrade to 18.08. I believe you can enable them with "enable_hetero_steps" o...
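A minimal sketch of where that option would go, assuming it is set via SchedulerParameters as in the 18.08 documentation:

  # slurm.conf (18.08) -- enable heterogeneous job steps
  SchedulerParameters=enable_hetero_steps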

Re: [slurm-users] Sharing a node with non-gres and gres jobs

2019-03-19 Thread Christopher Samuel
On 3/19/19 5:31 AM, Peter Steinbach wrote:
> For example, let's say I have a 4-core GPU node called gpu1. A non-GPU job:
> $ sbatch --wrap="sleep 10 && hostname" -c 3
Can you share the output of "scontrol show job [that job id]" once you submit it, please? Also please share "scontrol show node ...
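The requested diagnostics, sketched end to end (the job ID is whatever sbatch prints back; 1234 is a placeholder):

  $ sbatch --wrap="sleep 10 && hostname" -c 3
  Submitted batch job 1234
  $ scontrol show job 1234
  $ scontrol show node gpu1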

Re: [slurm-users] Sharing a node with non-gres and gres jobs

2019-03-19 Thread Peter Steinbach
Hi Benson, as you can perhaps see from our slurm.conf, we have task affinity and similar switches turned off. Along the same lines, I also removed the core binding of the GPUs. That is why I am quite surprised that Slurm doesn't let new jobs in. I am aware of the PCIe bandwidth implications of a GP...
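What "removed the core binding" would presumably look like in gres.conf (device paths are assumptions; the point is the absence of Cores=):

  # gres.conf -- GPUs defined without core affinity
  Name=gpu File=/dev/nvidia0
  Name=gpu File=/dev/nvidia1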

[slurm-users] SLURM heterogeneous jobs, a little help needed plz

2019-03-19 Thread Frava
Hi all, I'm struggling to get a heterogeneous job to run... The Slurm version installed on the cluster is 17.11.12. Here are the SBATCH parameters of the job:

#!/bin/bash
#SBATCH --threads-per-core=1
#SBATCH --...
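For comparison, the general shape of a heterogeneous batch script as I read the 17.11/18.08 documentation (resource values and program names are placeholders):

  #!/bin/bash
  #SBATCH --threads-per-core=1
  #SBATCH --ntasks=1 --mem-per-cpu=4G
  #SBATCH packjob
  #SBATCH --ntasks=16 --mem-per-cpu=1G
  srun ./master : ./worker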

Re: [slurm-users] Slurm cannot kill a job which time limit exhausted

2019-03-19 Thread Prentice Bisbal
Slurm is trying to kill the job that is exceeding its time limit, but the job doesn't die, so Slurm marks the node down because it sees this as a problem with the node. Increasing the value of GraceTime or KillWait might help:

GraceTime
  Specifies, in units of seconds, the preemption ...
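A sketch of where those two knobs sit in slurm.conf (the values and the partition line are illustrative only):

  # slurm.conf
  KillWait=60   # seconds between SIGTERM and SIGKILL when the time limit hits
  PartitionName=batch Nodes=rn[001-010] GraceTime=120 ...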

Re: [slurm-users] Sharing a node with non-gres and gres jobs

2019-03-19 Thread Benson Muite
Hi, many MPI implementations have some sort of core-binding allocation policy, which may affect such node sharing. Would these only be limited to single-CPU jobs? Can users request a particular core? For example, for a GPU-based job some cores will have better memory transfer rates to the ...
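Requesting particular cores is possible with explicit CPU binding; a sketch (core IDs and the program are placeholders, and the right choice depends on the node's NUMA layout):

  $ srun --gres=gpu:1 --cpu_bind=map_cpu:0,1 ./bandwidth_test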

Re: [slurm-users] Sharing a node with non-gres and gres jobs

2019-03-19 Thread Peter Steinbach
I've read through the parameters. I am not sure any of them would help in our situation; what would you suggest? Note, it's not the scheduler policy that appears to hinder us. It's about how Slurm keeps track of the generic resource and (potentially) binds it to available cores. Th...
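The core binding in question, sketched as it would appear in gres.conf if it were switched on (core ranges are assumptions):

  # gres.conf -- GPUs tied to specific cores
  Name=gpu File=/dev/nvidia0 Cores=0-1
  Name=gpu File=/dev/nvidia1 Cores=2-3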

Re: [slurm-users] Sharing a node with non-gres and gres jobs

2019-03-19 Thread Peter Steinbach
Dear Eli, thanks for your reply. The slurm.conf file I suggested lists this parameter. We use:

SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory

See also: https://github.com/psteinb/docker-centos7-slurm/blob/18.08.5-with-gres/slurm.conf#L60
I'll check if that makes a difference.

Re: [slurm-users] Sharing a node with non-gres and gres jobs

2019-03-19 Thread Eli V
On Tue, Mar 19, 2019 at 8:34 AM Peter Steinbach wrote:
> Hi,
> we are struggling with a Slurm 18.08.5 installation of ours. We are in a
> situation where our GPU nodes have a considerable number of cores but
> "only" 2 GPUs inside. While people run jobs using the GPUs, non-GPU jobs
> can enter ...

[slurm-users] Sharing a node with non-gres and gres jobs

2019-03-19 Thread Peter Steinbach
Hi, we are struggling with a Slurm 18.08.5 installation of ours. We are in a situation where our GPU nodes have a considerable number of cores but "only" 2 GPUs inside. While people run jobs using the GPUs, non-GPU jobs can enter alright. However, we found out the hard way that the inverse ...
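The two job flavors being mixed, sketched as submissions (script names and sizes are placeholders):

  $ sbatch -c 3 --wrap="./cpu_only_work"            # non-GPU job
  $ sbatch -c 1 --gres=gpu:1 --wrap="./gpu_work"    # GPU job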

[slurm-users] Slurm cannot kill a job which time limit exhausted

2019-03-19 Thread Taras Shapovalov
Hey guys, when a job's max time is exceeded, Slurm tries to kill the job and fails:

[2019-03-15T09:44:03.589] sched: _slurm_rpc_allocate_resources JobId=1325 NodeList=rn003 usec=355
[2019-03-15T09:44:03.928] prolog_running_decr: Configuration for JobID=1325 is complete
[2019-03-15T09:45:12.739...
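If the job's processes truly cannot be killed (stuck in uninterruptible I/O, for instance), the knobs I would look at in slurm.conf are these; the values and the script path are placeholders:

  # slurm.conf
  UnkillableStepTimeout=180    # seconds to wait before declaring the step unkillable
  UnkillableStepProgram=/usr/local/sbin/report_unkillable.sh   # optional hook run when that happens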