On Tuesday, 19 March 2019 2:03:27 PM PDT Frava wrote:
> I'm struggling to get a heterogeneous job to run...
> The SLURM version installed on the cluster is 17.11.12
Your Slurm is too old for this to work, you'll need to upgrade to 18.08.
I believe you can enable them with "enable_hetero_steps" o
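For reference, that option goes into SchedulerParameters in slurm.conf; a minimal sketch, assuming an 18.08 install and the parameter name exactly as quoted above:
# slurm.conf (sketch): permit heterogeneous job steps on 18.08
# append to any existing SchedulerParameters list, comma-separated
SchedulerParameters=enable_hetero_steps
followed by an "scontrol reconfigure" (or a slurmctld restart) so the change is picked up.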
On 3/19/19 5:31 AM, Peter Steinbach wrote:
For example, let's say I have a 4-core GPU node called gpu1. A non-GPU job
$ sbatch --wrap="sleep 10 && hostname" -c 3
Can you share the output for "scontrol show job [that job id]" once you
submit this please?
Also please share "scontrol show node
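For reference, both diagnostics are plain scontrol calls; a sketch, with the job id a placeholder and gpu1 the example node named above:
$ scontrol show job 12345   # 12345 = the id printed by the sbatch above (placeholder)
$ scontrol show node gpu1   # node state, including allocated CPUs and TRES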
Hi Benson,
As you can perhaps see from our slurm.conf, we have task affinity and similar
switches turned off. Along the same lines, I also removed the core binding of the
GPUs. That is why I am quite surprised that Slurm doesn't allow new jobs in.
I am aware of the PCIe bandwidth implications of a GP
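For context, "removed the core binding of the GPUs" would correspond to a gres.conf along these lines; a sketch only (gpu1 and the device files are placeholders, not the poster's actual file):
# gres.conf (sketch): GPU entries with no Cores= field, i.e. no GPU-to-core binding
NodeName=gpu1 Name=gpu File=/dev/nvidia0
NodeName=gpu1 Name=gpu File=/dev/nvidia1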
Hi all,
I'm struggling to get a heterogeneous job to run...
The SLURM version installed on the cluster is 17.11.12
Here are the SBATCH file parameters of the job :
#!/bin/bash
#SBATCH --threads-per-core=1
#SBATCH --
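For comparison, a heterogeneous batch script in 17.11/18.08 separates its components with a "#SBATCH packjob" line; a minimal sketch, not the poster's actual (truncated) script above:
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH packjob
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=1
srun hostname   # by default the step runs in the first pack group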
Slurm is trying to kill the job that is exceeding its time limit, but
the job doesn't die, so Slurm marks the node down because it sees this
as a problem with the node. Increasing the value for GraceTime or
KillWait might help:
*GraceTime*
Specifies, in units of seconds, the preemption
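For reference, both knobs live in slurm.conf; a sketch with purely illustrative values (the partition name is a placeholder, rn003 is the node from the log further down):
KillWait=60                                    # seconds between SIGTERM and SIGKILL at the time limit
PartitionName=batch Nodes=rn003 GraceTime=120  # per-partition preemption grace period, in seconds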
Hi,
Many MPI implementations will have some sort of core binding allocation
policy, which may impact such node sharing. Would these only be limited
to single-CPU jobs? Can users request a particular core? For example, for
a GPU-based job some cores will have better memory transfer rates to the
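For what it's worth, srun does let a user hand it an explicit CPU map; a hedged sketch using the 18.08-era option spelling (core id and program name are made up):
$ srun --cpu_bind=map_cpu:2 ./gpu_app   # pin the task to core 2, e.g. a core close to the GPU
Note that --cpu_bind only takes effect when the task/affinity (or task/cgroup) plugin is active.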
I've read through the parameters. I am not sure whether any of those would
help in our situation. What suggestions would you make? Note that it's not
the scheduler policy that appears to hinder us; it's about how Slurm
keeps track of the generic resource and (potentially) binds it to
available cores. Th
Dear Eli,
thanks for your reply. The slurm.conf file I suggested lists this
parameter. We use
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
See also:
https://github.com/psteinb/docker-centos7-slurm/blob/18.08.5-with-gres/slurm.conf#L60
I'll check if that makes a difference.
On Tue, Mar 19, 2019 at 8:34 AM Peter Steinbach wrote:
>
> Hi,
>
> we are struggling with a Slurm 18.08.5 installation of ours. We are in a
> situation where our GPU nodes have a considerable number of cores but
> "only" 2 GPUs inside. While people run jobs using the GPUs, non-GPU jobs
> can ente
Hi,
we are struggling with a Slurm 18.08.5 installation of ours. We are in a
situation where our GPU nodes have a considerable number of cores but
"only" 2 GPUs inside. While people run jobs using the GPUs, non-GPU jobs
can enter alright. However, we found out the hard way that the inverse
Hey guys,
When a job's max time is exceeded, Slurm tries to kill the job and fails:
[2019-03-15T09:44:03.589] sched: _slurm_rpc_allocate_resources JobId=1325
NodeList=rn003 usec=355
[2019-03-15T09:44:03.928] prolog_running_decr: Configuration for JobID=1325
is complete
[2019-03-15T09:45:12.739