Paul,
I saw your message, and while I don't have a specific suggestion for your
overall situation off the top of my head, I did want to point out a pitfall our
site discovered early on in our implementation of our condo model cluster,
which to my knowledge still exists:
Specifically (see https
How do you have fabricnode2 defined in your gres.conf and slurm.conf files?
Since the GPU type changed, maybe its definition needs to be updated as well.
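For example, a minimal sketch of a typed definition (the device path, CPU count,
and the GresTypes line are assumptions, not taken from your configs) could look like:
===
# gres.conf on fabricnode2
NodeName=fabricnode2 Name=gpu Type=t4 File=/dev/nvidia0

# slurm.conf
GresTypes=gpu
NodeName=fabricnode2 Gres=gpu:t4:1 CPUs=12 State=UNKNOWN
===
After editing both files you would normally restart slurmd on that node and run
"scontrol reconfigure" so the new type gets picked up.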
Jeff
From: slurm-users on behalf of Dean Schulze
Sent: Monday, April 27, 2020 11:47 AM
To
I replaced an Nvidia V100 with a T4. Now Slurm thinks there is no GPU
present:
$ sudo scontrol show node fabricnode2
NodeName=fabricnode2 Arch=x86_64 CoresPerSocket=6
CPUAlloc=0 CPUTot=12 CPULoad=0.02
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:nvidia:1
NodeAddr=fabricno
Are you sure there are enough resources available? The node is in
mixed state, so it's configured for both partitions - it's
possible that earlier, lower-priority jobs are already running and
blocking the later jobs, especially since the scheduling is FIFO.
It would re
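In case it helps, a quick generic check (nothing here is specific to your cluster)
is to print the reason column for the pending jobs:

$ squeue -t PD -o "%.10i %.9P %.8u %.2t %.20r"

Reasons such as Priority or Resources would point at exactly that kind of blocking
by earlier jobs.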
Hi again,
> > So does someone have any suggestion about what I could try?
>
> Please have a look at:
>
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=954272
This seems to have worked. Thanks a lot!
Just in case someone else is interested, that Debian bug thread suggests the
following wor
Hi Josep,
On Mon, Apr 27, 2020 at 12:26:56PM +0200, Josep Guerrero wrote:
> does not seem to have support for pmix. There seems to be an "openmpi"
> option,
> but I haven't been able to find documentation on how it is supposed to work.
> So, as I understand the situation, Debian openmpi package
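One generic check, independent of the Debian packaging, is to ask Slurm itself
which MPI plugin types it was built with:

$ srun --mpi=list

If pmix is not in that list, the Slurm build simply lacks the PMIx plugin,
regardless of what the OpenMPI package supports.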
Dear all,
I'm trying to install Slurm, for the first time, as the queue management system on
a computing cluster. All of the nodes are running Debian 10, and for OpenMPI I'm
using the distribution packages (openmpi 3.1.3):
===
$ ompi_info
Package: Debian OpenMPI
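One generic way to check whether an OpenMPI build includes PMIx support (only a
sketch, not output from this cluster) is:

$ ompi_info | grep -i pmix

If no PMIx-related components appear, launching through srun with --mpi=pmix will
not work with that OpenMPI either.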
I'm trying to use QoS limits to dynamically change the number of CPUs a user is
allowed to use on our cluster. As far as I can see I'm setting the appropriate
GrpTRES=cpu value and I can read that back, but then jobs are being stopped
before the user has reached that limit.
In squeue I see loa
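For reference, the commands I'm using to set and then verify the limit (the QoS
name "normal" and the value 64 are placeholders) look roughly like:

$ sacctmgr modify qos normal set GrpTRES=cpu=64
$ sacctmgr show qos format=Name,GrpTRES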
On Mon, 27 Apr 2020 14:51:01 +0530
Sudeep Narayan Banerjee wrote:
> Dear All,
>
> I have 360 CPU cores in my cluster; 9 compute nodes with 20 cores x 2
> sockets each.
>
> I have Slurm version 18.08.7 and have multifactor (fair share) and
> backfill enabled.
>
> I am running jobs with less nta
Dear All,
I have 360 CPU cores in my cluster; 9 compute nodes with 20 cores x 2
sockets each.
I have Slurm version 18.08.7 and have multifactor (fair share) and
backfill enabled.
I am running jobs with less ntasks_per_node in the script and at some
point all my compute nodes are ALLOC (with
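A generic way to see how the CPUs on each node are actually consumed (shown only
as a sketch, not tied to my job script) is:

$ sinfo -N -o "%N %C"

where the %C column reports allocated/idle/other/total CPUs per node and shows
whether the ALLOC nodes really have all of their cores taken.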