Re: [slurm-users] floating condo partition, no pre-emption, guarantee a max pend time?

2020-04-27 Thread Matt Jay
Paul, I saw your message, and while I don't have a specific suggestion for your overall situation off the top of my head, I did want to point out a pitfall our site discovered early on in our implementation of our condo model cluster, which to my knowledge still exists: Specifically (see https

Re: [slurm-users] Slurm not detecting gpu after swapping out gpu

2020-04-27 Thread Sarlo, Jeffrey S
How do you have fabricnode2 defined in your gres.conf file and the slurm.conf file? Since the type of gpu changed, maybe the definition for it needs to be updated also. Jeff

[slurm-users] Slurm not detecting gpu after swapping out gpu

2020-04-27 Thread Dean Schulze
I replaced a Nvidia v100 with a t4. Now slurm thinks there is no gpu present: $ sudo scontrol show node fabricnode2 NodeName=fabricnode2 Arch=x86_64 CoresPerSocket=6 CPUAlloc=0 CPUTot=12 CPULoad=0.02 AvailableFeatures=(null) ActiveFeatures=(null) Gres=gpu:nvidia:1 NodeAddr=fabricno
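As the reply in this thread suggests, the GRES type recorded in gres.conf and slurm.conf may no longer match the new card after a swap. A minimal sketch of what the two entries might look like — the device path, the `Type=` string, and the node line are assumptions here, not taken from the poster's actual config:

```
# gres.conf on fabricnode2 (hypothetical — File path must match the actual device)
NodeName=fabricnode2 Name=gpu Type=nvidia File=/dev/nvidia0

# slurm.conf — the Gres= string must agree with gres.conf (name:type:count)
NodeName=fabricnode2 CPUs=12 Sockets=2 CoresPerSocket=6 Gres=gpu:nvidia:1
```

After editing, restarting slurmd on the node and running `scontrol reconfigure` on the controller is the usual way to pick up the change; `scontrol show node fabricnode2` should then report the GRES again.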

Re: [slurm-users] not allocating jobs even resources are free

2020-04-27 Thread Daniel Letai
Are you sure there are enough resources available? The node is in mixed state, so it's configured for both partitions - it's possible that earlier lower priority jobs are already running thus blocking the later jobs, especially since it's fifo. It would re
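To check the resource-availability theory above, the standard diagnostics are to ask squeue why each job is pending and to inspect the node's allocation counters. A sketch (the node name is a placeholder):

```shell
# Show pending jobs with the scheduler's Reason in the last column
# (%i jobid, %P partition, %u user, %t state, %M time, %R reason)
squeue -t PENDING -o "%.10i %.9P %.8u %.2t %.10M %R"

# Compare CPUAlloc vs CPUTot and AllocMem vs RealMemory on the mixed node
scontrol show node somenode01
```

If the Reason column says `Resources` or `Priority`, earlier jobs really are consuming the node, which matches the FIFO explanation given here.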

Re: [slurm-users] Error when running srun: error: task X launch failed: Invalid MPI plugin name

2020-04-27 Thread Josep Guerrero
Hi again, > > So does someone have any suggestion about what I could try? > > Please have a look at: > > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=954272 This seems to have worked. Thanks a lot! Just in case someone else is interested, that debian bug thread suggests the following wor
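For anyone hitting the same "Invalid MPI plugin name" error, a quick way to see which MPI plugins a given Slurm build actually ships, and then to select one explicitly, is (the binary name below is a hypothetical example):

```shell
# List the MPI plugin names this Slurm installation supports
srun --mpi=list

# Launch with an explicit plugin once a supported one is known,
# e.g. pmix if it appears in the list above
srun --mpi=pmix -n 4 ./hello_mpi
```

A plugin name passed to `--mpi=` (or set via MpiDefault in slurm.conf) that is absent from that list produces exactly the launch failure in the subject line.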

Re: [slurm-users] Error when running srun: error: task X launch failed: Invalid MPI plugin name

2020-04-27 Thread Gennaro Oliva
Hi Josep, On Mon, Apr 27, 2020 at 12:26:56PM +0200, Josep Guerrero wrote: > does not seem to have support for pmix. There seems to be an "openmpi" > option, > but I haven't been able to find documentation on how it is supposed to work. > So, as I understand the situation, Debian openmpi package

[slurm-users] Error when running srun: error: task X launch failed: Invalid MPI plugin name

2020-04-27 Thread Josep Guerrero
Dear all, I'm trying to install slurm, for the first time, as a queue managing system in a computing cluster. All of the nodes are using Debian 10, and for OpenMPI I'm using the distribution packages (openmpi 3.1.3): === $ ompi_info Package: Debian OpenMPI

[slurm-users] QOS cutting off users before CPU limit is reached

2020-04-27 Thread Simon Andrews
I'm trying to use QoS limits to dynamically change the number of CPUs a user is allowed to use on our cluster. As far as I can see I'm setting the appropriate GrpTRES=cpu value and I can read that back, but then jobs are being stopped before the user has reached that limit. In squeue I see loa
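One common cause of jobs stopping "before the limit" is that GrpTRES caps the QOS as a whole (all users combined), while a per-user cap is MaxTRESPerUser. A sketch of setting and verifying both — the QOS name and CPU counts here are hypothetical:

```shell
# Cap the whole QOS at 64 CPUs across all its users (hypothetical values)
sacctmgr modify qos interactive set GrpTRES=cpu=64

# Per-user cap instead, if each user should individually get up to 16 CPUs
sacctmgr modify qos interactive set MaxTRESPerUser=cpu=16

# Read back what slurmdbd actually stored
sacctmgr show qos interactive format=Name,GrpTRES,MaxTRESPU
```

If the intent is a per-user dynamic limit, adjusting MaxTRESPerUser rather than GrpTRES may be what is wanted here.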

Re: [slurm-users] need to use unused cores | wherein all compute nodes are ALLOC

2020-04-27 Thread Peter Kjellström
On Mon, 27 Apr 2020 14:51:01 +0530 Sudeep Narayan Banerjee wrote: > Dear All, > > I have 360 cpu cores in my cluster; 9 compute nodes with 20core x 2 > sockets each. > > I have slurm.18.08.7 version and have multifactor (fair share) and > backfill enabled. > > I am running jobs with less nta
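One possible explanation for all nodes showing ALLOC despite low ntasks-per-node (the truncated preview does not confirm this is the poster's setup) is whole-node allocation: with select/linear, a job claims entire nodes regardless of task count. A slurm.conf sketch that enables core-level sharing instead:

```
# slurm.conf — allocate at core granularity so leftover cores stay usable
# (with select/linear, a 4-task job still claims all 40 cores of a node)
SelectType=select/cons_res
SelectTypeParameters=CR_Core
```

Changing SelectType requires a restart of slurmctld and the slurmds, so this is a disruptive change to plan rather than a live tweak.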

[slurm-users] need to use unused cores | wherein all compute nodes are ALLOC

2020-04-27 Thread Sudeep Narayan Banerjee
Dear All, I have 360 cpu cores in my cluster; 9 compute nodes with 20core x 2 sockets each. I have slurm.18.08.7 version and have multifactor (fair share) and backfill enabled. I am running jobs with less ntasks_per_node in the script and at some point all my compute nodes are ALLOC (with