Re: [slurm-users] AutoDetect=nvml throwing an error message

2021-04-14 Thread Cristóbal Navarro
Typing error; it should read --> **located at /usr/include/nvml.h**
On Wed, Apr 14, 2021 at 5:47 PM Cristóbal Navarro <cristobal.navarr...@gmail.com> wrote:
> Hi community,
> I have set up the configuration files as mentioned in the documentation,
> but the slurmd of the GPU-compute node fails with t

[slurm-users] AutoDetect=nvml throwing an error message

2021-04-14 Thread Cristóbal Navarro
Hi community, I have set up the configuration files as described in the documentation, but the slurmd on the GPU compute node fails with the following error shown in the log. After reading the Slurm documentation, it is not entirely clear to me how to properly set up GPU autodetection for GRES.

[slurm-users] srun/sbatch dependency not working

2021-04-14 Thread Darin Gowan
Dear distinguished list, I am new to SLURM. I have recently installed SLURM 20.11.3 on two separate three node clusters. The first cluster was for testing purposes using three small RHEL 7.7 VMs (8 core, 8G RAM). After a successful installation and some sbatch testing, I proceeded to the second

Re: [slurm-users] derived counters

2021-04-14 Thread Matthew BETTINGER
Before you get all excited about it, we have had a terrible time trying to get GPU metrics. We finally abandoned the effort and switched to Grafana, Prometheus, and Influx. Good luck to you though. From: slurm-users on behalf of "Heckes, Frank" Reply-To: Slurm User Community List Date: Wednesday, April 14,

Re: [slurm-users] Why does Slurm kill one particular user's jobs after a few seconds?

2021-04-14 Thread Thomas Arildsen
Oh, and I forgot to mention that we are using Slurm version 20.11.3. Best, Thomas. On Wed, 14 Apr 2021 at 09:23 +0200, Thomas Arildsen wrote: I administer a Slurm cluster with many users, and the operation of the cluster currently appears "totally normal" for all users except one. This one user

[slurm-users] Why does Slurm kill one particular user's jobs after a few seconds?

2021-04-14 Thread Thomas Arildsen
I administer a Slurm cluster with many users, and the operation of the cluster currently appears "totally normal" for all users except one. This one user gets all attempts to run commands through Slurm killed after 20-25 seconds (I think the cause is another job - not so much the time, see furt