[slurm-users] Re: Updated one compute node to Ubuntu 24.04 LTS, now it does not receive jobs

2024-09-29 Thread Cristóbal Navarro via slurm-users
Sat, Sep 28, 2024, 2:13 PM Cristóbal Navarro <cristobal.navarr...@gmail.com> wrote: > Dear community, > I am having a strange issue and have been unable to find the cause. Last > week I did a full update on the cluster, which is composed of the master > node and two comput

[slurm-users] Updated one compute node to Ubuntu 24.04 LTS, now it does not receive jobs

2024-09-28 Thread Cristóbal Navarro via slurm-users
Dear community, I am having a strange issue and have been unable to find the cause. Last week I did a full update on the cluster, which is composed of the master node and two compute nodes (nodeGPU01 -> DGXA100 and nodeGPU02 -> custom GPU server). After the update, I got - master node ended up w
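
When a node stops accepting jobs after an OS upgrade, a first diagnostic step (a general sketch, not the specific resolution of this thread) is to check how the controller sees the node and, if it is down or drained with a stale reason, return it to service:

    # Show node states and the reason Slurm records for any down/drained node
    sinfo -N -l
    scontrol show node nodeGPU01

    # If the reason is stale (e.g. left over from the upgrade), resume the node
    scontrol update NodeName=nodeGPU01 State=RESUME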

Re: [slurm-users] slurmstepd: error: load_ebpf_prog: BPF load error (No space left on device). Please check your system limits (MEMLOCK).

2024-01-24 Thread Cristóbal Navarro
Stefan Fleischmann wrote: > > On Wed, 24 Jan 2024 12:37:04 -0300 Cristóbal Navarro > > wrote: > >> Many thanks. > >> One question: do we have to apply this patch (and recompile Slurm, I > >> guess) only on the compute node with problems? > >> Also, I notic

Re: [slurm-users] slurmstepd: error: load_ebpf_prog: BPF load error (No space left on device). Please check your system limits (MEMLOCK).

2024-01-24 Thread Cristóbal Navarro
Many thanks. One question: do we have to apply this patch (and recompile Slurm, I guess) only on the compute node with problems? Also, I noticed the patch now appears as "obsolete"; is that OK? On Wed, Jan 24, 2024 at 9:52 AM Stefan Fleischmann wrote: > Turns out I was wrong, this is not a problem

Re: [slurm-users] slurmstepd: error: load_ebpf_prog: BPF load error (No space left on device). Please check your system limits (MEMLOCK).

2024-01-22 Thread Cristóbal Navarro
Hi Tim and community, We started having the same issue (cgroups not working, it seems, with jobs seeing all GPUs) on a GPU compute node (DGX A100) a couple of days ago after a full update (apt upgrade). Now whenever we launch a job for that partition, we get the error message mentioned by Tim.
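
The error text itself points at the locked-memory limit. A commonly suggested workaround, assuming slurmd runs under systemd (the thread's eventual fix was the patch discussed in the replies above), is to raise MEMLOCK for the slurmd service via a drop-in:

    # systemctl edit slurmd   -- add the following override, then run:
    # systemctl daemon-reload && systemctl restart slurmd
    [Service]
    LimitMEMLOCK=infinity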

Re: [slurm-users] Problem with cgroup plugin in Ubuntu22.04 and slurm 21.08.5

2023-07-24 Thread Cristóbal Navarro
Hello Angel and Community, I am facing a similar problem with a DGX A100 running DGX OS 6 (based on Ubuntu 22.04 LTS) and Slurm 23.02. When I start the `slurmd` service, its status shows failed with the information below. As of today, what is the best solution to this problem? I am really not s
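
Ubuntu 22.04 boots with cgroup v2 by default, which older cgroup plugin configurations do not handle. A minimal cgroup.conf sketch for Slurm 23.02, which ships a cgroup/v2 plugin (parameter names are from the cgroup.conf man page; the thread does not confirm this resolves the failure above):

    # /etc/slurm/cgroup.conf
    CgroupPlugin=cgroup/v2
    ConstrainCores=yes
    ConstrainRAMSpace=yes
    ConstrainDevices=yes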

Re: [slurm-users] Jobs can grow in RAM usage surpassing MaxMemPerNode

2023-01-13 Thread Cristóbal Navarro
MSpace*, where processes in the step will > be killed, but the step will be left active, possibly with other processes > left running. > > > > On 12/01/2023 03:47:53, Cristóbal Navarro wrote: > > Hi Slurm community, > Recently we found a small problem triggered by one of our

[slurm-users] Jobs can grow in RAM usage surpassing MaxMemPerNode

2023-01-11 Thread Cristóbal Navarro
Hi Slurm community, Recently we found a small problem triggered by one of our jobs. We have a *MaxMemPerNode*=*532000* setting for our compute node in the slurm.conf file; however, we found that a job that started with mem=65536 was able, after hours of execution, to grow its memory usage durin
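
MaxMemPerNode caps what a job may request; enforcing the limit while the job runs is a separate mechanism. A hedged sketch of cgroup-based run-time enforcement (parameter names from the slurm.conf and cgroup.conf man pages; values are illustrative):

    # slurm.conf
    TaskPlugin=task/cgroup,task/affinity

    # cgroup.conf
    ConstrainRAMSpace=yes   # hold each job to its requested memory via cgroups
    AllowedRAMSpace=100     # percent of the request the job may actually use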

Re: [slurm-users] [EXT] [Beginner, SLURM 20.11.2] Unable to allocate resources when specifying gres in srun or sbatch

2021-05-20 Thread Cristóbal Navarro
Hi Community, just wanted to share that this problem got solved with the help of the pyxis developers (https://github.com/NVIDIA/pyxis/issues/47). The solution was to add ConstrainDevices=yes, which was missing in the cgroup.conf file. On Thu, May 13, 2021 at 5:14 PM Cristóbal Navarro <cristobal.nav
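
For reference, the fix described above amounts to one line in cgroup.conf (the comment is illustrative, not quoted from the thread):

    # /etc/slurm/cgroup.conf
    ConstrainDevices=yes   # jobs see only the GPUs they were allocated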

Re: [slurm-users] [EXT] [Beginner, SLURM 20.11.2] Unable to allocate resources when specifying gres in srun or sbatch

2021-05-13 Thread Cristóbal Navarro
emPerGPU=65556 MaxMemPerNode=532000 MaxTime=1-00:00:00 State=UP Nodes=nodeGPU01 Default=YES PartitionName=cpu OverSubscribe=No MaxCPUsPerNode=64 DefMemPerNode=16384 MaxMemPerNode=42 MaxTime=1-00:00:00 State=UP Nodes=nodeGPU01 On Tue, Apr 13, 2021 at 9:38 PM Cristóbal Navarro < cristobal.navarr...

Re: [slurm-users] One node, two partitions (gpu and cpu), can SLURM map cpu cores well?

2021-05-08 Thread Cristóbal Navarro
de the accurately > describes what you want to allow access to at all. > > Then you limit what is allowed to be requested in the partition definition > and/or a QOS (if you are using accounting). > > Brian Andrus > On 5/7/2021 8:11 PM, Cristóbal Navarro wrote: > >

[slurm-users] One node, two partitions (gpu and cpu), can SLURM map cpu cores well?

2021-05-07 Thread Cristóbal Navarro
Hi community, I am unable to tell whether SLURM handles the following situation efficiently in terms of CPU affinities in each partition. Here we have a very small cluster with just one GPU node with 8x GPUs, which offers two partitions --> "gpu" and "cpu". Part of the Config File ## Nodes list ## u
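
A later reply in this thread (2021-05-13, above) quotes the layout that was eventually used. A condensed sketch of the idea, with the cpu partition capped via MaxCPUsPerNode so the gpu partition keeps dedicated cores (the gpu line is an assumption; the cpu line follows the quoted fragment):

    # slurm.conf (sketch)
    PartitionName=gpu Nodes=nodeGPU01 Default=YES MaxTime=1-00:00:00 State=UP
    PartitionName=cpu Nodes=nodeGPU01 OverSubscribe=No MaxCPUsPerNode=64 MaxTime=1-00:00:00 State=UP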

Re: [slurm-users] What is an easy way to prevent users from running programs on the master/login node?

2021-04-27 Thread Cristóbal Navarro
se use srun or sbatch...". > > Patrick > > On 24/04/2021 at 10:03, Ole Holm Nielsen wrote: > >> On 24-04-2021 04:37, Cristóbal Navarro wrote: > >>> Hi Community, > >>> I have a set of users who are still not very familiar with Slurm, a

[slurm-users] What is an easy way to prevent users from running programs on the master/login node?

2021-04-23 Thread Cristóbal Navarro
Hi Community, I have a set of users who are still not very familiar with Slurm, and yesterday they bypassed srun/sbatch and ran their CPU program directly on the head/login node, thinking it would still run on the compute node. I am aware that I will need to teach them some basic usage, but in the meanwh
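
One commonly suggested stopgap, not necessarily the solution settled on in this thread, is to cap user sessions on the login node with a systemd slice drop-in so stray compute runs cannot monopolize it (path and values are illustrative):

    # /etc/systemd/system/user-.slice.d/10-limits.conf
    [Slice]
    CPUQuota=400%    # at most ~4 cores per user session
    MemoryMax=16G    # hard per-user memory cap

    # then: systemctl daemon-reload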

Re: [slurm-users] AutoDetect=nvml throwing an error message

2021-04-15 Thread Cristóbal Navarro
when you built the slurm source it > wasn't able to find the nvml devel packages. If you look where you > installed Slurm, in lib/slurm you should have a gpu_nvml.so. Do you? > > On Wed, Apr 14, 2021 at 5:53 PM Cristóbal Navarro > wrote: > > > > Typing error,

Re: [slurm-users] AutoDetect=nvml throwing an error message

2021-04-14 Thread Cristóbal Navarro
Typing error, it should be --> **located at /usr/include/nvml.h** On Wed, Apr 14, 2021 at 5:47 PM Cristóbal Navarro <cristobal.navarr...@gmail.com> wrote: > Hi community, > I have set up the configuration files as mentioned in the documentation, > but slurmd on the GPU-compu

[slurm-users] AutoDetect=nvml throwing an error message

2021-04-14 Thread Cristóbal Navarro
Hi community, I have set up the configuration files as mentioned in the documentation, but slurmd on the GPU compute node fails with the following error shown in the log. After reading the Slurm documentation, it is not entirely clear to me how to properly set up GPU autodetection for the gres.
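
For context, NVML autodetection touches both configuration files, and, as the reply above notes, slurmd must have been built against the NVML development headers so that lib/slurm contains gpu_nvml.so. A minimal sketch (GPU count is illustrative):

    # gres.conf on the GPU node
    AutoDetect=nvml

    # slurm.conf
    GresTypes=gpu
    NodeName=nodeGPU01 Gres=gpu:A100:8 ...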

Re: [slurm-users] [EXT] [Beginner, SLURM 20.11.2] Unable to allocate resources when specifying gres in srun or sbatch

2021-04-13 Thread Cristóbal Navarro
by | Senior DevOps/HPC Engineer and HPC Team Lead > Research Computing Services | Business Services > The University of Melbourne, Victoria 3010 Australia > > > > On Mon, 12 Apr 2021 at 00:18, Cristóbal Navarro > <cristobal.navarr...@gmail.com> wrote: > >> * UoM notice:

Re: [slurm-users] [EXT] [Beginner, SLURM 20.11.2] Unable to allocate resources when specifying gres in srun or sbatch

2021-04-11 Thread Cristóbal Navarro
ad > Research Computing Services | Business Services > The University of Melbourne, Victoria 3010 Australia > > > > On Sun, 11 Apr 2021 at 15:26, Cristóbal Navarro < > cristobal.navarr...@gmail.com> wrote: > >> * UoM notice:

[slurm-users] [Beginner, SLURM 20.11.2] Unable to allocate resources when specifying gres in srun or sbatch

2021-04-10 Thread Cristóbal Navarro
Hi Community, These last two days I've been trying to understand the cause of the "Unable to allocate resources" error I keep getting when specifying --gres=... in an srun command (or sbatch). It fails with the error ➜ srun --gres=gpu:A100:1 nvidia-smi srun: error: Unable to allocate resou
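
The resolution posted later in this thread (2021-05-20, above) was ConstrainDevices=yes in cgroup.conf. Independently of that, a --gres=gpu:A100:1 request can only match a node that declares a typed gres of the same name; a sketch of such a declaration (count and device paths are illustrative):

    # slurm.conf
    GresTypes=gpu
    NodeName=nodeGPU01 Gres=gpu:A100:8 ...

    # gres.conf
    Name=gpu Type=A100 File=/dev/nvidia[0-7]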

Re: [slurm-users] Is it possible to define multiple partitions for the same node, but each one having a different subset of GPUs?

2021-03-31 Thread Cristóbal Navarro
s=tesla:1 to request > one P100 gpu. > > > > This is an example from https://slurm.schedmd.com/slurm.conf.html > > > > (e.g."Gres=gpu:tesla:1,gpu:kepler:1,bandwidth:lustre:no_consume:4G") > > > > *From:* slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] *On

[slurm-users] Is it possible to define multiple partitions for the same node, but each one having a different subset of GPUs?

2021-03-31 Thread Cristóbal Navarro
Hi Community, I was checking the documentation but could not find clear information on what I am trying to do. Here at the university we have a large compute node with 3 classes of GPUs. Let's say the node's hostname is "gpuComputer"; it is composed of: - 4x large GPUs - 4x medium GPUs (MIG devic
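
The reply above quotes the approach from the slurm.conf documentation: declare each GPU class as a typed gres on the node and let users select a class by type rather than by partition. A condensed sketch using type names taken from this node's description (counts are illustrative):

    # slurm.conf
    NodeName=gpuComputer Gres=gpu:large:4,gpu:medium:4 ...

    # request one large GPU
    srun --gres=gpu:large:1 ...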