[slurm-users] Re: Updated one compute node to Ubuntu 24.04 LTS, now it does not receive jobs

2024-09-29 Thread Cristóbal Navarro via slurm-users
Sat, Sep 28, 2024, 2:13 PM Cristóbal Navarro <cristobal.navarr...@gmail.com> wrote: > Dear community, > I am having a strange issue and have been unable to find the cause. Last > week I did a full update on the cluster, which is composed of the master > node and two comput

[slurm-users] Updated one compute node to Ubuntu 24.04 LTS, now it does not receive jobs

2024-09-28 Thread Cristóbal Navarro via slurm-users
Dear community, I am having a strange issue and have been unable to find the cause. Last week I did a full update on the cluster, which is composed of the master node and two compute nodes (nodeGPU01 -> DGXA100 and nodeGPU02 -> custom GPU server). After the update, I got - master node ended up w
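
When a node stops accepting jobs after an OS upgrade, a first diagnostic step (a general sketch, not the specific resolution of this thread) is to check how the controller sees the node and, if it is down or drained with a stale reason, return it to service:

    # Show node states and the reason Slurm records for any down/drained node
    sinfo -N -l
    scontrol show node nodeGPU01

    # If the reason is stale (e.g. left over from the upgrade), resume the node
    scontrol update NodeName=nodeGPU01 State=RESUME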

Re: [slurm-users] slurmstepd: error: load_ebpf_prog: BPF load error (No space left on device). Please check your system limits (MEMLOCK).

2024-01-24 Thread Cristóbal Navarro
Stefan Fleischmann wrote: > > On Wed, 24 Jan 2024 12:37:04 -0300 Cristóbal Navarro > > wrote: > >> Many thanks. > >> One question: do we have to apply this patch (and recompile Slurm, I > >> guess) only on the compute node with problems? > >> Also, I notic

Re: [slurm-users] slurmstepd: error: load_ebpf_prog: BPF load error (No space left on device). Please check your system limits (MEMLOCK).

2024-01-24 Thread Cristóbal Navarro
Many thanks. One question: do we have to apply this patch (and recompile Slurm, I guess) only on the compute node with problems? Also, I noticed the patch now appears as "obsolete"; is that OK? On Wed, Jan 24, 2024 at 9:52 AM Stefan Fleischmann wrote: > Turns out I was wrong, this is not a problem

Re: [slurm-users] slurmstepd: error: load_ebpf_prog: BPF load error (No space left on device). Please check your system limits (MEMLOCK).

2024-01-22 Thread Cristóbal Navarro
Hi Tim and community, We started having the same issue (cgroups not working, it seems, with jobs seeing all GPUs) on a GPU compute node (DGX A100) a couple of days ago after a full update (apt upgrade). Now whenever we launch a job for that partition, we get the error message mentioned by Tim.
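
The error text itself points at the locked-memory limit. A commonly suggested workaround, assuming slurmd runs under systemd (the thread's eventual fix was the patch discussed in the replies above), is to raise MEMLOCK for the slurmd service via a drop-in:

    # systemctl edit slurmd   -- add the following override, then run:
    # systemctl daemon-reload && systemctl restart slurmd
    [Service]
    LimitMEMLOCK=infinity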

Re: [slurm-users] Problem with cgroup plugin in Ubuntu22.04 and slurm 21.08.5

2023-07-24 Thread Cristóbal Navarro
Hello Angel and Community, I am facing a similar problem with a DGX A100 running DGX OS 6 (based on Ubuntu 22.04 LTS) and Slurm 23.02. When I start the `slurmd` service, its status shows failed with the information below. As of today, what is the best solution to this problem? I am really not s
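
Ubuntu 22.04 boots with cgroup v2 by default, which older cgroup plugin configurations do not handle. A minimal cgroup.conf sketch for Slurm 23.02, which ships a cgroup/v2 plugin (parameter names are from the cgroup.conf man page; the thread does not confirm this resolves the failure above):

    # /etc/slurm/cgroup.conf
    CgroupPlugin=cgroup/v2
    ConstrainCores=yes
    ConstrainRAMSpace=yes
    ConstrainDevices=yes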

Re: [slurm-users] Jobs can grow in RAM usage surpassing MaxMemPerNode

2023-01-13 Thread Cristóbal Navarro
MSpace*, where processes in the step will > be killed, but the step will be left active, possibly with other processes > left running. > > > > On 12/01/2023 03:47:53, Cristóbal Navarro wrote: > > Hi Slurm community, > Recently we found a small problem triggered by one of our

[slurm-users] Jobs can grow in RAM usage surpassing MaxMemPerNode

2023-01-11 Thread Cristóbal Navarro
Hi Slurm community, Recently we found a small problem triggered by one of our jobs. We have a *MaxMemPerNode*=*532000* setting for our compute node in the slurm.conf file; however, we found that a job that started with mem=65536 was able, after hours of execution, to grow its memory usage durin
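
MaxMemPerNode caps what a job may request; enforcing the limit while the job runs is a separate mechanism. A hedged sketch of cgroup-based run-time enforcement (parameter names from the slurm.conf and cgroup.conf man pages; values are illustrative):

    # slurm.conf
    TaskPlugin=task/cgroup,task/affinity

    # cgroup.conf
    ConstrainRAMSpace=yes   # hold each job to its requested memory via cgroups
    AllowedRAMSpace=100     # percent of the request the job may actually use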

Re: [slurm-users] [EXT] [Beginner, SLURM 20.11.2] Unable to allocate resources when specifying gres in srun or sbatch

2021-05-20 Thread Cristóbal Navarro
Hi Community, just wanted to share that this problem got solved with the help of the pyxis developers (https://github.com/NVIDIA/pyxis/issues/47). The solution was to add ConstrainDevices=yes, which was missing in the cgroup.conf file. On Thu, May 13, 2021 at 5:14 PM Cristóbal Navarro <cristobal.nav
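
For reference, the fix described above amounts to one line in cgroup.conf (the comment is illustrative, not quoted from the thread):

    # /etc/slurm/cgroup.conf
    ConstrainDevices=yes   # jobs see only the GPUs they were allocated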

Re: [slurm-users] [EXT] [Beginner, SLURM 20.11.2] Unable to allocate resources when specifying gres in srun or sbatch

2021-05-13 Thread Cristóbal Navarro
emPerGPU=65556 MaxMemPerNode=532000 MaxTime=1-00:00:00 State=UP Nodes=nodeGPU01 Default=YES PartitionName=cpu OverSubscribe=No MaxCPUsPerNode=64 DefMemPerNode=16384 MaxMemPerNode=42 MaxTime=1-00:00:00 State=UP Nodes=nodeGPU01 On Tue, Apr 13, 2021 at 9:38 PM Cristóbal Navarro < cristobal.navarr...

Re: [slurm-users] One node, two partitions (gpu and cpu), can SLURM map cpu cores well?

2021-05-08 Thread Cristóbal Navarro
de the accurately > describes what you want to allow access to at all. > > Then you limit what is allowed to be requested in the partition definition > and/or a QOS (if you are using accounting). > > Brian Andrus > On 5/7/2021 8:11 PM, Cristóbal Navarro wrote: > >

[slurm-users] One node, two partitions (gpu and cpu), can SLURM map cpu cores well?

2021-05-07 Thread Cristóbal Navarro
Hi community, I am unable to tell whether SLURM handles the following situation efficiently in terms of CPU affinities in each partition. Here we have a very small cluster with just one GPU node with 8x GPUs, which offers two partitions --> "gpu" and "cpu". Part of the Config File ## Nodes list ## u
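
A later reply in this thread (2021-05-13, above) quotes the layout that was eventually used. A condensed sketch of the idea, with the cpu partition capped via MaxCPUsPerNode so the gpu partition keeps dedicated cores (the gpu line is an assumption; the cpu line follows the quoted fragment):

    # slurm.conf (sketch)
    PartitionName=gpu Nodes=nodeGPU01 Default=YES MaxTime=1-00:00:00 State=UP
    PartitionName=cpu Nodes=nodeGPU01 OverSubscribe=No MaxCPUsPerNode=64 MaxTime=1-00:00:00 State=UP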

Re: [slurm-users] What is an easy way to prevent users from running programs on the master/login node?

2021-04-27 Thread Cristóbal Navarro
se use srun or sbatch...". > > Patrick > > On 24/04/2021 at 10:03, Ole Holm Nielsen wrote: > >> On 24-04-2021 04:37, Cristóbal Navarro wrote: > >>> Hi Community, > >>> I have a set of users who are still not very familiar with Slurm, a

[slurm-users] What is an easy way to prevent users from running programs on the master/login node?

2021-04-23 Thread Cristóbal Navarro
Hi Community, I have a set of users who are still not very familiar with Slurm, and yesterday they bypassed srun/sbatch and ran their CPU program directly on the head/login node, thinking it would still run on the compute node. I am aware that I will need to teach them some basic usage, but in the meanwh
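
One commonly suggested stopgap, not necessarily the solution settled on in this thread, is to cap user sessions on the login node with a systemd slice drop-in so stray compute runs cannot monopolize it (path and values are illustrative):

    # /etc/systemd/system/user-.slice.d/10-limits.conf
    [Slice]
    CPUQuota=400%    # at most ~4 cores per user session
    MemoryMax=16G    # hard per-user memory cap

    # then: systemctl daemon-reload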

Re: [slurm-users] AutoDetect=nvml throwing an error message

2021-04-15 Thread Cristóbal Navarro
when you built the slurm source it > wasn't able to find the nvml devel packages. If you look where you > installed Slurm, in lib/slurm you should have a gpu_nvml.so. Do you? > > On Wed, Apr 14, 2021 at 5:53 PM Cristóbal Navarro > wrote: > > > > Typing error,

Re: [slurm-users] AutoDetect=nvml throwing an error message

2021-04-14 Thread Cristóbal Navarro
Typing error, it should be --> **located at /usr/include/nvml.h** On Wed, Apr 14, 2021 at 5:47 PM Cristóbal Navarro <cristobal.navarr...@gmail.com> wrote: > Hi community, > I have set up the configuration files as mentioned in the documentation, > but slurmd on the GPU-compu

[slurm-users] AutoDetect=nvml throwing an error message

2021-04-14 Thread Cristóbal Navarro
Hi community, I have set up the configuration files as mentioned in the documentation, but slurmd on the GPU compute node fails with the following error shown in the log. After reading the Slurm documentation, it is not entirely clear to me how to properly set up GPU autodetection for the gres.
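
For context, NVML autodetection touches both configuration files, and, as the reply above notes, slurmd must have been built against the NVML development headers so that lib/slurm contains gpu_nvml.so. A minimal sketch (GPU count is illustrative):

    # gres.conf on the GPU node
    AutoDetect=nvml

    # slurm.conf
    GresTypes=gpu
    NodeName=nodeGPU01 Gres=gpu:A100:8 ...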

Re: [slurm-users] [EXT] [Beginner, SLURM 20.11.2] Unable to allocate resources when specifying gres in srun or sbatch

2021-04-13 Thread Cristóbal Navarro
by | Senior DevOps/HPC Engineer and HPC Team Lead > Research Computing Services | Business Services > The University of Melbourne, Victoria 3010 Australia > > > > On Mon, 12 Apr 2021 at 00:18, Cristóbal Navarro > <cristobal.navarr...@gmail.com> wrote: > >> * UoM notice:

Re: [slurm-users] [EXT] [Beginner, SLURM 20.11.2] Unable to allocate resources when specifying gres in srun or sbatch

2021-04-11 Thread Cristóbal Navarro
ad > Research Computing Services | Business Services > The University of Melbourne, Victoria 3010 Australia > > > > On Sun, 11 Apr 2021 at 15:26, Cristóbal Navarro < > cristobal.navarr...@gmail.com> wrote: > >> * UoM notice:

[slurm-users] [Beginner, SLURM 20.11.2] Unable to allocate resources when specifying gres in srun or sbatch

2021-04-10 Thread Cristóbal Navarro
Hi Community, These last two days I've been trying to understand the cause of the "Unable to allocate resources" error I keep getting when specifying --gres=... in an srun command (or sbatch). It fails with the error ➜ srun --gres=gpu:A100:1 nvidia-smi srun: error: Unable to allocate resou
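
The resolution posted later in this thread (2021-05-20, above) was ConstrainDevices=yes in cgroup.conf. Independently of that, a --gres=gpu:A100:1 request can only match a node that declares a typed gres of the same name; a sketch of such a declaration (count and device paths are illustrative):

    # slurm.conf
    GresTypes=gpu
    NodeName=nodeGPU01 Gres=gpu:A100:8 ...

    # gres.conf
    Name=gpu Type=A100 File=/dev/nvidia[0-7]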

Re: [slurm-users] Is it possible to define multiple partitions for the same node, but each one having a different subset of GPUs?

2021-03-31 Thread Cristóbal Navarro
s=tesla:1 to request > one P100 gpu. > > > > This is an example from https://slurm.schedmd.com/slurm.conf.html > > > > (e.g."Gres=gpu:tesla:1,gpu:kepler:1,bandwidth:lustre:no_consume:4G") > > > > *From:* slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] *On

[slurm-users] Is it possible to define multiple partitions for the same node, but each one having a different subset of GPUs?

2021-03-31 Thread Cristóbal Navarro
Hi Community, I was checking the documentation but could not find clear information on what I am trying to do. Here at the university we have a large compute node with 3 classes of GPUs. Let's say the node's hostname is "gpuComputer"; it is composed of: - 4x large GPUs - 4x medium GPUs (MIG devic
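
The reply above quotes the approach from the slurm.conf documentation: declare each GPU class as a typed gres on the node and let users select a class by type rather than by partition. A condensed sketch using type names taken from this node's description (counts are illustrative):

    # slurm.conf
    NodeName=gpuComputer Gres=gpu:large:4,gpu:medium:4 ...

    # request one large GPU
    srun --gres=gpu:large:1 ...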