Sat, Sep 28, 2024, 2:13 PM Cristóbal Navarro <
cristobal.navarr...@gmail.com> wrote:
> Dear community,
> I am having a strange issue whose cause I have been unable to find. Last
> week I did a full update on the cluster, which is composed of the master
> node and two comput…
Dear community,
I am having a strange issue whose cause I have been unable to find. Last week
I did a full update on the cluster, which is composed of the master node
and two compute nodes (nodeGPU01 -> DGXA100 and nodeGPU02 -> custom GPU
server). After the update, I got:
- master node ended up w…
Stefan Fleischmann wrote:
> > On Wed, 24 Jan 2024 12:37:04 -0300 Cristóbal Navarro
> > wrote:
> >> Many thanks.
> >> One question: do we have to apply this patch (and recompile Slurm, I
> >> guess) only on the compute node with problems?
> >> Also, I notic…
Many thanks.
One question: do we have to apply this patch (and recompile Slurm, I guess)
only on the compute node with problems?
Also, I noticed the patch now appears as "obsolete"; is that OK?
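(Not an answer from the thread, just for context: applying a patch and
rebuilding Slurm from a source tree usually looks like the sketch below; the
patch file name, version and paths are placeholders. The rebuilt binaries then
only need to be installed on the node(s) where the patched code actually runs.)

  cd slurm-23.02.x                        # unpacked Slurm source tree (version is a placeholder)
  patch -p1 < fix-from-bug-report.patch   # hypothetical patch file
  ./configure --prefix=/usr/local --sysconfdir=/etc/slurm
  make -j"$(nproc)" && sudo make install
  sudo systemctl restart slurmd           # on the affected compute node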
On Wed, Jan 24, 2024 at 9:52 AM Stefan Fleischmann wrote:
> Turns out I was wrong, this is not a problem
Hi Tim and community,
We started having the same issue (cgroups not working, it seems; jobs are
shown all GPUs) on a GPU compute node (DGX A100) a couple of days ago after
a full update (apt upgrade). Now whenever we launch a job on that partition,
we get the error message mentioned by Tim.
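(A quick check that is not in the original message, assuming a partition
named "gpu": if device constraints are working, a one-GPU job should only
see one GPU.)

  srun -p gpu --gres=gpu:1 bash -c 'echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"; nvidia-smi -L'
  # with ConstrainDevices=yes in cgroup.conf the job should list a single GPU;
  # seeing all eight A100s means the device cgroup is not being enforced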
Hello Angel and Community,
I am facing a similar problem on a DGX A100 with DGX OS 6 (based on
Ubuntu 22.04 LTS) and Slurm 23.02.
When I start the `slurmd` service, its status shows failed with the
information below.
As of today, what is the best solution to this problem? I am really not
s…
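(Not from the original post, but on a systemd-based install such as DGX OS the
failure details can usually be pulled with the commands below.)

  systemctl status slurmd      # unit state and last few log lines
  journalctl -u slurmd -b      # full slurmd log since boot
  slurmd -D -vvv               # run slurmd in the foreground with verbose logging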
> …MSpace*, where processes in the step will
> be killed, but the step will be left active, possibly with other processes
> left running.
>
> On 12/01/2023 03:47:53, Cristóbal Navarro wrote:
>
> Hi Slurm community,
> Recently we found a small problem triggered by one of our…
Hi Slurm community,
Recently we found a small problem triggered by one of our jobs. We have a
*MaxMemPerNode*=*532000* setting for our compute node in the slurm.conf file;
however, we found that a job that started with mem=65536 was able, after
hours of execution, to grow its memory usage durin…
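(Not part of the original report, but for context: MaxMemPerNode only limits
what can be requested; enforcing the memory a running job actually uses is
normally done with the cgroup plugins. A minimal sketch, assuming cgroup-based
containment is wanted:)

  # slurm.conf
  ProctrackType=proctrack/cgroup
  TaskPlugin=task/cgroup,task/affinity

  # cgroup.conf
  ConstrainRAMSpace=yes     # cap a job's real memory at its request
  ConstrainSwapSpace=yes    # cap swap as well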
Hi Community,
Just wanted to share that this problem was solved with the help of the pyxis
developers:
https://github.com/NVIDIA/pyxis/issues/47
The solution was to add
ConstrainDevices=yes
which was missing from the cgroup.conf file.
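(For anyone finding this later, a minimal cgroup.conf sketch with that setting
in context; the other two lines are common companions, not taken from this
thread.)

  # cgroup.conf
  ConstrainDevices=yes    # jobs only see the GRES devices they were allocated
  ConstrainCores=yes
  ConstrainRAMSpace=yes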
On Thu, May 13, 2021 at 5:14 PM Cristóbal Navarro <
cristobal.nav
…emPerGPU=65556 MaxMemPerNode=532000 MaxTime=1-00:00:00
State=UP Nodes=nodeGPU01 Default=YES
PartitionName=cpu OverSubscribe=No MaxCPUsPerNode=64 DefMemPerNode=16384
MaxMemPerNode=42 MaxTime=1-00:00:00 State=UP Nodes=nodeGPU01
On Tue, Apr 13, 2021 at 9:38 PM Cristóbal Navarro <
cristobal.navarr...
> …de the accurately
> describes what you want to allow access to at all.
>
> Then you limit what is allowed to be requested in the partition definition
> and/or a QOS (if you are using accounting).
>
> Brian Andrus
> On 5/7/2021 8:11 PM, Cristóbal Navarro wrote:
>
Hi community,
I am unable to tell whether Slurm is handling the following situation
efficiently in terms of CPU affinities in each partition.
Here we have a very small cluster with just one GPU node with 8x GPUs, which
offers two partitions --> "gpu" and "cpu".
Part of the config file:
## Nodes list
## u…
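(Not from the original message; one way to see the CPU binding Slurm actually
applies in each partition, using the partition names above.)

  srun -p gpu --gres=gpu:1 --cpu-bind=verbose true   # prints the CPU mask used for the task
  srun -p cpu -c 4 --cpu-bind=verbose true
  srun -p cpu -c 4 bash -c 'taskset -cp $$'          # affinity as seen from inside the step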
> > …se use srun or sbatch...".
> >
> > Patrick
> >
> > On 24/04/2021 at 10:03, Ole Holm Nielsen wrote:
> >> On 24-04-2021 04:37, Cristóbal Navarro wrote:
> >>> Hi Community,
> >>> I have a set of users still not so familiar with slurm, a…
Hi Community,
I have a set of users who are still not so familiar with Slurm, and yesterday
they bypassed srun/sbatch and just ran their CPU program directly on the
head/login node, thinking it would still run on the compute node. I am aware
that I will need to teach them some basic usage, but in the meanwh…
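(Not part of the question, but a common stopgap while users learn the workflow
is to cap what interactive sessions may consume on the login node itself. A
sketch using a systemd drop-in for the per-user slice, assuming a systemd-based
login node; the limit values are placeholders.)

  # /etc/systemd/system/user-.slice.d/10-limits.conf
  [Slice]
  CPUQuota=400%     # at most ~4 cores per user session
  MemoryMax=16G     # hard memory cap per user session

  # then: systemctl daemon-reload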
> …when you built the Slurm source it
> wasn't able to find the NVML devel packages. If you look in where you
> installed Slurm, in lib/slurm you should have a gpu_nvml.so. Do you?
>
> On Wed, Apr 14, 2021 at 5:53 PM Cristóbal Navarro
> wrote:
> >
> > Typing error,
Typing error, it should be --> **located at /usr/include/nvml.h**
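(A quick way to verify the point above; /usr/local is only an assumed install
prefix, adjust as needed.)

  ls /usr/local/lib/slurm/gpu_nvml.so            # present only if Slurm was built against NVML
  ldd /usr/local/lib/slurm/gpu_nvml.so | grep -i nvidia-ml
  # if it is missing, install the NVML headers (nvml.h, from the CUDA/driver devel
  # packages) and rebuild Slurm, checking config.log to confirm NVML was detected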
On Wed, Apr 14, 2021 at 5:47 PM Cristóbal Navarro <
cristobal.navarr...@gmail.com> wrote:
> Hi community,
> I have set up the configuration files as mentioned in the documentation,
> but the slurmd of the GPU-compu
Hi community,
I have set up the configuration files as mentioned in the documentation,
but slurmd on the GPU compute node fails with the following error shown
in the log.
After reading the Slurm documentation, it is not entirely clear to me how
to properly set up GPU autodetection for GRES.
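(For reference, a rough sketch of the usual NVML-based autodetection setup,
assuming Slurm was built with NVML support; the node name and GPU type/count
are illustrative.)

  # slurm.conf
  GresTypes=gpu
  NodeName=nodeGPU01 Gres=gpu:A100:8 ...   # other node options elided

  # gres.conf on the GPU node
  AutoDetect=nvml    # slurmd queries NVML for device files, types and CPU affinity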
Hi Community,
For the last two days I've been trying to understand the cause of the
"Unable to allocate resources" error I keep getting when specifying
--gres=... in an srun command (or sbatch). It fails with the error:
➜ srun --gres=gpu:A100:1 nvidia-smi
srun: error: Unable to allocate resou…
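(Not from the thread, but for a typed request like gpu:A100:1 to be
allocatable, the same type name has to appear in the node's Gres definition
and in gres.conf. A sketch with placeholder counts and device files.)

  # slurm.conf
  GresTypes=gpu
  NodeName=nodeGPU01 Gres=gpu:A100:8 ...   # other node options elided

  # gres.conf
  NodeName=nodeGPU01 Name=gpu Type=A100 File=/dev/nvidia[0-7]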
> …s=tesla:1 to request
> one P100 GPU.
>
> This is an example from https://slurm.schedmd.com/slurm.conf.html
>
> (e.g. "Gres=gpu:tesla:1,gpu:kepler:1,bandwidth:lustre:no_consume:4G")
>
> *From:* slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] *On…
Hi Community,
I was checking the documentation but could not find clear information on
what I am trying to do.
Here at the university we have a large compute node with 3 classes of GPUs.
Let's say the node's hostname is "gpuComputer"; it is composed of:
- 4x large GPUs
- 4x medium GPUs (MIG devic…
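(The message is cut off here, but as a general reference, a node with several
GPU types is usually declared along these lines; all names and counts are
illustrative, and MIG instances can either be enumerated explicitly in
gres.conf or discovered via NVML.)

  # slurm.conf
  GresTypes=gpu
  NodeName=gpuComputer Gres=gpu:large:4,gpu:medium:4 ...   # plus the third class

  # gres.conf on gpuComputer
  AutoDetect=nvml    # discovers full GPUs and MIG instances; explicit Type=/File=
                     # lines are the manual alternative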