I run a cluster we bought from ACT and recently updated to ClusterVisor v1.0
The new version has (among many other things) a really nice view of individual
jobs' resource utilization (GPU, memory, CPU, temperature, etc.). I did not
pay attention to the overall statistics, so I am not sure how CV fares
th
Hello Angel and Community,
I am facing a similar problem with a DGX A100 with DGX OS 6 (Based on
Ubuntu 22.04 LTS) and Slurm 23.02.
When I start the `slurmd` service, its status shows as failed, with the
information below.
As of today, what is the best solution to this problem? I am really not
s
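As a generic first step in that situation (assuming the node uses systemd, which DGX OS 6 does), the service log usually shows the underlying cause, and running slurmd in the foreground prints the error directly:

$ systemctl status slurmd
$ journalctl -u slurmd -e
$ sudo slurmd -D -vvv    # run in the foreground with verbose logging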
Hello,
Matthew Brown writes:
> Minimum memory required per allocated CPU. ... Note that if the job's
> --mem-per-cpu value exceeds the configured MaxMemPerCPU, then the
> user's limit will be treated as a memory limit per task
Ah, thanks, I should've read the documentation more carefully.
I've set up a partition THING with AllowAccounts=stuff. I then use sacctmgr to
create the stuff account and a mystuff account whose parent is stuff. My
understanding is that this would make mystuff a subaccount of stuff.
The description for specifying allowaccount in a partition definition in
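For reference, a minimal sketch of that setup (names taken from the description above; the partition line is shortened):

$ sacctmgr add account stuff
$ sacctmgr add account mystuff parent=stuff

slurm.conf:
| PartitionName=THING AllowAccounts=stuff ...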
We are feeding job usage information into a Prometheus database for our users
(and us) to look at (via Grafana).
It is also possible to get a list of jobs that are underusing memory, GPU, or
whatever metric you feed into the database.
It’s a live feed with ~30s resolution from both compute jobs
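As an illustration only, with a hypothetical pair of gauges such as slurm_job_mem_used_bytes and slurm_job_mem_requested_bytes exported by the collector (actual metric names depend entirely on the exporter used), a Grafana panel could list under-using jobs with a PromQL expression along the lines of:

  slurm_job_mem_used_bytes / slurm_job_mem_requested_bytes < 0.5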
I use seff all the time as a first order approximation. It's a good hint at
what's going on with a job but doesn't give much detail.
We are in the process of integrating the Supremm node utilization capture
tool with our clusters and with our local XDMOD installation. Plain old
XDMOD can ingest th
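For those who have not used it, seff just takes a job ID (hypothetical here) and reports CPU and memory efficiency from the accounting database:

$ seff 123456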
Hi Aziz,
This seems like an MPI environment issue rather than a Slurm problem.
Make sure that MPI modules are loaded as well. You can see the list of
loaded modules via `module list`. This should tell you whether SU2's
dependencies are available in your runtime. If they are not loaded implicitly, you ne
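A rough sketch of that check (the module names are assumptions based on the versions mentioned in the post below; actual names depend on the site's module tree):

$ module list
$ module load gcc/8.5.0 openmpi/4.0.3 su2/7.5.1
$ ldd $(which SU2_CFD) | grep -i 'not found'   # any missing shared libraries point to an unloaded dependency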
Hi there all,
We're using Slurm 21.08 on a Red Hat 7.9 HPC cluster with OpenMPI 4.0.3 +
gcc 8.5.0.
When we run the commands below to call SU2, we get an error message:
$ srun -p defq --nodes=1 --ntasks-per-node=1 --time=01:00:00 --pty bash -i
$ module load su2/7.5.1
$ SU2_CFD config.cfg
*** An
Hello,
I am aware of 'seff', which allows you to check the efficiency of a single job,
which is good for users, but as a cluster administrator I would like to be able
to track the efficiency of all jobs from all users on the cluster, so I am able
to 're-educate' users that may be running jobs t
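One starting point (a sketch only; the time window and field list are just examples) is to pull the same accounting data seff uses, but for all users, via sacct and post-process it:

$ sacct -a -S now-7days -o JobID,User,AllocCPUS,Elapsed,TotalCPU,ReqMem,MaxRSS,State

Comparing TotalCPU against Elapsed times AllocCPUS, and MaxRSS against ReqMem, gives a rough per-job CPU and memory efficiency.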
Slurm will allocate more CPUs to cover the memory requirement. Use
sacct's query fields to compare Requested Resources vs. Allocated Resources:
$ scontrol show part normal_q | grep MaxMem
DefMemPerCPU=1920 MaxMemPerCPU=1920
$ srun -n 1 --mem-per-cpu=4000 --partition=normal_q --account=arcadm
h
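To verify what actually happened after the fact, something like this (job ID hypothetical) shows the requested memory per CPU next to the number of CPUs Slurm ended up allocating:

$ sacct -j 123456 -o JobID,ReqMem,ReqCPUS,AllocCPUS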
I'm not sure I can help with the rest, but the EnforcePartLimits setting will
only reject a job at submission time that exceeds partition limits, not
overall cluster limits. I don't see anything, offhand, in the interactive
partition definition that is exceeded by your request for 4 GB/CPU.
R
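For completeness, the knob in question, shown only as a sketch (valid values are NO, ANY, and ALL):

slurm.conf:
| EnforcePartLimits=ALL   # reject at submission time any job exceeding the limits of a requested partition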
Hello,
I'm trying to get Slurm to control the memory used per CPU, but it does
not seem to enforce the MaxMemPerCPU option in slurm.conf.
This is running in Ubuntu 22.04 (cgroups v2), Slurm 23.02.3.
Relevant configuration options:
cgroup.conf:
| AllowedRAMSpace=100
| ConstrainCores=yes
| Con
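In case it helps, a sketch of the pieces that usually have to line up before MaxMemPerCPU has any effect (values are placeholders, not a recommendation): memory must be a consumable resource in the select plugin, and the cgroup plugin must constrain RAM.

slurm.conf:
| SelectType=select/cons_tres
| SelectTypeParameters=CR_Core_Memory
| MaxMemPerCPU=1920
| TaskPlugin=task/cgroup,task/affinity

cgroup.conf:
| ConstrainCores=yes
| ConstrainRAMSpace=yes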
Hi Matthias,
Thank you for your info. The prolog/epilog way of managing it does look
quite promising.
Indeed, in my setup I only want one job per node per SSD set. Our tasks
that require the scratch space are more I/O bound; we are more worried
about the I/O usage than the actual disk space us
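A minimal sketch of that prolog/epilog pattern (paths and script locations are assumptions; slurmd runs these scripts as root on the compute node):

Prolog script (pointed to by Prolog= in slurm.conf):
| #!/bin/bash
| mkdir -p /local/$SLURM_JOB_ID
| chown "$SLURM_JOB_USER" /local/$SLURM_JOB_ID

Epilog script (pointed to by Epilog= in slurm.conf):
| #!/bin/bash
| rm -rf /local/$SLURM_JOB_ID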
On 2023-07-24 09:50, Matthias Loose wrote:
Hi Shunran,
I just read your question again. If you don't want users to share the SSD
at all, even if both have requested it, you can basically skip the
quota part of my answer.
If you really only want one user per SSD per node you should set the
Hi Shunran,
we do something very similar. I have nodes with 2 SSDs in a RAID 1
mounted on /local. We defined a gres resource just like you and called
it local. We define the resource in gres.conf like this:
# LOCAL
NodeName=hpc-node[01-10] Name=local
and add the resource in counts
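A sketch of how the remaining pieces typically fit together (the count and the per-GB unit here are assumptions, not our actual values):

gres.conf:
| NodeName=hpc-node[01-10] Name=local Count=800

slurm.conf:
| GresTypes=local
| NodeName=hpc-node[01-10] ... Gres=local:800

job script:
| #SBATCH --gres=local:100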