Re: [slurm-users] Tracking efficiency of all jobs on the cluster (dashboard etc.)

2023-07-24 Thread Davide DelVento
I run a cluster we bought from ACT and recently updated to ClusterVisor v1.0. The new version has (among many things) a really nice view of individual jobs' resource utilization (GPUs, memory, CPU, temperature, etc.). I did not pay attention to the overall statistics, so I am not sure how CV fares th

Re: [slurm-users] Problem with cgroup plugin in Ubuntu22.04 and slurm 21.08.5

2023-07-24 Thread Cristóbal Navarro
Hello Angel and Community, I am facing a similar problem with a DGX A100 with DGX OS 6 (based on Ubuntu 22.04 LTS) and Slurm 23.02. When I start the `slurmd` service, its status shows failed with the information below. As of today, what is the best solution to this problem? I am really not s

Re: [slurm-users] MaxMemPerCPU not enforced?

2023-07-24 Thread Angel de Vicente
Hello, Matthew Brown writes: > Minimum memory required per allocated CPU. ... Note that if the job's > --mem-per-cpu value exceeds the configured MaxMemPerCPU, then the > user's limit will be treated as a memory limit per task Ah, thanks, I should've read the documentation more carefully.

[slurm-users] Partition not allowing subaccount use

2023-07-24 Thread Groner, Rob
I've set up a partition THING with AllowAccounts=stuff. I then use sacctmgr to create the stuff account and a mystuff account whose parent is stuff. My understanding is that this would make mystuff a subaccount of stuff. The description for specifying allowaccount in a partition definition in

Re: [slurm-users] Tracking efficiency of all jobs on the cluster (dashboard etc.)

2023-07-24 Thread Magnus Jonsson
We are feeding job usage information into a Prometheus database for our users (and us) to look at (via Grafana). It is also possible to get a list of jobs that are under-using memory, GPU, or whatever metric you feed into the database. It’s a live feed with ~30s resolution from both compute jobs
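Feeding per-job metrics into Prometheus as described usually means rendering samples in the Prometheus text exposition format. A minimal sketch of that rendering step; the metric and label names here are hypothetical, not taken from the poster's setup:

```python
# Sketch: emit one Slurm job sample in Prometheus text exposition
# format. Metric/label names are made up for illustration.
def job_metric(name: str, labels: dict, value: float) -> str:
    """Render one sample line in Prometheus exposition format."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}"

line = job_metric("slurm_job_mem_used_bytes",
                  {"jobid": "12345", "user": "alice"},
                  2.5e9)
print(line)
# slurm_job_mem_used_bytes{jobid="12345",user="alice"} 2500000000.0
```

In a real exporter these lines would be scraped by Prometheus (e.g. via the node_exporter textfile collector) rather than printed.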

Re: [slurm-users] Tracking efficiency of all jobs on the cluster (dashboard etc.)

2023-07-24 Thread Matthew Brown
I use seff all the time as a first-order approximation. It's a good hint at what's going on with a job but doesn't give much detail. We are in the process of integrating the Supremm node-utilization capture tool with our clusters and with our local XDMOD installation. Plain old XDMOD can ingest th

Re: [slurm-users] MPI_Init_thread error

2023-07-24 Thread Fatih Ertinaz
Hi Aziz, This seems like an MPI environment issue rather than a Slurm problem. Make sure that the MPI modules are loaded as well. You can see the list of loaded modules via `module list`. This should tell you whether SU2's dependencies are available in your runtime. If they are not loaded implicitly, you ne

[slurm-users] MPI_Init_thread error

2023-07-24 Thread Aziz Ogutlu
Hi there all, We're using Slurm 21.08 on a Red Hat 7.9 HPC cluster with OpenMPI 4.0.3 + gcc 8.5.0. When we run the commands below to call SU2, we get an error message:
$ srun -p defq --nodes=1 --ntasks-per-node=1 --time=01:00:00 --pty bash -i
$ module load su2/7.5.1
$ SU2_CFD config.cfg
*** An

[slurm-users] Tracking efficiency of all jobs on the cluster (dashboard etc.)

2023-07-24 Thread Will Furnell - STFC UKRI
Hello, I am aware of 'seff', which allows you to check the efficiency of a single job, which is good for users, but as a cluster administrator I would like to be able to track the efficiency of all jobs from all users on the cluster, so I am able to 're-educate' users that may be running jobs t
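The efficiency number seff reports for a single job is, roughly, used CPU time divided by allocated CPU time; computing it yourself from sacct fields is one way to scale this to all jobs. A minimal sketch of that calculation, with made-up sample values:

```python
# Sketch of the seff-style CPU efficiency figure, computed from
# sacct-style fields (TotalCPU, Elapsed, AllocCPUS). The numbers
# below are hypothetical, not from a real job.
def cpu_efficiency(total_cpu_sec: float, elapsed_sec: float,
                   alloc_cpus: int) -> float:
    """Fraction of allocated CPU time actually used."""
    if elapsed_sec == 0 or alloc_cpus == 0:
        return 0.0
    return total_cpu_sec / (elapsed_sec * alloc_cpus)

# Hypothetical job: 4 CPUs allocated for 1 hour, 2 CPU-hours used.
eff = cpu_efficiency(total_cpu_sec=7200, elapsed_sec=3600, alloc_cpus=4)
print(f"{eff:.0%}")  # 50%
```

Looped over `sacct -a -X` output for a time window, this gives a cluster-wide efficiency table without running seff per job.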

Re: [slurm-users] MaxMemPerCPU not enforced?

2023-07-24 Thread Matthew Brown
Slurm will allocate more CPUs to cover the memory requirement. Use sacct's query fields to compare Requested Resources vs. Allocated Resources:
$ scontrol show part normal_q | grep MaxMem
DefMemPerCPU=1920 MaxMemPerCPU=1920
$ srun -n 1 --mem-per-cpu=4000 --partition=normal_q --account=arcadm h
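Per the slurm.conf documentation quoted earlier in the thread, when --mem-per-cpu exceeds MaxMemPerCPU the request is treated as memory per task and enough CPUs are allocated to bring memory-per-CPU under the cap. A sketch of that top-up arithmetic, using the values from this example:

```python
import math

# Sketch of the CPU top-up Slurm applies when --mem-per-cpu exceeds
# MaxMemPerCPU: the request becomes a per-task memory limit, and the
# task gets enough CPUs that memory per CPU stays under the cap.
def cpus_for_mem(mem_per_task_mb: int, max_mem_per_cpu_mb: int) -> int:
    return math.ceil(mem_per_task_mb / max_mem_per_cpu_mb)

# Values from the message: --mem-per-cpu=4000 with MaxMemPerCPU=1920.
print(cpus_for_mem(4000, 1920))  # 3 CPUs allocated for the one task
```

So the job is not rejected; it simply lands with 3 CPUs instead of 1, which is why sacct shows requested vs. allocated resources diverging.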

Re: [slurm-users] MaxMemPerCPU not enforced?

2023-07-24 Thread Groner, Rob
I'm not sure I can help with the rest, but the EnforcePartLimits setting will only reject a job at submission time that exceeds partition limits, not overall cluster limits. I don't see anything, offhand, in the interactive partition definition that is exceeded by your request for 4 GB/CPU. R

[slurm-users] MaxMemPerCPU not enforced?

2023-07-24 Thread Angel de Vicente
Hello, I'm trying to get Slurm to control the memory used per CPU, but it does not seem to enforce the MaxMemPerCPU option in slurm.conf. This is running on Ubuntu 22.04 (cgroups v2), Slurm 23.02.3. Relevant configuration options (cgroup.conf): AllowedRAMSpace=100 | ConstrainCores=yes | Con

Re: [slurm-users] Custom Gres for SSD

2023-07-24 Thread Shunran Zhang
Hi Matthias, Thank you for your info. The prolog/epilog way of managing it does look quite promising. Indeed, in my setup I only want one job per node per SSD-set. Our tasks that require the scratch space are more I/O-bound - we are more worried about the I/O usage than the actual disk space us

Re: [slurm-users] Custom Gres for SSD

2023-07-24 Thread Matthias Loose
On 2023-07-24 09:50, Matthias Loose wrote: Hi Shunran, just read your question again. If you don't want users to share the SSD at all, even if both have requested it, you can basically skip the quota part of my answer. If you really only want one user per SSD per node you should set the

Re: [slurm-users] Custom Gres for SSD

2023-07-24 Thread Matthias Loose
Hi Shunran, we do something very similar. I have nodes with 2 SSDs in a RAID1 mounted on /local. We defined a gres resource just like you and called it local. We define the resource in gres.conf like this: # LOCAL NodeName=hpc-node[01-10] Name=local and add the resource in counts
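A minimal sketch of how such a gres definition fits together, using the node names from the message; the Count value and the slurm.conf lines are illustrative assumptions, not taken from the original setup:

```
# gres.conf -- one "local" scratch resource per node
# (Count=1 is an assumption; the original message is truncated)
NodeName=hpc-node[01-10] Name=local Count=1

# slurm.conf -- the gres must also be declared cluster-wide
# and attached to the nodes (illustrative, not from the thread)
GresTypes=local
NodeName=hpc-node[01-10] Gres=local:1
```

A job would then request the resource with something like `srun --gres=local:1 ...`, and with Count=1 per node only one such job can land on a node at a time.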