Re: [slurm-users] [EXT] error: Couldn't find the specified plugin name for cred/munge looking at all files

2024-01-23 Thread Jesse Aiton
Hi Sean, Thank you! It was a permissions issue and it’s not complaining anymore about cred/munge. I appreciate your help. Thanks, Jesse > On Jan 23, 2024, at 3:34 PM, Sean Crosby wrote: > > slurmctld runs as the user slurm, whereas slurmd runs as root. > > Make sure the permissions on /ap

Re: [slurm-users] error: Couldn't find the specified plugin name for cred/munge looking at all files

2024-01-23 Thread Ryan Novosielski
Ah, I see. No, it’s 24.08. That’s why I didn’t find any reference to it. Carry on! :-D -- #BlackLivesMatter | Ryan Novosielski - novos...@rutgers.edu | Rutgers, the State University | Sr. Technologist - 97

Re: [slurm-users] error: Couldn't find the specified plugin name for cred/munge looking at all files

2024-01-23 Thread Jesse Aiton
Yeah, 24.0.8 is the bleeding edge version. I wanted to try the latest in case it was a bug in 20.x.x. I’m happy to go back to any older Slurm version but I don’t think that will matter much if the issue occurs on both Slurm 20 and Slurm 24. git clone https://github.com/SchedMD/slurm.git Thank

Re: [slurm-users] error: Couldn't find the specified plugin name for cred/munge looking at all files

2024-01-23 Thread Ryan Novosielski
On Jan 23, 2024, at 18:14, Jesse Aiton wrote: This is on Ubuntu 20.04 and happens both with Slurm 20.11.09 and 24.0.8 Thank you, Jesse I’m not sure what version you’re actually running, but I don’t believe there is a 24.0.8. The latest version I’m aware of is 23.11.2. -- #BlackLivesMatter __

Re: [slurm-users] [EXT] error: Couldn't find the specified plugin name for cred/munge looking at all files

2024-01-23 Thread Sean Crosby
slurmctld runs as the user slurm, whereas slurmd runs as root. Make sure the permissions on /app/slurm-24.0.8/lib/slurm allow the user slurm to read the files e.g. you could do (as root) sudo -u slurm ls /app/slurm-24.0.8/lib/slurm and see if the slurm user can read the directory (as well as t
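
The check suggested above can be sketched as a small shell function. This is only an illustration, not something posted in the thread: `check_plugins` is a hypothetical helper name, it tests access for whichever user runs it, and it looks only at the directory itself and the `.so` files directly inside it.

```shell
#!/bin/sh
# Sketch (assumption, not from the thread): report plugin files in a
# directory that the *current* user cannot read. Run it as the slurm user
# to mimic what slurmctld sees. check_plugins is a hypothetical name.
check_plugins() {
    dir=$1
    # The directory itself must be readable and searchable.
    if [ ! -r "$dir" ] || [ ! -x "$dir" ]; then
        echo "cannot access directory: $dir"
        return 1
    fi
    status=0
    for f in "$dir"/*.so; do
        # Skip the literal pattern when no .so files exist.
        [ -e "$f" ] || continue
        if [ ! -r "$f" ]; then
            echo "unreadable: $f"
            status=1
        fi
    done
    return $status
}
```

Assuming the function is saved in a file such as `check_plugins.sh` (a made-up name), it could be run as the slurm user with `sudo -u slurm sh -c '. ./check_plugins.sh; check_plugins /app/slurm-24.0.8/lib/slurm'`. A non-zero exit status means the directory or at least one plugin file was not readable by that user.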

[slurm-users] error: Couldn't find the specified plugin name for cred/munge looking at all files

2024-01-23 Thread Jesse Aiton
Hello Slurm Folks, I have a weird issue where on the same server, which acts as both a controller and a node, slurmctld can’t find cred_munge.so slurmctld: debug3: Trying to load plugin /app/slurm-24.0.8/lib/slurm/cred_munge.so slurmctld: debug4: /app/slurm-24.0.8/lib/slurm/cred_munge.so: Does

[slurm-users] Slurm version 23.11.2 is now available

2024-01-23 Thread Tim McMullan
We are pleased to announce the availability of Slurm version 23.11.2. The 23.11.2 release includes a number of fixes to stability and various bug fixes. Some notable changes include several fixes to the new scontrol reconfigure method, including one that could result in jobs getting cancelled

Re: [slurm-users] GPU devices mapping with job's cgroup in cgroups v2 using eBPF

2024-01-23 Thread Charles Hedrick
To see the specific GPU allocated, I think this will do it: scontrol show job -d | grep -E "JobId=| GRES" From: slurm-users on behalf of Mahendra Paipuri Sent: Sunday, January 7, 2024 3:33 PM To: slurm-users@lists.schedmd.com Subject: [slurm-users] GPU devices

Re: [slurm-users] slurmstepd: error: load_ebpf_prog: BPF load error (No space left on device). Please check your system limits (MEMLOCK).

2024-01-23 Thread Charles Hedrick
See my comments on https://bugs.launchpad.net/bugs/2050098. There's a pretty simple fix in slurm. As far as I can tell, there's nothing wrong with the slurm code. But it's using an option that it doesn't actually need, and that seems to be causing trouble in the kernel.

Re: [slurm-users] Issues with Slurm 23.11.1

2024-01-23 Thread Brian Haymore
Do you have a firewall between the slurmd and the slurmctld daemons? If yes, do you know what kind of idle timeout that firewall has for expiring idle sessions? I ran into something somewhat similar but for me it was between the slurmctld and slurmdbd where a recent change they made had one di

Re: [slurm-users] Database cluster

2024-01-23 Thread Daniel L'Hommedieu
Xand, Thanks - that’s great to hear. I was thinking of using Anycast to achieve the same thing, but good to know that keepalived is a viable solution as well. Best, Daniel > On Jan 23, 2024, at 09:29, Xand Meaden wrote: > > Hi, > > We are using Percona XtraDB cluster to achieve HA for our S

Re: [slurm-users] Database cluster

2024-01-23 Thread Xand Meaden
Hi, We are using Percona XtraDB cluster to achieve HA for our Slurm databases. There is a single virtual IP that will be kept on one of the cluster's servers using keepalived. Regards, Xand From: slurm-users on behalf of Daniel L'Hommedieu Sent: 22 January 20

Re: [slurm-users] slurmstepd: error: load_ebpf_prog: BPF load error (No space left on device). Please check your system limits (MEMLOCK).

2024-01-23 Thread Tim Schneider
Hi, I have filed a bug report with SchedMD (https://bugs.schedmd.com/show_bug.cgi?id=18623), but the support told me they cannot invest time in this issue since I don't have a support contract. Maybe they will look into it once it affects more people or someone important enough. So far, I h

Re: [slurm-users] Database cluster

2024-01-23 Thread Daniel L'Hommedieu
Hi Diego. In our setup, the database is critical. We have some wrapper scripts that consult the database for information, and we also set environment variables on login, based on user/partition associations. If the database is down, none of those things work. I doubt there is appetite in the

[slurm-users] Issues with Slurm 23.11.1

2024-01-23 Thread Fokke Dijkstra
Dear all, Since the upgrade from Slurm 22.05 to 23.11.1 we are having problems with the communication between the slurmctld and slurmd processes. We are running a cluster with 183 nodes and almost 19000 cores. Unfortunately some nodes are in a different network preventing full internode communicat

Re: [slurm-users] Need help with running multiple instances/executions of a batch script in parallel (with NVIDIA HGX A100 GPU as a Gres)

2024-01-23 Thread Diego Zuccato
Also, remember to specify the memory used by the job if you treat it as a TRES, i.e. if you're using CR_*Memory to select resources. Diego On 18/01/2024 15:44, Ümit Seren wrote: This line also has to be changed: #SBATCH --gpus-per-node=4 #SBATCH --gpus-per-node=1 --gpus-per-node seems to be th

Re: [slurm-users] Database cluster

2024-01-23 Thread Diego Zuccato
IIUC the database is not "critical": if it goes down, you lose access to some statistics. But job data gets cached anyway and the db will be updated when it comes back online. Diego On 22/01/2024 18:23, Daniel L'Hommedieu wrote: Community: What do you do to ensure database reliability i