date:20240123

Re: [slurm-users] [EXT] error: Couldn't find the specified plugin name for cred/munge looking at all files

2024-01-23 Thread Jesse Aiton

Hi Sean, Thank you! It was a permissions issue and it’s not complaining anymore about cred/munge. I appreciate your help. Thanks, Jesse > On Jan 23, 2024, at 3:34 PM, Sean Crosby wrote: > > slurmctld runs as the user slurm, whereas slurmd runs as root. > > Make sure the permissions on /ap

Re: [slurm-users] error: Couldn't find the specified plugin name for cred/munge looking at all files

2024-01-23 Thread Ryan Novosielski

Ah, I see. No, it’s 24.08. That’s why I didn’t find any reference to it. Carry on! :-D -- #BlackLivesMatter | Rutgers, the State University | Ryan Novosielski - novos...@rutgers.edu | Sr. Technologist - 97

Re: [slurm-users] error: Couldn't find the specified plugin name for cred/munge looking at all files

2024-01-23 Thread Jesse Aiton

Yeah, 24.0.8 is the bleeding edge version. I wanted to try the latest in case it was a bug in 20.x.x. I’m happy to go back to any older Slurm version but I don’t think that will matter much if the issue occurs on both Slurm 20 and Slurm 24. git clone https://github.com/SchedMD/slurm.git Thank

Re: [slurm-users] error: Couldn't find the specified plugin name for cred/munge looking at all files

2024-01-23 Thread Ryan Novosielski

On Jan 23, 2024, at 18:14, Jesse Aiton wrote: This is on Ubuntu 20.04 and happens both with Slurm 20.11.09 and 24.0.8 Thank you, Jesse I’m not sure what version you’re actually running, but I don’t believe there is a 24.0.8. The latest version I’m aware of is 23.11.2. -- #BlackLivesMatter __

Re: [slurm-users] [EXT] error: Couldn't find the specified plugin name for cred/munge looking at all files

2024-01-23 Thread Sean Crosby

slurmctld runs as the user slurm, whereas slurmd runs as root. Make sure the permissions on /app/slurm-24.0.8/lib/slurm allow the user slurm to read the files e.g. you could do (as root) sudo -u slurm ls /app/slurm-24.0.8/lib/slurm and see if the slurm user can read the directory (as well as t
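The check Sean describes can be extended from the directory to every plugin file in it. What follows is only a minimal sketch: the function name unreadable_plugins is made up for illustration, and it tests readability for whichever user runs it, so it would need to be run as the slurm user (for example through sudo -u slurm) to say anything about slurmctld.

```shell
# Report every *.so file in a plugin directory that the current user cannot read.
# Returns 0 if everything is readable, 1 otherwise.
# unreadable_plugins is a hypothetical helper name, not part of Slurm.
unreadable_plugins() {
    dir=$1
    # The directory itself must be listable and searchable first.
    [ -d "$dir" ] && [ -r "$dir" ] && [ -x "$dir" ] || {
        echo "cannot enter or list: $dir"
        return 1
    }
    rc=0
    for f in "$dir"/*.so; do
        [ -e "$f" ] || continue   # glob matched nothing
        [ -r "$f" ] || { echo "unreadable: $f"; rc=1; }
    done
    return $rc
}
```

Saved to a file (say check_plugins.sh, also a made-up name), it could be used as: sudo -u slurm sh -c '. ./check_plugins.sh; unreadable_plugins /app/slurm-24.0.8/lib/slurm'. Parent directories matter too; namei -l on the plugin path shows the permissions of each path component.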

[slurm-users] error: Couldn't find the specified plugin name for cred/munge looking at all files

2024-01-23 Thread Jesse Aiton

Hello Slurm Folks, I have a weird issue where on the same server, which acts as both a controller and a node, slurmctld can’t find cred_munge.so slurmctld: debug3: Trying to load plugin /app/slurm-24.0.8/lib/slurm/cred_munge.so slurmctld: debug4: /app/slurm-24.0.8/lib/slurm/cred_munge.so: Does

[slurm-users] Slurm version 23.11.2 is now available

2024-01-23 Thread Tim McMullan

We are pleased to announce the availability of Slurm version 23.11.2. The 23.11.2 release includes a number of fixes to stability and various bug fixes. Some notable changes include several fixes to the new scontrol reconfigure method, including one that could result in jobs getting cancelled

Re: [slurm-users] GPU devices mapping with job's cgroup in cgroups v2 using eBPF

2024-01-23 Thread Charles Hedrick

To see the specific GPU allocated, I think this will do it: scontrol show job -d | grep -E "JobId=| GRES" From: slurm-users on behalf of Mahendra Paipuri Sent: Sunday, January 7, 2024 3:33 PM To: slurm-users@lists.schedmd.com Subject: [slurm-users] GPU devices
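As a small sketch of the same idea restricted to a single job: scontrol show job accepts a job ID after the -d flag, and the grep pattern is the one quoted above. The function names filter_gres and show_job_gres are hypothetical, chosen here only for illustration.

```shell
# Keep only the job header line and the per-node detail lines that carry a GRES field.
# Same pattern as in the message above.
filter_gres() {
    grep -E "JobId=| GRES"
}

# Show the GRES detail for one job; the job ID is passed as the first argument.
show_job_gres() {
    scontrol show job -d "$1" | filter_gres
}
```

Calling show_job_gres with a job ID from squeue should then print the JobId line followed by the detail lines that mention GRES, assuming the job has GPUs allocated.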

Re: [slurm-users] slurmstepd: error: load_ebpf_prog: BPF load error (No space left on device). Please check your system limits (MEMLOCK).

2024-01-23 Thread Charles Hedrick

See my comments on https://bugs.launchpad.net/bugs/2050098. There's a pretty simple fix in slurm. As far as I can tell, there's nothing wrong with the slurm code. But it's using an option that it doesn't actually need, and that seems to be causing trouble in the kernel. __

Re: [slurm-users] Issues with Slurm 23.11.1

2024-01-23 Thread Brian Haymore

Do you have a firewall between the slurmd and the slurmctld daemons? If yes, do you know what kind of idle timeout that firewall has for expiring idle sessions? I ran into something somewhat similar but for me it was between the slurmctld and slurmdbd where a recent change they made had one di

Re: [slurm-users] Database cluster

2024-01-23 Thread Daniel L'Hommedieu

Xand, Thanks - that’s great to hear. I was thinking of using Anycast to achieve the same thing, but good to know that keepalived is a viable solution as well. Best, Daniel > On Jan 23, 2024, at 09:29, Xand Meaden wrote: > > Hi, > > We are using Percona XtraDB cluster to achieve HA for our S

Re: [slurm-users] Database cluster

2024-01-23 Thread Xand Meaden

Hi, We are using Percona XtraDB cluster to achieve HA for our Slurm databases. There is a single virtual IP that will be kept on one of the cluster's servers using keepalived. Regards, Xand From: slurm-users on behalf of Daniel L'Hommedieu Sent: 22 January 20

Re: [slurm-users] slurmstepd: error: load_ebpf_prog: BPF load error (No space left on device). Please check your system limits (MEMLOCK).

2024-01-23 Thread Tim Schneider

Hi, I have filed a bug report with SchedMD (https://bugs.schedmd.com/show_bug.cgi?id=18623), but the support told me they cannot invest time in this issue since I don't have a support contract. Maybe they will look into it once it affects more people or someone important enough. So far, I h
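Since the error text itself points at MEMLOCK, one way to rule the limit in or out is to read the locked-memory limit of the running daemon from /proc. This is a sketch for Linux only; memlock_limit is a made-up helper name, and the assumption is that slurmstepd inherits its limits from the slurmd process that spawns it.

```shell
# Print the "Max locked memory" line for a given process ID (Linux /proc only).
# memlock_limit is a hypothetical helper name.
memlock_limit() {
    pid=$1
    grep 'Max locked memory' "/proc/$pid/limits"
}
```

For example, memlock_limit "$(pgrep -o slurmd)" shows the soft and hard limits of the oldest slurmd process. If both already read "unlimited", raising MEMLOCK is unlikely to help, which would be consistent with the other reply in this thread.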

Re: [slurm-users] Database cluster

2024-01-23 Thread Daniel L'Hommedieu

Hi Diego. In our setup, the database is critical. We have some wrapper scripts that consult the database for information, and we also set environment variables on login, based on user/partition associations. If the database is down, none of those things work. I doubt there is appetite in the

[slurm-users] Issues with Slurm 23.11.1

2024-01-23 Thread Fokke Dijkstra

Dear all, Since the upgrade from Slurm 22.05 to 23.11.1 we are having problems with the communication between the slurmctld and slurmd processes. We are running a cluster with 183 nodes and almost 19000 cores. Unfortunately some nodes are in a different network preventing full internode communicat

Re: [slurm-users] Need help with running multiple instances/executions of a batch script in parallel (with NVIDIA HGX A100 GPU as a Gres)

2024-01-23 Thread Diego Zuccato

Also, remember to specify the memory used by the job if you treat memory as a TRES (that is, if you're using CR_*Memory to select resources). Diego On 18/01/2024 15:44, Ümit Seren wrote: This line also has to be changed: #SBATCH --gpus-per-node=4 #SBATCH --gpus-per-node=1 --gpus-per-node seems to be th
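Putting the two remarks together, a batch script header might look like the sketch below. It assumes the intended change is from --gpus-per-node=4 to --gpus-per-node=1 so that several submissions can share one node; the memory value and the report_gpus function are placeholders invented for illustration, not taken from the thread.

```shell
#!/bin/bash
# One GPU per submission, so several copies of this script can run side by side.
#SBATCH --gpus-per-node=1
# Hypothetical memory request; with CR_*Memory an explicit value keeps one job
# from claiming all of the node's memory.
#SBATCH --mem=16G

# Report which GPU indices Slurm exposed to this job.
# report_gpus is a placeholder for the real workload.
report_gpus() {
    echo "Visible GPUs: ${CUDA_VISIBLE_DEVICES:-none}"
}

report_gpus
```

Submitting this script several times with sbatch should then let the scheduler place the jobs on the same node at once, as long as GPUs, cores, and memory remain available.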

Re: [slurm-users] Database cluster

2024-01-23 Thread Diego Zuccato

IIUC the database is not "critical": if it goes down, you lose access to some statistics. But job data gets cached anyway and the db will be updated when it comes back online. Diego On 22/01/2024 18:23, Daniel L'Hommedieu wrote: Community: What do you do to ensure database reliability i


17 matches
