Hi Sean,
Thank you! It was a permissions issue and it’s not complaining anymore about
cred/munge.
I appreciate your help.
Thanks,
Jesse
> On Jan 23, 2024, at 3:34 PM, Sean Crosby wrote:
>
> slurmctld runs as the user slurm, whereas slurmd runs as root.
>
> Make sure the permissions on /ap
Ah, I see — no, it’s 24.08. That’s why I didn’t find any reference to it.
Carry on! :-D
--
#BlackLivesMatter
|| \\UTGERS, |---*O*---
||_// the State | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 97
Yeah, 24.0.8 is the bleeding edge version. I wanted to try the latest in case
it was a bug in 20.x.x. I’m happy to go back to any older Slurm version but I
don’t think that will matter much if the issue occurs on both Slurm 20 and
Slurm 24.
git clone https://github.com/SchedMD/slurm.git
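For what it's worth, the version an installed build actually reports can be double-checked with:
scontrol --version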
Thank
On Jan 23, 2024, at 18:14, Jesse Aiton wrote:
This is on Ubuntu 20.04 and happens both with Slurm 20.11.09 and 24.0.8
Thank you,
Jesse
I’m not sure what version you’re actually running, but I don’t believe there is
a 24.0.8. The latest version I’m aware of is 23.11.2.
--
#BlackLivesMatter
slurmctld runs as the user slurm, whereas slurmd runs as root.
Make sure the permissions on /app/slurm-24.0.8/lib/slurm allow the user slurm
to read the files
e.g. you could do (as root)
sudo -u slurm ls /app/slurm-24.0.8/lib/slurm
and see if the slurm user can read the directory (as well as t
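If it isn't obvious which directory in the path is the problem, something along these lines (the path is the one from this thread, the tool choice is just a suggestion) will list the owner and mode of every component, so a missing read or execute bit further up is easy to spot:
namei -l /app/slurm-24.0.8/lib/slurm/cred_munge.so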
Hello Slurm Folks,
I have a weird issue where on the same server, which acts as both a controller
and a node, slurmctld can’t find cred_munge.so
slurmctld: debug3: Trying to load plugin
/app/slurm-24.0.8/lib/slurm/cred_munge.so
slurmctld: debug4: /app/slurm-24.0.8/lib/slurm/cred_munge.so: Does
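For reference, that plugin corresponds to the cred/munge credential type, and slurmctld looks for it under PluginDir; assuming the config lives at /etc/slurm/slurm.conf, the relevant settings can be checked with:
grep -Ei 'CredType|PluginDir' /etc/slurm/slurm.conf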
We are pleased to announce the availability of Slurm version 23.11.2.
The 23.11.2 release includes a number of fixes to stability and various
bug fixes. Some notable changes include several fixes to the new
scontrol reconfigure method, including one that could result in jobs
getting cancelled
To see the specific GPU allocated, I think this will do it:
scontrol show job -d | grep -E "JobId=| GRES"
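Inside the job itself the allocated devices usually also show up in environment variables (which ones are set depends on your gres setup), e.g. in the job script:
echo "$SLURM_JOB_GPUS" "$CUDA_VISIBLE_DEVICES"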
From: slurm-users on behalf of Mahendra Paipuri
Sent: Sunday, January 7, 2024 3:33 PM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] GPU devices
See my comments on https://bugs.launchpad.net/bugs/2050098. There's a pretty
simple fix in slurm.
As far as I can tell, there's nothing wrong with the slurm code. But it's using
an option that it doesn't actually need, and that seems to be causing trouble
in the kernel.
Do you have a firewall between the slurmd and the slurmctld daemons? If yes,
do you know what kind of idle timeout that firewall has for expiring idle
sessions? I ran into something somewhat similar but for me it was between the
slurmctld and slurmdbd where a recent change they made had one di
Xand,
Thanks - that’s great to hear. I was thinking of using Anycast to achieve the
same thing, but good to know that keepalived is a viable solution as well.
Best,
Daniel
> On Jan 23, 2024, at 09:29, Xand Meaden wrote:
>
> Hi,
>
> We are using Percona XtraDB cluster to achieve HA for our S
Hi,
We are using Percona XtraDB cluster to achieve HA for our Slurm databases.
There is a single virtual IP that will be kept on one of the cluster's servers
using keepalived.
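Roughly, the keepalived side is just a VRRP instance carrying that address; a minimal sketch (the interface, password and IP below are placeholders, not our real values):
vrrp_instance slurmdb_vip {
    state BACKUP              # priorities decide which node holds the VIP
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass changeme
    }
    virtual_ipaddress {
        192.0.2.10/24         # the single virtual IP clients and slurmdbd point at
    }
}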
Regards,
Xand
From: slurm-users on behalf of Daniel L'Hommedieu
Sent: 22 January 20
Hi,
I have filed a bug report with SchedMD
(https://bugs.schedmd.com/show_bug.cgi?id=18623), but the support told
me they cannot invest time in this issue since I don't have a support
contract. Maybe they will look into it once it affects more people or
someone important enough.
So far, I h
Hi Diego.
In our setup, the database is critical. We have some wrapper scripts that
consult the database for information, and we also set environment variables on
login, based on user/partition associations. If the database is down, none of
those things work.
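As a rough illustration only (not our actual wrapper), the login-time lookup boils down to an association query like:
sacctmgr -nP show assoc where user="$USER" format=Account,Partition,QOS
If slurmdbd is unreachable, that query fails, and everything built on top of it breaks with it.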
I doubt there is appetite in the
Dear all,
Since the upgrade from Slurm 22.05 to 23.11.1 we are having problems with
the communication between the slurmctld and slurmd processes.
We are running a cluster with 183 nodes and almost 19000 cores.
Unfortunately some nodes are in a different network preventing full
internode communicat
Also, remember to specify the memory used by the job, since memory is treated
as a TRES when you use one of the CR_*Memory options to select resources.
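For example (the value is only illustrative), the batch script would then carry an explicit request such as:
#SBATCH --mem=16G        # or --mem-per-cpu / --mem-per-gpu, as appropriate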
Diego
On 18/01/2024 15:44, Ümit Seren wrote:
This line also has to be changed:
#SBATCH --gpus-per-node=4
#SBATCH --gpus-per-node=1
--gpus-per-node seems to be th
IIUC the database is not "critical": if it goes down, you lose access to
some statistics. But job data gets cached anyway and the db will be
updated when it comes back online.
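As an aside, the backlog that accumulates while the DBD is down can be watched on the controller, e.g.:
sdiag | grep -i 'dbd agent'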
Diego
On 22/01/2024 18:23, Daniel L'Hommedieu wrote:
Community:
What do you do to ensure database reliability i