Re: [slurm-users] Problems with cgroupsv2

2022-08-16 Thread Alan Orth
Thanks for the advice. I checked munge's log on the system that was most recently affected and found a few hundred of these:
2022-08-16 23:30:56 +0300 Info: Unauthorized credential for client UID=0 GID=0
Not sure if that's relevant. NTP on the system is synced. I'll keep an eye on munge in the future…
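For anyone debugging similar credential errors, a quick sketch using the standard munge test utilities ("node01" is a placeholder hostname, not from the thread):
$ munge -n | unmunge              # local encode/decode round trip
$ munge -n | ssh node01 unmunge   # verifies the shared key and clock skew against another host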

Re: [slurm-users] Problems with cgroupsv2

2022-08-16 Thread Timony, Mick
When I see odd behaviour I've found it's sometimes related to either NTP issues (the time is off) or munge errors:
* Is NTP running and is the time accurate?
* Look for munge errors:
  * /var/log/munge/munged.log
  * sudo systemctl status munge
If it's a munge error, usually restarting…
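As commands, the checks above look roughly like this (a sketch; chronyd is an assumption, ntpd-based systems differ):
$ timedatectl status                   # look for "System clock synchronized: yes"
$ chronyc tracking                     # offset from the NTP sources
# systemctl status munge
# tail -n 50 /var/log/munge/munged.log
# systemctl restart munge              # the usual fix when munged.log shows errors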

Re: [slurm-users] Problems with cgroupsv2

2022-08-16 Thread Alan Orth
I re-installed SLURM 22.05.3 and then restarted slurmd, and now it's working:
# dnf reinstall slurm slurm-slurmd slurm-devel slurm-pam_slurm
# systemctl restart slurmd
The dnf.log shows that the versions were the same, so there was no mismatch or anything:
2022-08-16T23:29:02+0300 DEBUG Reinstall…
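To confirm that the packaged, installed, and running versions all agree, a short sketch (package names assume the same RPM layout as above):
# rpm -q slurm slurm-slurmd
# slurmd -V
# scontrol show config | grep -i SLURM_VERSION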

[slurm-users] help about srun and allocating message is not printed

2022-08-16 Thread Nandex
Hello, I have a problem with Slurm and I cannot find the solution. When I launch srun to allocate an interactive node:
srun --partition=default --nodes=1 --time=01:00:00 --pty bash -i
srun's message doesn't appear, but if I have a job launched on the node and try a…
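One way to see what the allocation is doing while the prompt sits silent, as a sketch (reusing the partition name from the message above, which may be site-specific):
$ srun -v --partition=default --nodes=1 --time=01:00:00 --pty bash -i   # -v prints the allocation steps
$ squeue --me --states=PD,R   # from another shell: is the job pending or running?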

[slurm-users] Problems with cgroupsv2

2022-08-16 Thread Alan Orth
Dear list, I've been using cgroupsv2 with SLURM 22.05 on CentOS Stream 8 successfully for a few months now. Recently a few of my nodes have started having problems starting slurmd. The log shows:
[2022-08-16T20:52:58.439] slurmd version 22.05.3 started
[2022-08-16T20:52:58.439] error: Controller…
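For comparison on an affected node, a sketch of the standard cgroup v2 checks (generic kernel paths, not specific to this thread):
$ stat -fc %T /sys/fs/cgroup                  # "cgroup2fs" means the unified hierarchy is mounted
$ cat /sys/fs/cgroup/cgroup.controllers       # controllers the kernel offers
$ cat /sys/fs/cgroup/cgroup.subtree_control   # controllers actually delegated to children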

Re: [slurm-users] SlurmdSpoolDir

2022-08-16 Thread Kamil Wilczek
Maybe this was a noob question; I've just solved my problem, so I'll share my thoughts. I returned to my original settings and reran Ansible's playbook, reconfiguring the SlurmdSpoolDir:
* https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdSpoolDir_1
Maybe it is writable by root, because root can…
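For reference, a minimal sketch of the setting in slurm.conf (the path shown is Slurm's documented default, not necessarily what this cluster uses):
SlurmdSpoolDir=/var/spool/slurmd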

[slurm-users] Intel MPI issue with slurm sbatch

2022-08-16 Thread Joe Teumer
Hello! Is there a way to turn off slurm MPI hooks? A job submitted via sbatch executes Intel MPI and the thread affinity settings are incorrect. However, running MPI manually over SSH works and all bindings are correct. We are looking to run our MPI jobs via slurm sbatch and have the same behavior…
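Not an authoritative fix, but a sketch of two Intel MPI knobs that steer it away from Slurm's integration (whether they resolve the affinity issue here is an assumption; my_app is a placeholder):
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
export I_MPI_HYDRA_BOOTSTRAP=ssh   # launch ranks over ssh, as in the working manual runs
unset I_MPI_PMI_LIBRARY            # don't hand process management to Slurm's PMI
mpirun -np $SLURM_NTASKS ./my_app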

[slurm-users] SlurmdSpoolDir

2022-08-16 Thread Kamil Wilczek
Dear Slurm Users, recently I started a new instance of my cluster with Slurm 22.05.2 (built from source). Everything seems to be configured properly and working fine except "sbatch". The error is quite self-explanatory and I thought it would be quite easy to fix directory permissions. slur…
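The usual permission checks for the spool directory, as a sketch (the path assumes the default SlurmdSpoolDir; substitute the configured one):
# scontrol show config | egrep -i 'SlurmdSpoolDir|SlurmdUser'
# ls -ld /var/spool/slurmd   # must exist and be writable by the SlurmdUser shown above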

[slurm-users] "Incompatible plugin version" after upgrade

2022-08-16 Thread Alan Orth
Dear list, Twice this month I've had jobs stuck in completing state (CG). When I go to the compute node and check slurmd.log I see a message about "incompatible plugin version", for example:
[2022-08-16T03:36:25.823] [748139.batch] done with job
[2022-08-16T12:54:21.404] [748139.extern] plugin_lo…
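A quick sketch for checking whether the running slurmd and the installed plugins disagree on version (standard commands; RPM names assume a packaged install):
$ slurmd -V                        # version of the running daemon's binary
# rpm -q slurm slurm-slurmd        # versions on disk
# ls /usr/lib64/slurm/ | head      # where slurmd loads its plugins from (path may differ per build)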