Have you restarted munge on all hosts?
On 6/25/19 4:38 PM, Valerio Bellizzomi wrote:
On Tue, 2019-06-25 at 16:32 +0200, Valerio Bellizzomi wrote:
On Tue, 2019-06-25 at 08:48 -0400, Eli V wrote:
My first guess would be that the host is not listed as one of the two
controllers in the slurm.conf.
On Tue, 2019-06-25 at 16:32 +0200, Valerio Bellizzomi wrote:
> On Tue, 2019-06-25 at 08:48 -0400, Eli V wrote:
> > My first guess would be that the host is not listed as one of the two
> > controllers in the slurm.conf. Also, keep in mind munge, and thus
> > slurm is very sensitive to lack of clock synchronization between nodes.
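A quick way to check the munge/clock side of this (node001 is a placeholder
hostname; the pipe test is the one from the munge documentation):

  # compare clocks between the controller and a compute node
  date; ssh node001 date
  # verify a credential created here can be decoded there
  munge -n | ssh node001 unmunge
  # after fixing keys or clocks, restart munge everywhere, then slurmd/slurmctld
  sudo systemctl restart munge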
Thanks, Ahmet. The settings mention servername and servertype, so at first glance
it seemed that some relatively elaborate communication with a license server was
involved. I'll see whether some scripting around the options you mentioned gives a
suitable solution.
Greetings, Pim
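Those servername/servertype fields map onto the Server= and ServerType= attributes
of a license resource in slurmdbd; a minimal sketch, with invented names and counts:

  sacctmgr add resource name=myfeature count=100 type=license \
      server=rlm_host servertype=rlm
  sacctmgr show resource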
On Tue, 2019-06-25 at 08:48 -0400, Eli V wrote:
> My first guess would be that the host is not listed as one of the two
> controllers in the slurm.conf. Also, keep in mind munge, and thus
> slurm is very sensitive to lack of clock synchronization between
> nodes. FYI, I run a hand-built slurm 18.08.07 on debian 8 & 9 without issues.
Just FYI, I tried the shared state on NFS once, and it didn't work
well. Switched to native client glusterfs shared between the 2
controller nodes and haven't had a problem with it since.
On Tue, Jun 25, 2019 at 6:32 AM Buckley, Ronan wrote:
>
> Is there a way to diagnose if the I/O to the
> /cm/shared/apps/slurm/var/cm/statesave directory (used for job status) on the
> NFS storage is the cause of the socket errors?
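In slurm.conf that boils down to pointing StateSaveLocation at a mount both
controllers share; the volume and path names below are placeholders:

  # mounted on both controller nodes
  mount -t glusterfs ctl1:/slurmstate /gluster/slurmstate
  # slurm.conf (identical copy on both controllers)
  StateSaveLocation=/gluster/slurmstate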
Hi;
As far as I know, Slurm is not able to work (communicate) with the Reprise
License Manager or any other license manager. Slurm just sums the licenses used
according to the -L parameter of the jobs, and subtracts this sum from the total
license count given by "sacctmgr add/modify resource".
My first guess would be that the host is not listed as one of the two
controllers in the slurm.conf. Also, keep in mind munge, and thus
slurm is very sensitive to lack of clock synchronization between
nodes. FYI, I run a hand-built slurm 18.08.07 on debian 8 & 9 without
issues. Haven't tried 10 yet.
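For reference, the two-controller part of slurm.conf looks roughly like this on
18.08 (hostnames are placeholders; the older ControlMachine/BackupController pair
still works as well):

  # slurm.conf, same copy on every node
  SlurmctldHost=ctl1
  SlurmctldHost=ctl2    # second entry acts as the backup controller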
Hi all,
We use a piece of software that works with Reprise license management via
the cloud: http://www.reprisesoftware.com/products/cloud-licensing.php
According to https://slurm.schedmd.com/licenses.html Slurm is able to work
with the Reprise license manager, though the examples at schedmd seem to …
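The purely local variant from that page is just a static count in slurm.conf,
consumed via -L at submit time; the names and counts here are invented:

  # slurm.conf
  Licenses=myfeature:10
  # job submission
  sbatch -L myfeature:1 job.sh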
Is there a way to diagnose if the I/O to the
/cm/shared/apps/slurm/var/cm/statesave directory (Used for job status) on the
NFS storage is the cause of the socket errors?
What values/threshold from the nfsiostat command would signal the NFS storage
as the bottleneck?
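There is no single documented threshold, but nfsiostat's per-mount counters are
the usual place to look (interval and count below are arbitrary; point it at the
actual NFS mount point):

  # sample the share every 5 seconds, 12 times
  nfsiostat 5 12 /cm/shared

Sustained growth in the rpc bklog column, or large per-operation "avg RTT (ms)"
and "avg exe (ms)" values, would point at the NFS storage rather than slurmctld
itself.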
Hi,
I can reproduce the problem by submitting a job array of 700+.
The slurmctld log file is also regularly outputting:
[2019-06-25T11:35:31.159] sched: 157 pending RPCs at cycle end, consider
configuring max_rpc_cnt
[2019-06-25T11:35:43.007] sched: 193 pending RPCs at cycle end, consider
configuring max_rpc_cnt
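That log message refers to the max_rpc_cnt option of SchedulerParameters in
slurm.conf; the value below is only an example, not a recommendation:

  # slurm.conf
  SchedulerParameters=max_rpc_cnt=150
  # apply and verify
  scontrol reconfigure
  scontrol show config | grep -i SchedulerParameters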
Hi
It seems like the problem we discussed a few days ago:
https://lists.schedmd.com/pipermail/slurm-users/2019-June/003524.html
But in that thread I think we were using Slurm with workflow managers. It's
interesting that you have the problem after adding the second server and with the
NFS share. Do you …
Hi,
Since configuring a backup slurm controller (including moving the
StateSaveLocation from a local disk to a NFS share), we are seeing these errors
in the slurmctld logs on a regular basis:
Socket timed out on send/recv operation
It sometimes occurs when a job array is started and squeue will …
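With a backup controller, both slurmctld daemons have to see the same
StateSaveLocation (the path below is the one from this thread), and scontrol ping
reports whether the primary and backup are both responding:

  # slurm.conf
  StateSaveLocation=/cm/shared/apps/slurm/var/cm/statesave
  # status of primary and backup slurmctld
  scontrol ping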
12 matches