Re: [slurm-users] What means this error ?

2019-06-25 Thread Marcus Wagner
Have you restarted munge on all hosts? On 6/25/19 4:38 PM, Valerio Bellizzomi wrote: On Tue, 2019-06-25 at 16:32 +0200, Valerio Bellizzomi wrote: On Tue, 2019-06-25 at 08:48 -0400, Eli V wrote: My first guess would be that the host is not listed as one of the two controllers in the slurm.conf.

Re: [slurm-users] What means this error ?

2019-06-25 Thread Valerio Bellizzomi
On Tue, 2019-06-25 at 16:32 +0200, Valerio Bellizzomi wrote: > On Tue, 2019-06-25 at 08:48 -0400, Eli V wrote: > > My first guess would be that the host is not listed as one of the two > > controllers in the slurm.conf. Also, keep in mind munge, and thus > > slurm is very sensitive to lack of clock

Re: [slurm-users] slurm configuration for reprise license management - cloud version

2019-06-25 Thread pim schravendijk
Thanks, Ahmet. The settings mention servername and servertype, so at first glance it seemed that some relatively elaborate communication with a license server was involved. I'll see whether, with some scripting around the options you mentioned, I can find a suitable solution. Greetings, Pim On Tue

Re: [slurm-users] What means this error ?

2019-06-25 Thread Valerio Bellizzomi
On Tue, 2019-06-25 at 08:48 -0400, Eli V wrote: > My first guess would be that the host is not listed as one of the two > controllers in the slurm.conf. Also, keep in mind munge, and thus > slurm is very sensitive to lack of clock synchronization between > nodes. FYI, I run a hand built slurm 18.08

Re: [slurm-users] Slurm: Socket timed out on send/recv operation - slurm 17.02.2

2019-06-25 Thread Eli V
Just FYI, I tried the shared state on NFS once, and it didn't work well. Switched to native client glusterfs shared between the 2 controller nodes and haven't had a problem with it since. On Tue, Jun 25, 2019 at 6:32 AM Buckley, Ronan wrote: > > Is there a way to diagnose if the I/O to the > /cm

Re: [slurm-users] slurm configuration for reprise license management - cloud version

2019-06-25 Thread mercan
Hi; As far as I know, Slurm is not able to communicate with the Reprise license manager or any other license manager. Slurm just sums the used licenses according to the -L parameter of the jobs, and subtracts this sum from the total license count, which is given by using "sacctmgr add/modi
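A minimal sketch of the bookkeeping described in this message, for illustration only. It is not Slurm's code: the function names and the example numbers are hypothetical, and it assumes nothing beyond "sum the licenses requested by running jobs, subtract from the configured total".

```python
# Illustration only: hypothetical helpers mirroring the arithmetic described
# in the message above. This is NOT Slurm source code or a Slurm API.

def licenses_available(total, running_job_requests):
    """Configured total minus the sum of the -L counts of running jobs."""
    used = sum(running_job_requests)
    return total - used


def can_start(total, running_job_requests, requested):
    """A job can start only if its own request fits in what is left."""
    return requested <= licenses_available(total, running_job_requests)


if __name__ == "__main__":
    # Hypothetical numbers: 10 licenses configured, two running jobs
    # that requested 2 and 3 licenses respectively.
    print(licenses_available(10, [2, 3]))
    print(can_start(10, [2, 3], 6))
```

Note that the count is purely local to the scheduler: a license checked out outside of Slurm would not be seen by this arithmetic, which is why scripting around it may be needed.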

Re: [slurm-users] What means this error ?

2019-06-25 Thread Eli V
My first guess would be that the host is not listed as one of the two controllers in the slurm.conf. Also, keep in mind that munge, and thus slurm, is very sensitive to a lack of clock synchronization between nodes. FYI, I run a hand-built slurm 18.08.07 on debian 8 & 9 without issues. Haven't tried 10 yet

[slurm-users] slurm configuration for reprise license management - cloud version

2019-06-25 Thread pim schravendijk
Hi all, We use a piece of software that works with reprise license management via the cloud http://www.reprisesoftware.com./products/cloud-licensing.php According to https://slurm.schedmd.com/licenses.html slurm is able to work with reprise license manager, though the examples at schedmd seem to

Re: [slurm-users] Slurm: Socket timed out on send/recv operation - slurm 17.02.2

2019-06-25 Thread Buckley, Ronan
Is there a way to diagnose if the I/O to the /cm/shared/apps/slurm/var/cm/statesave directory (Used for job status) on the NFS storage is the cause of the socket errors? What values/threshold from the nfsiostat command would signal the NFS storage as the bottleneck? From: Buckley, Ronan Sent: T

Re: [slurm-users] Slurm: Socket timed out on send/recv operation - slurm 17.02.2

2019-06-25 Thread Buckley, Ronan
Hi, I can reproduce the problem by submitting a job array of 700+. The slurmctld log file is also regularly outputting: [2019-06-25T11:35:31.159] sched: 157 pending RPCs at cycle end, consider configuring max_rpc_cnt [2019-06-25T11:35:43.007] sched: 193 pending RPCs at cycle end, consider confi
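A small sketch for pulling the pending-RPC counts out of slurmctld log lines like the ones quoted in this message, so they can be compared against the times the socket errors appear. The regular expression is an assumption based only on the two lines shown here, and the function name is hypothetical; the second sample line is kept truncated exactly as it appears above.

```python
import re

# Assumed line shape, inferred only from the two log lines quoted above:
#   [<timestamp>] sched: <N> pending RPCs at cycle end, ...
PENDING_RPC = re.compile(
    r"^\[(?P<ts>[^\]]+)\] sched: (?P<count>\d+) pending RPCs at cycle end"
)


def pending_rpc_counts(lines):
    """Return (timestamp, count) pairs for every matching log line."""
    results = []
    for line in lines:
        match = PENDING_RPC.match(line.strip())
        if match:
            results.append((match.group("ts"), int(match.group("count"))))
    return results


# Sample input copied from the message; the second line is truncated there.
sample = [
    "[2019-06-25T11:35:31.159] sched: 157 pending RPCs at cycle end, "
    "consider configuring max_rpc_cnt",
    "[2019-06-25T11:35:43.007] sched: 193 pending RPCs at cycle end, consider confi",
    "some unrelated log line",
]

if __name__ == "__main__":
    for ts, count in pending_rpc_counts(sample):
        print(ts, count)
```

To run it against a real log, pass an open file object instead of `sample`; the log path depends on the site's configuration.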

Re: [slurm-users] Slurm: Socket timed out on send/recv operation - slurm 17.02.2

2019-06-25 Thread Marcelo Garcia
Hi, This seems to be a problem we discussed a few days ago: https://lists.schedmd.com/pipermail/slurm-users/2019-June/003524.html But in that thread I think we were using slurm with workflow managers. It's interesting that you have the problem after adding the second server and with an NFS share. Do you

[slurm-users] Slurm: Socket timed out on send/recv operation - slurm 17.02.2

2019-06-25 Thread Buckley, Ronan
Hi, Since configuring a backup slurm controller (including moving the StateSaveLocation from a local disk to an NFS share), we are seeing these errors in the slurmctld logs on a regular basis: Socket timed out on send/recv operation It sometimes occurs when a job array is started and squeue wil