Hi Trevor,

thank you very much, we'll give it a try.

ale

----- Original Message -----
> From: "Trevor Cooper" <tcoo...@sdsc.edu>
> To: "Slurm User Community List" <slurm-users@lists.schedmd.com>
> Sent: Tuesday, January 16, 2018 12:10:21 AM
> Subject: Re: [slurm-users] slurm 17.11.2: Socket timed out on send/recv
> operation
>
> Alessandro,
>
> You might want to consider tracking your Slurm scheduler diagnostics
> output with some type of time-series monitoring system. The
> time-based history has proven more helpful at times than log
> contents by themselves.
>
> See Giovanni Torres' post on setting this up...
>
> http://giovannitorres.me/graphing-sdiag-with-graphite.html
>
> -- Trevor
>
> > On Jan 15, 2018, at 4:33 AM, Alessandro Federico
> > <a.feder...@cineca.it> wrote:
> >
> > Hi John,
> >
> > thanks for the info.
> > slurmctld doesn't report anything about the server thread count in
> > the logs, and sdiag shows only 3 server threads.
> >
> > We changed the MessageTimeout value to 20.
> >
> > I'll let you know if it solves the problem.
> >
> > Thanks
> > ale
> >
> > ----- Original Message -----
> >> From: "John DeSantis" <desan...@usf.edu>
> >> To: "Alessandro Federico" <a.feder...@cineca.it>
> >> Cc: slurm-users@lists.schedmd.com, "Isabella Baccarelli"
> >> <i.baccare...@cineca.it>, hpc-sysmgt-i...@cineca.it
> >> Sent: Friday, January 12, 2018 7:58:38 PM
> >> Subject: Re: [slurm-users] slurm 17.11.2: Socket timed out on
> >> send/recv operation
> >>
> >> Ciao Alessandro,
> >>
> >>> Do we have to apply any particular setting to avoid incurring the
> >>> problem?
> >>
> >> What is your "MessageTimeout" value in slurm.conf? If it's at the
> >> default of 10, try changing it to 20.
> >>
> >> I'd also check and see if the slurmctld log is reporting anything
> >> pertaining to the server thread count being over its limit.
> >>
> >> HTH,
> >> John DeSantis
> >>
> >> On Fri, 12 Jan 2018 11:32:57 +0100
> >> Alessandro Federico <a.feder...@cineca.it> wrote:
> >>
> >>> Hi all,
> >>>
> >>> we are setting up SLURM 17.11.2 on a small test cluster of about
> >>> 100 nodes. Sometimes we get the error in the subject when running
> >>> any SLURM command (e.g. sinfo, squeue, scontrol reconf, etc...)
> >>>
> >>> Do we have to apply any particular setting to avoid incurring the
> >>> problem?
> >>>
> >>> We found this bug report
> >>> https://bugs.schedmd.com/show_bug.cgi?id=4002 but it regards the
> >>> previous SLURM version and we do not set debug3 on slurmctld.
> >>>
> >>> thanks in advance
> >>> ale
> >>>
> >
> > --
> > Alessandro Federico
> > HPC System Management Group
> > System & Technology Department
> > CINECA www.cineca.it
> > Via dei Tizii 6, 00185 Rome - Italy
> > phone: +39 06 44486708

--
Alessandro Federico
HPC System Management Group
System & Technology Department
CINECA www.cineca.it
Via dei Tizii 6, 00185 Rome - Italy
phone: +39 06 44486708
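
For reference, here is a minimal sketch of the sdiag tracking Trevor suggests above. It is not the script from the linked post. It assumes that sdiag prints its counters as "Label: <integer>" lines (for example "Server thread count: 3") and that a Graphite/Carbon plaintext listener is reachable on TCP port 2003. The function names, the "slurm.sdiag" metric prefix, and the host passed to collect_once() are all hypothetical. Labels that repeat across sdiag sections are collapsed to their first occurrence, which is a simplification.

```python
import re
import socket
import subprocess
import time

# Matches "Some label: 123" lines. Assumption: sdiag prints counters in this shape.
_COUNTER = re.compile(r"^\s*([A-Za-z][A-Za-z0-9 ()/_-]*?):\s+(-?\d+)\s*$")


def parse_sdiag(text):
    """Return {metric_name: int} for every 'Label: <integer>' line in text."""
    metrics = {}
    for line in text.splitlines():
        match = _COUNTER.match(line)
        if match:
            # "Server thread count" -> "server_thread_count"
            name = re.sub(r"[^a-z0-9]+", "_", match.group(1).lower()).strip("_")
            # Keep only the first occurrence of a repeated label.
            metrics.setdefault(name, int(match.group(2)))
    return metrics


def to_graphite_lines(metrics, prefix="slurm.sdiag", timestamp=None):
    """Format metrics as Graphite plaintext lines: '<path> <value> <epoch>'."""
    ts = int(time.time() if timestamp is None else timestamp)
    return [f"{prefix}.{name} {value} {ts}" for name, value in sorted(metrics.items())]


def send_to_graphite(lines, host, port=2003):
    """Send the formatted lines to a Carbon plaintext listener over TCP."""
    payload = ("\n".join(lines) + "\n").encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)


def collect_once(host):
    """Run sdiag once and push its integer counters to Graphite."""
    out = subprocess.run(
        ["sdiag"], capture_output=True, text=True, check=True
    ).stdout
    send_to_graphite(to_graphite_lines(parse_sdiag(out)), host)
```

Calling collect_once("graphite.example.org") from cron every minute or so (the host name is a placeholder) would build up the time-based history. Graphing server_thread_count next to the times the "Socket timed out" error shows up is one way to check whether the two are related.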