-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

Ciao Alessandro,

> setting MessageTimeout to 20 didn't solve it :(
>
> looking at slurmctld logs I noticed many warning like these
>
> Jan 16 05:11:00 r000u17l01 slurmctld[22307]: Warning: Note very large
> processing time from _slurm_rpc_dump_partitions: usec=42850604
> began=05:10:17.289
> Jan 16 05:20:58 r000u17l01 slurmctld[22307]: Warning: Note very large
> processing time from load_part_uid_allow_list: usec=44861325
> began=05:20:13.257
> Jan 16 05:20:58 r000u17l01 slurmctld[22307]: Warning: Note very large
> processing time from _slurmctld_background: usec=44861653
> began=05:20:13.257

And:

> 271 Note very large processing time from _slurm_rpc_dump_partitions:
> 67 Note very large processing time from load_part_uid_allow_list:

I believe these values are in microseconds, so that is an average of
about 44 seconds per call, mostly related to partition information.
Given that our configuration has the maximum value of 90 seconds set,
I'd again recommend another adjustment to MessageTimeout, perhaps to
60 seconds.

I'm not sure if redefining your partitions will help, but you do have
several partitions which contain the same set of nodes and could be
condensed, decreasing the number of partitions. For example, the
partitions bdw_all_serial & bdw_all_rcm could be consolidated into a
single partition by:

1.) Using AllowQOS=bdw_all_serial,bdw_all_rcm;
2.) Setting MaxTime to 04:00:00 and defining a MaxWall via each QOS
    (since one partition has 04:00:00 and the other 03:00:00).

The same could be done for the partitions skl_fua_{prod,bprod,lprod} as
well.
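
To double-check the microsecond reading of the usec= values, here is a
minimal Python sketch that converts them to seconds and averages them per
function. The function name mean_seconds_by_function and the regular
expression are illustrative only; the pattern is inferred from the three
log lines quoted above, not from any Slurm specification, so treat it as
an assumption.

```python
import re
from collections import defaultdict

# Pattern inferred from the three warnings quoted in this thread
# (an assumption, not an official slurmctld log format).
WARNING_RE = re.compile(
    r"Note very large processing time from (?P<func>\w+): usec=(?P<usec>\d+)"
)


def mean_seconds_by_function(lines):
    """Return {function name: mean processing time in seconds}."""
    samples = defaultdict(list)
    for line in lines:
        match = WARNING_RE.search(line)
        if match:
            # usec is assumed to be microseconds; convert to seconds.
            samples[match.group("func")].append(int(match.group("usec")) / 1_000_000)
    return {func: sum(vals) / len(vals) for func, vals in samples.items()}


# The three log entries from the excerpt, one entry per string.
SAMPLE_LOG = [
    "Jan 16 05:11:00 r000u17l01 slurmctld[22307]: Warning: Note very large "
    "processing time from _slurm_rpc_dump_partitions: usec=42850604 "
    "began=05:10:17.289",
    "Jan 16 05:20:58 r000u17l01 slurmctld[22307]: Warning: Note very large "
    "processing time from load_part_uid_allow_list: usec=44861325 "
    "began=05:20:13.257",
    "Jan 16 05:20:58 r000u17l01 slurmctld[22307]: Warning: Note very large "
    "processing time from _slurmctld_background: usec=44861653 "
    "began=05:20:13.257",
]

if __name__ == "__main__":
    for func, secs in sorted(mean_seconds_by_function(SAMPLE_LOG).items()):
        print(f"{func}: {secs:.1f} s")
```

With only these three samples each "mean" is a single value (roughly 43
to 45 seconds). Run over a full day of journal output, with log entries
kept one per line, it should show whether the partition-related calls
really dominate.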

HTH,
John DeSantis

On Tue, 16 Jan 2018 11:22:44 +0100
Alessandro Federico <a.feder...@cineca.it> wrote:

> Hi,
>
> setting MessageTimeout to 20 didn't solve it :(
>
> looking at slurmctld logs I noticed many warning like these
>
> Jan 16 05:11:00 r000u17l01 slurmctld[22307]: Warning: Note very large
> processing time from _slurm_rpc_dump_partitions: usec=42850604
> began=05:10:17.289
> Jan 16 05:20:58 r000u17l01 slurmctld[22307]: Warning: Note very large
> processing time from load_part_uid_allow_list: usec=44861325
> began=05:20:13.257
> Jan 16 05:20:58 r000u17l01 slurmctld[22307]: Warning: Note very large
> processing time from _slurmctld_background: usec=44861653
> began=05:20:13.257
>
> they are generated in many functions:
>
> [root@r000u17l01 ~]# journalctl -u slurmctld --since='2018-01-16 00:00:00' | grep -oP 'Note very large processing time from \w+:' | sort | uniq -c
>       4 Note very large processing time from dump_all_job_state:
>      67 Note very large processing time from load_part_uid_allow_list:
>      67 Note very large processing time from _slurmctld_background:
>       7 Note very large processing time from _slurm_rpc_complete_batch_script:
>       4 Note very large processing time from _slurm_rpc_dump_jobs:
>       3 Note very large processing time from _slurm_rpc_dump_job_user:
>     271 Note very large processing time from _slurm_rpc_dump_partitions:
>       5 Note very large processing time from _slurm_rpc_epilog_complete:
>       1 Note very large processing time from _slurm_rpc_job_pack_alloc_info:
>       3 Note very large processing time from _slurm_rpc_step_complete:
>
> processing times are always around tens of seconds.
>
> I'm attaching sdiag output and slurm.conf.
>
> thanks
> ale
>
> ----- Original Message -----
> > From: "Trevor Cooper" <tcoo...@sdsc.edu>
> > To: "Slurm User Community List" <slurm-users@lists.schedmd.com>
> > Sent: Tuesday, January 16, 2018 12:10:21 AM
> > Subject: Re: [slurm-users] slurm 17.11.2: Socket timed out on
> > send/recv operation
> >
> > Alessandro,
> >
> > You might want to consider tracking your Slurm scheduler diagnostics
> > output with some type of time-series monitoring system. The
> > time-based history has proven more helpful at times than log
> > contents by themselves.
> >
> > See Giovanni Torres' post on setting this up...
> >
> > http://giovannitorres.me/graphing-sdiag-with-graphite.html
> >
> > -- Trevor
> >
> > > On Jan 15, 2018, at 4:33 AM, Alessandro Federico
> > > <a.feder...@cineca.it> wrote:
> > >
> > > Hi John
> > >
> > > thanks for the info.
> > > slurmctld doesn't report anything about the server thread count in
> > > the logs
> > > and sdiag show only 3 server threads.
> > >
> > > We changed the MessageTimeout value to 20.
> > >
> > > I'll let you know if it solves the problem.
> > >
> > > Thanks
> > > ale
> > >
> > > ----- Original Message -----
> > >> From: "John DeSantis" <desan...@usf.edu>
> > >> To: "Alessandro Federico" <a.feder...@cineca.it>
> > >> Cc: slurm-users@lists.schedmd.com, "Isabella Baccarelli"
> > >> <i.baccare...@cineca.it>, hpc-sysmgt-i...@cineca.it
> > >> Sent: Friday, January 12, 2018 7:58:38 PM
> > >> Subject: Re: [slurm-users] slurm 17.11.2: Socket timed out on
> > >> send/recv operation
> > >>
> > >> Ciao Alessandro,
> > >>
> > >>> Do we have to apply any particular setting to avoid incurring
> > >>> the problem?
> > >>
> > >> What is your "MessageTimeout" value in slurm.conf? If it's at
> > >> the default of 10, try changing it to 20.
> > >>
> > >> I'd also check and see if the slurmctld log is reporting anything
> > >> pertaining to the server thread count being over its limit.
> > >>
> > >> HTH,
> > >> John DeSantis
> > >>
> > >> On Fri, 12 Jan 2018 11:32:57 +0100
> > >> Alessandro Federico <a.feder...@cineca.it> wrote:
> > >>
> > >>> Hi all,
> > >>>
> > >>>
> > >>> we are setting up SLURM 17.11.2 on a small test cluster of about
> > >>> 100 nodes. Sometimes we get the error in the subject when running
> > >>> any SLURM command (e.g. sinfo, squeue, scontrol reconf, etc...)
> > >>>
> > >>>
> > >>> Do we have to apply any particular setting to avoid incurring
> > >>> the problem?
> > >>>
> > >>>
> > >>> We found this bug report
> > >>> https://bugs.schedmd.com/show_bug.cgi?id=4002 but it regards the
> > >>> previous SLURM version and we do not set debug3 on slurmctld.
> > >>>
> > >>>
> > >>> thanks in advance
> > >>> ale
> > >>>
> > >>
> > >>
> > >
> > > --
> > > Alessandro Federico
> > > HPC System Management Group
> > > System & Technology Department
> > > CINECA www.cineca.it
> > > Via dei Tizii 6, 00185 Rome - Italy
> > > phone: +39 06 44486708
> > >
> >
> >

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2

iQEcBAEBCgAGBQJaXjxzAAoJEEmckBqrs5nB9FQH/Rq6avZRXV0r1qQhSBH514J6
vHWzGAgVSvBrpxFrtfu3aVTK6fk3bFahB9t2jtVJlg0HgO8dm3Gj6FMNo0nDyemD
NlIePvvXGwZYXeXlif+OtCTu/3fOqvuol1jX8/iXcG89Lm+HA92BhLKPYoqzWsK4
KQ/m8Mlj91Ei3GRZorZfyZrRrfAYNatIV2plmRaGWmuH39MEwQ0bF/qQhci/LAXB
xquAZWAVeSE1uWThXPS4sbzmHjNuenT9RqlGtgQOEMO4z/bHFQwmMVuxqfmS537h
/93icpAcWhJQ1bYe51ePykWk3Jkv901Z7Cr6bG1+hu2asN1loFzz38YugHUcfBs=
=VWA7
-----END PGP SIGNATURE-----