Matthieu,

> I would bet on something like LDAP requests taking too much time
> because of a missing sssd cache.

Good point!  It's easy to forget to check something as "simple" as
user look-up when something is taking "too long".
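A quick way to confirm that, assuming sssd is sitting in front of
LDAP, is to time the same look-up cold and warm; the user name below
is a placeholder:

    # Expire the sssd caches, then time the look-up twice.  A large
    # gap between the two runs points at LDAP latency, not slurmctld.
    sss_cache -E
    time id someuser     # cold: goes out to LDAP
    time id someuser     # warm: served from the sssd cache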
John DeSantis

On Tue, 16 Jan 2018 19:13:06 +0100
Matthieu Hautreux <matthieu.hautr...@gmail.com> wrote:

> Hi,
>
> In this kind of issue, one good thing to do is to get a backtrace of
> slurmctld during the slowdown.  You should thus easily identify the
> subcomponent responsible for the issue.
>
> I would bet on something like LDAP requests taking too much time
> because of a missing sssd cache.
>
> Regards
> Matthieu
>
> Le 16 janv. 2018 18:59, "John DeSantis" <desan...@usf.edu> a écrit :
>
> > Ciao Alessandro,
> >
> > > setting MessageTimeout to 20 didn't solve it :(
> > >
> > > looking at slurmctld logs I noticed many warnings like these
> > >
> > > Jan 16 05:11:00 r000u17l01 slurmctld[22307]: Warning: Note very
> > > large processing time from _slurm_rpc_dump_partitions:
> > > usec=42850604 began=05:10:17.289
> > > Jan 16 05:20:58 r000u17l01 slurmctld[22307]: Warning: Note very
> > > large processing time from load_part_uid_allow_list:
> > > usec=44861325 began=05:20:13.257
> > > Jan 16 05:20:58 r000u17l01 slurmctld[22307]: Warning: Note very
> > > large processing time from _slurmctld_background:
> > > usec=44861653 began=05:20:13.257
> >
> > And:
> >
> > > 271 Note very large processing time from _slurm_rpc_dump_partitions:
> > >  67 Note very large processing time from load_part_uid_allow_list:
> >
> > I believe these values are in microseconds, so an average of 44
> > seconds per call, mostly related to partition information.  Given
> > that our configuration has the maximum value set of 90 seconds, I'd
> > again recommend another adjustment, perhaps to 60 seconds.
> >
> > I'm not sure if redefining your partitions will help, but you do
> > have several partitions which contain the same set of nodes that
> > could be condensed, decreasing the number of partitions.  For
> > example, the partitions bdw_all_serial & bdw_all_rcm could be
> > consolidated into a single partition by:
> >
> > 1.) Using AllowQOS=bdw_all_serial,bdw_all_rcm;
> > 2.) Setting MaxTime to 04:00:00 and defining a MaxWall via each QOS
> >     (since one partition has 04:00:00 and the other 03:00:00).
> >
> > The same could be done for the partitions
> > skl_fua_{prod,bprod,lprod} as well.
> >
> > HTH,
> > John DeSantis
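As a rough sketch of that consolidation (the node list is a
placeholder, and since the thread only says that one partition allows
04:00:00 and the other 03:00:00, the per-QOS mapping below is
illustrative):

    # slurm.conf -- one partition replacing bdw_all_serial and bdw_all_rcm
    PartitionName=bdw_all Nodes=<bdw-node-list> MaxTime=04:00:00 AllowQOS=bdw_all_serial,bdw_all_rcm State=UP

    # The per-partition walltime caps move to the QOS level:
    sacctmgr modify qos bdw_all_serial set MaxWall=04:00:00
    sacctmgr modify qos bdw_all_rcm set MaxWall=03:00:00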
> >
> > On Tue, 16 Jan 2018 11:22:44 +0100
> > Alessandro Federico <a.feder...@cineca.it> wrote:
> >
> > > Hi,
> > >
> > > setting MessageTimeout to 20 didn't solve it :(
> > >
> > > looking at slurmctld logs I noticed many warnings like these
> > >
> > > Jan 16 05:11:00 r000u17l01 slurmctld[22307]: Warning: Note very
> > > large processing time from _slurm_rpc_dump_partitions:
> > > usec=42850604 began=05:10:17.289
> > > Jan 16 05:20:58 r000u17l01 slurmctld[22307]: Warning: Note very
> > > large processing time from load_part_uid_allow_list:
> > > usec=44861325 began=05:20:13.257
> > > Jan 16 05:20:58 r000u17l01 slurmctld[22307]: Warning: Note very
> > > large processing time from _slurmctld_background:
> > > usec=44861653 began=05:20:13.257
> > >
> > > they are generated in many functions:
> > >
> > > [root@r000u17l01 ~]# journalctl -u slurmctld \
> > >     --since='2018-01-16 00:00:00' \
> > >     | grep -oP 'Note very large processing time from \w+:' \
> > >     | sort | uniq -c
> > >       4 Note very large processing time from dump_all_job_state:
> > >      67 Note very large processing time from load_part_uid_allow_list:
> > >      67 Note very large processing time from _slurmctld_background:
> > >       7 Note very large processing time from _slurm_rpc_complete_batch_script:
> > >       4 Note very large processing time from _slurm_rpc_dump_jobs:
> > >       3 Note very large processing time from _slurm_rpc_dump_job_user:
> > >     271 Note very large processing time from _slurm_rpc_dump_partitions:
> > >       5 Note very large processing time from _slurm_rpc_epilog_complete:
> > >       1 Note very large processing time from _slurm_rpc_job_pack_alloc_info:
> > >       3 Note very large processing time from _slurm_rpc_step_complete:
> > >
> > > processing times are always around tens of seconds.
> > >
> > > I'm attaching sdiag output and slurm.conf.
> > >
> > > thanks
> > > ale
> > >
> > > ----- Original Message -----
> > > > From: "Trevor Cooper" <tcoo...@sdsc.edu>
> > > > To: "Slurm User Community List" <slurm-users@lists.schedmd.com>
> > > > Sent: Tuesday, January 16, 2018 12:10:21 AM
> > > > Subject: Re: [slurm-users] slurm 17.11.2: Socket timed out on
> > > > send/recv operation
> > > >
> > > > Alessandro,
> > > >
> > > > You might want to consider tracking your Slurm scheduler
> > > > diagnostics output with some type of time-series monitoring
> > > > system.  The time-based history has proven more helpful at
> > > > times than log contents by themselves.
> > > >
> > > > See Giovanni Torres' post on setting this up...
> > > >
> > > > http://giovannitorres.me/graphing-sdiag-with-graphite.html
> > > >
> > > > -- Trevor
> > > >
> > > > > On Jan 15, 2018, at 4:33 AM, Alessandro Federico
> > > > > <a.feder...@cineca.it> wrote:
> > > > >
> > > > > Hi John
> > > > >
> > > > > thanks for the info.
> > > > > slurmctld doesn't report anything about the server thread
> > > > > count in the logs, and sdiag shows only 3 server threads.
> > > > >
> > > > > We changed the MessageTimeout value to 20.
> > > > > I'll let you know if it solves the problem.
> > > > >
> > > > > Thanks
> > > > > ale
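Matthieu's backtrace suggestion up-thread is the quickest way to see
what those 44-second calls are actually doing.  A minimal,
non-interactive capture while one is in flight might look like this
(the output path is arbitrary):

    # Dump every thread's stack from the running slurmctld, then detach.
    gdb -p $(pidof slurmctld) -batch -ex 'thread apply all bt' \
        > /tmp/slurmctld-backtrace.txt

    # Threads sitting in getpwnam_r()/nss/sss calls underneath
    # load_part_uid_allow_list would confirm the user look-up theory.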
> > > > >
> > > > > ----- Original Message -----
> > > > >> From: "John DeSantis" <desan...@usf.edu>
> > > > >> To: "Alessandro Federico" <a.feder...@cineca.it>
> > > > >> Cc: slurm-users@lists.schedmd.com, "Isabella Baccarelli"
> > > > >> <i.baccare...@cineca.it>, hpc-sysmgt-i...@cineca.it
> > > > >> Sent: Friday, January 12, 2018 7:58:38 PM
> > > > >> Subject: Re: [slurm-users] slurm 17.11.2: Socket timed out on
> > > > >> send/recv operation
> > > > >>
> > > > >> Ciao Alessandro,
> > > > >>
> > > > >>> Do we have to apply any particular setting to avoid
> > > > >>> running into the problem?
> > > > >>
> > > > >> What is your "MessageTimeout" value in slurm.conf?  If it's
> > > > >> at the default of 10, try changing it to 20.
> > > > >>
> > > > >> I'd also check and see if the slurmctld log is reporting
> > > > >> anything pertaining to the server thread count being over
> > > > >> its limit.
> > > > >>
> > > > >> HTH,
> > > > >> John DeSantis
> > > > >>
> > > > >> On Fri, 12 Jan 2018 11:32:57 +0100
> > > > >> Alessandro Federico <a.feder...@cineca.it> wrote:
> > > > >>
> > > > >>> Hi all,
> > > > >>>
> > > > >>> we are setting up SLURM 17.11.2 on a small test cluster of
> > > > >>> about 100 nodes.  Sometimes we get the error in the subject
> > > > >>> when running any SLURM command (e.g. sinfo, squeue, scontrol
> > > > >>> reconf, etc.)
> > > > >>>
> > > > >>> Do we have to apply any particular setting to avoid
> > > > >>> running into the problem?
> > > > >>>
> > > > >>> We found this bug report
> > > > >>> https://bugs.schedmd.com/show_bug.cgi?id=4002 but it
> > > > >>> regards the previous SLURM version and we do not set debug3
> > > > >>> on slurmctld.
> > > > >>>
> > > > >>> thanks in advance
> > > > >>> ale
> > > > >
> > > > > --
> > > > > Alessandro Federico
> > > > > HPC System Management Group
> > > > > System & Technology Department
> > > > > CINECA www.cineca.it
> > > > > Via dei Tizii 6, 00185 Rome - Italy
> > > > > phone: +39 06 44486708
> > > > >
> > > > > All work and no play makes Jack a dull boy.
> > > > > All work and no play makes Jack a dull boy.
> > > > > All work and no play makes Jack...
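For reference, the timeout John mentions can be checked on a live
system and raised in slurm.conf; the values below are simply the ones
discussed in this thread:

    # Current value, without touching the daemon:
    scontrol show config | grep MessageTimeout
    # MessageTimeout          = 10 sec

    # slurm.conf -- raise it, then apply with 'scontrol reconfigure':
    MessageTimeout=20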