We put the SSSD caches on a RAM disk, which helped a little bit with performance.
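For anyone who wants to try the same thing, here is a minimal sketch of what that can look like. The cache path is sssd's usual /var/lib/sss/db; the size, mount options and stop/start sequence are assumptions rather than a tested recipe, and the cache has to be rebuilt from the identity backend after every reboot:

    # /etc/fstab - keep the sssd cache on RAM-backed tmpfs (512m is an arbitrary example size)
    tmpfs   /var/lib/sss/db   tmpfs   rw,size=512m,mode=0700   0 0

    # apply without a reboot; sssd recreates its cache files in the now-empty directory
    systemctl stop sssd
    mount /var/lib/sss/db
    systemctl start sssd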
----- On 22 Jan, 2018, at 02:38, Alessandro Federico a.feder...@cineca.it wrote:

| Hi John,
|
| Just an update... we do not have a solution for the SSSD issue yet, but we changed the ACL on the 2 partitions from AllowGroups=g2 to AllowAccounts=g2 and the slowdown is gone.
|
| Thanks for the help
| ale
|
| ----- Original Message -----
|> From: "Alessandro Federico" <a.feder...@cineca.it>
|> To: "John DeSantis" <desan...@usf.edu>
|> Cc: hpc-sysmgt-i...@cineca.it, "Slurm User Community List" <slurm-users@lists.schedmd.com>, "Isabella Baccarelli" <i.baccare...@cineca.it>
|> Sent: Wednesday, January 17, 2018 5:41:54 PM
|> Subject: Re: [slurm-users] slurm 17.11.2: Socket timed out on send/recv operation
|>
|> Hi John,
|>
|> Thanks for the info. We are investigating the slowdown of sssd, and I found some bug reports about slow sssd queries with almost the same backtrace. Hopefully an update of sssd will solve this issue.
|>
|> We'll let you know if we find a solution.
|>
|> thanks
|> ale
|>
|> ----- Original Message -----
|> > From: "John DeSantis" <desan...@usf.edu>
|> > To: "Alessandro Federico" <a.feder...@cineca.it>
|> > Cc: "Slurm User Community List" <slurm-users@lists.schedmd.com>, "Isabella Baccarelli" <i.baccare...@cineca.it>, hpc-sysmgt-i...@cineca.it
|> > Sent: Wednesday, January 17, 2018 3:30:43 PM
|> > Subject: Re: [slurm-users] slurm 17.11.2: Socket timed out on send/recv operation
|> >
|> > Ale,
|> >
|> > > As Matthieu said it seems something related to SSS daemon.
|> >
|> > That was a great catch by Matthieu.
|> >
|> > > Moreover, only 3 SLURM partitions have the AllowGroups ACL
|> >
|> > Correct, which may seem negligible, but after each `scontrol reconfigure`, slurmctld restart, and/or AllowGroups= partition update, the mapping of UIDs for each group will be updated.
|> >
|> > > So why does the UID-GID mapping take so long?
|> >
|> > We attempted to use "AllowGroups" previously, but we found (even with sssd.conf tuning regarding enumeration) that unless the group was local (/etc/group), we experienced delays before the AllowGroups parameter was respected. This is why we opted to use SLURM's AllowQOS/AllowAccounts instead.
|> >
|> > You would have to enable debugging on your remote authentication software to see where the hang-up is occurring (if it is that at all, and not just a delay within slurmctld).
|> >
|> > Given the direction that this is going - why not replace the "AllowGroups" with either a simple "AllowAccounts=" or "AllowQOS="?
|> >
|> > > @John: we defined many partitions on the same nodes but in the production cluster they will be more or less split across the 6K nodes.
|> >
|> > Ok, that makes sense. Looking initially at your partition definitions, I immediately thought of being DRY, especially since the "finer" tuning between the partitions could easily be controlled via the QOS' allowed to access the resources.
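For reference, the fix described at the top of this thread boils down to swapping the UNIX-group ACL for an accounting-based one. A minimal sketch, using the partition and group names that appear in the thread, and assuming a matching account exists (or is created) in the accounting database, with "..." standing for the other partition options shown further down:

    # slurm.conf - before:
    #   PartitionName=bdw_fua_gw Nodes=r040c03s0[1,2] ... AllowGroups=g2
    # after:
    PartitionName=bdw_fua_gw Nodes=r040c03s0[1,2] ... AllowAccounts=g2

    # accounting side, only if the account/associations are missing
    # (the user name is taken from the backtrace below and is illustrative)
    sacctmgr add account g2 Description="g2 users"
    sacctmgr add user g2bottin Account=g2
    scontrol reconfigure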
|> >
|> > John DeSantis
|> >
|> > On Wed, 17 Jan 2018 13:20:49 +0100 Alessandro Federico <a.feder...@cineca.it> wrote:
|> >
|> > > Hi Matthieu & John,
|> > >
|> > > this is the backtrace of slurmctld during the slowdown:
|> > >
|> > > (gdb) bt
|> > > #0  0x00007fb0e8b1e69d in poll () from /lib64/libc.so.6
|> > > #1  0x00007fb0e8617bfa in sss_cli_make_request_nochecks () from /lib64/libnss_sss.so.2
|> > > #2  0x00007fb0e86185a3 in sss_nss_make_request () from /lib64/libnss_sss.so.2
|> > > #3  0x00007fb0e8619104 in _nss_sss_getpwnam_r () from /lib64/libnss_sss.so.2
|> > > #4  0x00007fb0e8aef07d in getpwnam_r@@GLIBC_2.2.5 () from /lib64/libc.so.6
|> > > #5  0x00007fb0e9360986 in _getpwnam_r (result=<optimized out>, bufsiz=<optimized out>, buf=<optimized out>, pwd=<optimized out>, name=<optimized out>) at uid.c:73
|> > > #6  uid_from_string (name=0x1820e41 "g2bottin", uidp=uidp@entry=0x7fff07f03a6c) at uid.c:111
|> > > #7  0x000000000043587d in get_group_members (group_name=0x10ac500 "g2") at groups.c:139
|> > > #8  0x000000000047525a in _get_groups_members (group_names=<optimized out>) at partition_mgr.c:2006
|> > > #9  0x0000000000475505 in _update_part_uid_access_list (x=0x7fb03401e650, arg=0x7fff07f13bf4) at partition_mgr.c:1930
|> > > #10 0x00007fb0e92ab675 in list_for_each (l=0x1763e50, f=f@entry=0x4754d8 <_update_part_uid_access_list>, arg=arg@entry=0x7fff07f13bf4) at list.c:420
|> > > #11 0x000000000047911a in load_part_uid_allow_list (force=1) at partition_mgr.c:1971
|> > > #12 0x0000000000428e5c in _slurmctld_background (no_data=0x0) at controller.c:1911
|> > > #13 main (argc=<optimized out>, argv=<optimized out>) at controller.c:601
|> > >
|> > > As Matthieu said, it seems to be something related to the SSS daemon. However, we don't notice any slowdown due to SSSD in our environment.
|> > >
|> > > As I told you before, we are just testing SLURM on a small 100-node cluster before going into production with about 6000 nodes next Wednesday. At present the other nodes are managed by PBSPro, and the 2 PBS servers are running on the same nodes as the SLURM controllers. PBS queues are also configured with user/group ACLs and we never noticed any similar slowdown.
|> > >
|> > > Moreover, only 3 SLURM partitions have the AllowGroups ACL:
|> > >
|> > > [root@mgmt01 slurm]# grep AllowGroups slurm.conf
|> > > PartitionName=bdw_fua_gwdbg Nodes=r040c03s0[1,2] Default=NO DefMemPerCPU=3000 DefaultTime=00:30:00 MaxTime=00:30:00 State=UP QOS=bdw_fua_gwdbg DenyQos=bdw_qos_special AllowGroups=g2
|> > > PartitionName=bdw_fua_gw Nodes=r040c03s0[1,2] Default=NO DefMemPerCPU=3000 DefaultTime=00:30:00 MaxTime=48:00:00 State=UP QOS=bdw_fua_gw DenyQos=bdw_qos_special AllowGroups=g2
|> > > PartitionName=bdw_fua_gwg2 Nodes=r040c03s0[1,2] Default=NO DefMemPerCPU=3000 DefaultTime=00:30:00 MaxTime=168:00:00 State=UP QOS=bdw_fua_gwg2 DenyQos=bdw_qos_special AllowGroups=g2
|> > >
|> > > So why does the UID-GID mapping take so long?
|> > >
|> > > @John: we defined many partitions on the same nodes, but in the production cluster they will be more or less split across the 6K nodes.
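The backtrace above shows load_part_uid_allow_list() expanding the g2 group and resolving each member with getpwnam(), i.e. one NSS/sssd lookup per user. A rough sketch for timing those same lookups by hand on the controller node (group name taken from the thread; sss_cache -E, if the sssd tools are installed, empties the cache first to reproduce the cold-cache case):

    # flush the sssd cache (run as root) so the timings reflect the worst case
    sss_cache -E

    # time the group expansion and the per-member passwd lookups slurmctld performs
    time getent group g2
    for u in $(getent group g2 | cut -d: -f4 | tr ',' ' '); do
        echo -n "$u: "; ( time getent passwd "$u" > /dev/null ) 2>&1 | grep real
    done

If the per-user lookups are slow only when the cache is cold, raising the sssd cache lifetime or pre-warming the cache is the sort of thing worth testing before changing the Slurm configuration.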
|> > >
|> > > thank you very much
|> > > ale
|> > >
|> > > ----- Original Message -----
|> > > > From: "John DeSantis" <desan...@usf.edu>
|> > > > To: "Matthieu Hautreux" <matthieu.hautr...@gmail.com>
|> > > > Cc: hpc-sysmgt-i...@cineca.it, "Slurm User Community List" <slurm-users@lists.schedmd.com>, "Isabella Baccarelli" <i.baccare...@cineca.it>
|> > > > Sent: Tuesday, January 16, 2018 8:20:20 PM
|> > > > Subject: Re: [slurm-users] slurm 17.11.2: Socket timed out on send/recv operation
|> > > >
|> > > > Matthieu,
|> > > >
|> > > > > I would bet on something like LDAP requests taking too much time because of a missing sssd cache.
|> > > >
|> > > > Good point! It's easy to forget to check something as "simple" as user look-up when something is taking "too long".
|> > > >
|> > > > John DeSantis
|> > > >
|> > > > On Tue, 16 Jan 2018 19:13:06 +0100 Matthieu Hautreux <matthieu.hautr...@gmail.com> wrote:
|> > > >
|> > > > > Hi,
|> > > > >
|> > > > > In this kind of issue, one good thing to do is to get a backtrace of slurmctld during the slowdown. You should thus easily identify the subcomponent responsible for the issue.
|> > > > >
|> > > > > I would bet on something like LDAP requests taking too much time because of a missing sssd cache.
|> > > > >
|> > > > > Regards,
|> > > > > Matthieu
|> > > > >
|> > > > > On 16 Jan 2018 at 18:59, "John DeSantis" <desan...@usf.edu> wrote:
|> > > > >
|> > > > > > Ciao Alessandro,
|> > > > > >
|> > > > > > > setting MessageTimeout to 20 didn't solve it :(
|> > > > > > >
|> > > > > > > looking at slurmctld logs I noticed many warnings like these
|> > > > > > >
|> > > > > > > Jan 16 05:11:00 r000u17l01 slurmctld[22307]: Warning: Note very large processing time from _slurm_rpc_dump_partitions: usec=42850604 began=05:10:17.289
|> > > > > > > Jan 16 05:20:58 r000u17l01 slurmctld[22307]: Warning: Note very large processing time from load_part_uid_allow_list: usec=44861325 began=05:20:13.257
|> > > > > > > Jan 16 05:20:58 r000u17l01 slurmctld[22307]: Warning: Note very large processing time from _slurmctld_background: usec=44861653 began=05:20:13.257
|> > > > > >
|> > > > > > And:
|> > > > > >
|> > > > > > > 271 Note very large processing time from _slurm_rpc_dump_partitions:
|> > > > > > >  67 Note very large processing time from load_part_uid_allow_list:
|> > > > > >
|> > > > > > I believe these values are in microseconds, so an average of 44 seconds per call, mostly related to partition information. Given that our configuration has the maximum value set of 90 seconds, I'd again recommend another adjustment, perhaps to 60 seconds.
|> > > > > >
|> > > > > > I'm not sure if redefining your partitions will help, but you do have several partitions which contain the same set of nodes that could be condensed, decreasing the number of partitions.
|> > > > > > For example, the partitions bdw_all_serial & bdw_all_rcm could be consolidated into a single partition by:
|> > > > > >
|> > > > > > 1.) Using AllowQOS=bdw_all_serial,bdw_all_rcm;
|> > > > > > 2.) Setting MaxTime to 04:00:00 and defining a MaxWall via each QOS (since one partition has 04:00:00 and the other 03:00:00).
|> > > > > >
|> > > > > > The same could be done for the partitions skl_fua_{prod,bprod,lprod} as well.
|> > > > > >
|> > > > > > HTH,
|> > > > > > John DeSantis
|> > > > > >
|> > > > > > On Tue, 16 Jan 2018 11:22:44 +0100 Alessandro Federico <a.feder...@cineca.it> wrote:
|> > > > > >
|> > > > > > > Hi,
|> > > > > > >
|> > > > > > > setting MessageTimeout to 20 didn't solve it :(
|> > > > > > >
|> > > > > > > Looking at the slurmctld logs I noticed many warnings like these:
|> > > > > > >
|> > > > > > > Jan 16 05:11:00 r000u17l01 slurmctld[22307]: Warning: Note very large processing time from _slurm_rpc_dump_partitions: usec=42850604 began=05:10:17.289
|> > > > > > > Jan 16 05:20:58 r000u17l01 slurmctld[22307]: Warning: Note very large processing time from load_part_uid_allow_list: usec=44861325 began=05:20:13.257
|> > > > > > > Jan 16 05:20:58 r000u17l01 slurmctld[22307]: Warning: Note very large processing time from _slurmctld_background: usec=44861653 began=05:20:13.257
|> > > > > > >
|> > > > > > > They are generated in many functions:
|> > > > > > >
|> > > > > > > [root@r000u17l01 ~]# journalctl -u slurmctld --since='2018-01-16 00:00:00' | grep -oP 'Note very large processing time from \w+:' | sort | uniq -c
|> > > > > > >       4 Note very large processing time from dump_all_job_state:
|> > > > > > >      67 Note very large processing time from load_part_uid_allow_list:
|> > > > > > >      67 Note very large processing time from _slurmctld_background:
|> > > > > > >       7 Note very large processing time from _slurm_rpc_complete_batch_script:
|> > > > > > >       4 Note very large processing time from _slurm_rpc_dump_jobs:
|> > > > > > >       3 Note very large processing time from _slurm_rpc_dump_job_user:
|> > > > > > >     271 Note very large processing time from _slurm_rpc_dump_partitions:
|> > > > > > >       5 Note very large processing time from _slurm_rpc_epilog_complete:
|> > > > > > >       1 Note very large processing time from _slurm_rpc_job_pack_alloc_info:
|> > > > > > >       3 Note very large processing time from _slurm_rpc_step_complete:
|> > > > > > >
|> > > > > > > Processing times are always around tens of seconds.
|> > > > > > >
|> > > > > > > I'm attaching the sdiag output and slurm.conf.
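A sketch of the consolidation John suggests above, with the walltime difference pushed down into the QOS. The merged partition name and node list are placeholders, the QOS names come from John's message, and the QOS limits live in the accounting database:

    # slurm.conf - one partition replacing bdw_all_serial and bdw_all_rcm
    PartitionName=bdw_all Nodes=<bdw nodes> Default=NO MaxTime=04:00:00 State=UP AllowQOS=bdw_all_serial,bdw_all_rcm

    # per-QOS walltime caps, so the 03:00:00 limit still applies to the rcm jobs
    sacctmgr modify qos bdw_all_serial set MaxWall=04:00:00
    sacctmgr modify qos bdw_all_rcm set MaxWall=03:00:00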
|> > > > > > >
|> > > > > > > thanks
|> > > > > > > ale
|> > > > > > >
|> > > > > > > ----- Original Message -----
|> > > > > > > > From: "Trevor Cooper" <tcoo...@sdsc.edu>
|> > > > > > > > To: "Slurm User Community List" <slurm-users@lists.schedmd.com>
|> > > > > > > > Sent: Tuesday, January 16, 2018 12:10:21 AM
|> > > > > > > > Subject: Re: [slurm-users] slurm 17.11.2: Socket timed out on send/recv operation
|> > > > > > > >
|> > > > > > > > Alessandro,
|> > > > > > > >
|> > > > > > > > You might want to consider tracking your Slurm scheduler diagnostics output with some type of time-series monitoring system. The time-based history has proven more helpful at times than log contents by themselves.
|> > > > > > > >
|> > > > > > > > See Giovanni Torres' post on setting this up...
|> > > > > > > >
|> > > > > > > > http://giovannitorres.me/graphing-sdiag-with-graphite.html
|> > > > > > > >
|> > > > > > > > -- Trevor
|> > > > > > > >
|> > > > > > > > > On Jan 15, 2018, at 4:33 AM, Alessandro Federico <a.feder...@cineca.it> wrote:
|> > > > > > > > >
|> > > > > > > > > Hi John,
|> > > > > > > > >
|> > > > > > > > > Thanks for the info. slurmctld doesn't report anything about the server thread count in the logs, and sdiag shows only 3 server threads.
|> > > > > > > > >
|> > > > > > > > > We changed the MessageTimeout value to 20.
|> > > > > > > > >
|> > > > > > > > > I'll let you know if it solves the problem.
|> > > > > > > > >
|> > > > > > > > > Thanks
|> > > > > > > > > ale
|> > > > > > > > >
|> > > > > > > > > ----- Original Message -----
|> > > > > > > > >> From: "John DeSantis" <desan...@usf.edu>
|> > > > > > > > >> To: "Alessandro Federico" <a.feder...@cineca.it>
|> > > > > > > > >> Cc: slurm-users@lists.schedmd.com, "Isabella Baccarelli" <i.baccare...@cineca.it>, hpc-sysmgt-i...@cineca.it
|> > > > > > > > >> Sent: Friday, January 12, 2018 7:58:38 PM
|> > > > > > > > >> Subject: Re: [slurm-users] slurm 17.11.2: Socket timed out on send/recv operation
|> > > > > > > > >>
|> > > > > > > > >> Ciao Alessandro,
|> > > > > > > > >>
|> > > > > > > > >>> Do we have to apply any particular setting to avoid incurring the problem?
|> > > > > > > > >>
|> > > > > > > > >> What is your "MessageTimeout" value in slurm.conf? If it's at the default of 10, try changing it to 20.
|> > > > > > > > >>
|> > > > > > > > >> I'd also check and see if the slurmctld log is reporting anything pertaining to the server thread count being over its limit.
|> > > > > > > > >>
|> > > > > > > > >> HTH,
|> > > > > > > > >> John DeSantis
|> > > > > > > > >>
|> > > > > > > > >> On Fri, 12 Jan 2018 11:32:57 +0100 Alessandro Federico <a.feder...@cineca.it> wrote:
|> > > > > > > > >>
|> > > > > > > > >>> Hi all,
|> > > > > > > > >>>
|> > > > > > > > >>> we are setting up SLURM 17.11.2 on a small test cluster of about 100 nodes.
|> > > > > > > > >>> Sometimes we get the error in the subject when running any SLURM command (e.g. sinfo, squeue, scontrol reconf, etc.).
|> > > > > > > > >>>
|> > > > > > > > >>> Do we have to apply any particular setting to avoid incurring the problem?
|> > > > > > > > >>>
|> > > > > > > > >>> We found this bug report https://bugs.schedmd.com/show_bug.cgi?id=4002 but it refers to the previous SLURM version and we do not set debug3 on slurmctld.
|> > > > > > > > >>>
|> > > > > > > > >>> thanks in advance
|> > > > > > > > >>> ale
|
| --
| Alessandro Federico
| HPC System Management Group
| System & Technology Department
| CINECA www.cineca.it
| Via dei Tizii 6, 00185 Rome - Italy
| phone: +39 06 44486708
|
| All work and no play makes Jack a dull boy.
| All work and no play makes Jack a dull boy.
| All work and no play makes Jack...

--
James A. Peltier
Manager - Research Computing Group | IT Services
Simon Fraser University | ASB10973 8888 University Dr., Burnaby, B.C. V5A 1S6
T: 778.782.6573 | M: 604.365.6432 | sfu.ca/itservices
Twitter: @sfu_it
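As a closing note on Trevor's sdiag suggestion earlier in the thread, here is a minimal sketch of feeding a couple of sdiag counters into Graphite from cron. The host name, metric prefix and netcat flags are illustrative only; Giovanni Torres' post linked above describes a fuller setup:

    #!/bin/bash
    # push two sdiag counters to Graphite's plaintext listener (port 2003) each time cron runs this
    GRAPHITE_HOST=graphite.example.com    # placeholder host
    NOW=$(date +%s)
    sdiag | awk -v ts="$NOW" '
        /Server thread count/ { print "slurm.sdiag.server_thread_count " $NF " " ts }
        /Agent queue size/    { print "slurm.sdiag.agent_queue_size "   $NF " " ts }
    ' | nc -w 1 "$GRAPHITE_HOST" 2003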