Hi all, Just curious if anyone has suggestions about that please before I get going in a couple of hours with conversions to IPA again? I did the math and 1,097,471 log messages in 5 hours is about 60 times per second, so I'm gradually becoming more certain this is why we can only boot 20-30 nodes at a time when we used to boot hundreds. However, this is still just a guess as I don't know the mechanism behind why this interferes with IPA joins, some bottleneck with the KDC?
Alfred On Wed, Jun 9, 2021 at 3:49 PM Alfred Victor <alvic...@gmail.com> wrote: > Hi Rob, > > I have reduced that timeout and will tune it further. Regarding ISE > errors, I think we can make the assumption that this is entirely an issue > of the web timeouts, I haven't seen any evidence otherwise and will have > another attempt at converting nodes tomorrow, and with a keener eye of what > to look for I can make a better determination then. I am most concerned > over what the underlying cause might be causing it to take too long and hit > the timeout, and don't want to engineer around this by changing Apache > timeouts if we can instead address the root cause. I am suspicious of krb > log messages flooding our IPA systems about a service principal like so, > but not sure if this is gumming up the works (or even why this message has > started appearing since the rebuild): > > ./krb5kdc.log:Jun 09 10:38:51 redacted.redacted.com krb5kdc[31187](info): > TGS_REQ (4 etypes {18 17 16 23}) 10.1.1.27: LOOKING_UP_SERVER: authtime 0, > host/redacted.redacted....@redacted.com for > nfs/nfsserver.redacted....@redacted.com, Server not found in Kerberos database > > > Just in the last 5 hours alone, this log message and others like it (main > difference is just the nodename it references) has appeared 1,097,471 times. > Conceivably there is also some log write locking or something going on that > could be slowing IPA down and leading to our symptom here? > > Alfred > > > On Wed, Jun 9, 2021 at 9:19 AM Rob Crittenden <rcrit...@redhat.com> wrote: > >> Alfred Victor wrote: >> > Hi Rob, >> > >> > We did revert to 60s - I seem to remember some ldapsearch timing out >> > previously but maybe we could still greatly reduce this with no ill >> > effect. However, we saw no change in join success either way and I have >> > not changed anything in Apache as I would need to find the exact values >> > in question and I think these directive changes may get lost with an >> > update? The timeout/ISE issues are new - we previously booted many nodes >> > concurrently (400+) without problems but now it happens even booting as >> > few as 50 results in 10-20 timeouts or ISE. This has been the case at >> > least since we rebuilt and re-replicated the environment by swapping IPA >> > members out of the old one, but maybe this is unrelated. Is it possible >> > we landed on a newer version with different timeout values? Is it >> > possible that IPA is under slightly more load from the higher number of >> > nodes and has some bottleneck since we last did conversions? We have not >> > been able to substantiate the latter by looking at CPU/memory/io trends. >> > What might we investigate next to see if we may have missed some ongoing >> > issue that could for instance cause locking problems or something else >> > internally in IPA to explain our symptoms? Also, I have observed some >> > nodes which report a successful installation of IPA client, but have in >> > fact a lot of failures, for instance with mounts not working from >> > automount setup. We will need to try to reproduce this to understand >> > what happened I think as we have already sorted those nodes elsewhere >> > and they have got going outside of IPA. >> > >> > I am interested that you suggested "at least some of the clients aren't >> > connecting at all and increasing the timeout could make this worse" - >> > this might indicate some sort of network problem at play, but as far as >> > I am aware, everything is working absent IPA so I do not suspect it >> > presently. >> >> A 60 second LDAP timeout is still way too big. The default is 2. Unless >> you are seeing timeouts I'd suggest lowering it. >> >> I'm only seeing hints to what you're seeing and the scope so it's hard >> to make further suggestions. Are all the internal errors the same, for >> example? Are some failing due to LDAP timeouts or are they all wsgi read >> timeouts? >> >> The Apache request timeout can be configured in httpd.conf which is >> independent of the config files that IPA currently writes so it should >> survive upgrades. >> >> rob >> > >> > Alfred >> > >> > On Mon, Jun 7, 2021 at 4:42 PM Rob Crittenden <rcrit...@redhat.com >> > <mailto:rcrit...@redhat.com>> wrote: >> > >> > Alfred Victor wrote: >> > > Actually, no change happened from 300-> 600 timeout, the web >> portal >> > > itself gave me an ISE I hadn't noticed when I tried clicking save! >> > >> > I wasn't clear which log to look in. You'll see details about where >> the >> > error is caught in IPA in the Apache log. To see LDAP timeouts you >> look >> > for err=3 in the 389-ds access log. >> > >> > But since you posted a traceback this time it's clear that this is >> just >> > Apache waiting for client data to read so any tuning you do needs >> to be >> > in Apache. You could try tuning Timeout which defaults to 60 but >> this >> > doesn't seem likely to help since at least some of the clients >> aren't >> > connecting at all and increasing the timeout could make this worse. >> > >> > Please revert the searchtimeout. 300 seconds is orders of magnitude >> > too big. >> > >> > rob >> > >> > > >> > > Alfred >> > > >> > > On Mon, Jun 7, 2021 at 3:57 PM Alfred Victor <alvic...@gmail.com >> > <mailto:alvic...@gmail.com> >> > > <mailto:alvic...@gmail.com <mailto:alvic...@gmail.com>>> wrote: >> > > >> > > Hi FreeIPA list, >> > > >> > > I don't see any in error log that match `grep -i "err=3" >> > > /var/log/httpd/error_log`. We have tried raising >> > searchtimelimit as >> > > high as 120, then 300 (now are trying 600) but observed no >> > > difference in the rate at which nodes succeeded or failed in >> IPA >> > > joins. We are somewhat puzzled by this, as none of the other >> > values >> > > we are aware of might have changed, though it is possible >> that the >> > > IPA systems are under a little higher demand from client >> > systems, we >> > > have tried to mitigate this by shutting down some workflows >> and >> > > aren't sure whether we've seen any improvement. Short of >> adjusting >> > > apache/resource/process timeouts it is difficult to say what >> might >> > > be wrong. To give an example, out of 250 nodes rebooted, only >> 112 >> > > joined IPA successfully. Here is some output from the error >> log, >> > > following what this looks like in the ipa client install log >> > (error >> > > log output will match the node attempt): >> > > >> > > >> > > 2021-06-07T18:25:30Z DEBUG The ipa-client-install command >> > failed, exception: NetworkError: cannot connect to >> > 'https://redactednode.com/ipa/json >> > <https://hauth0004.dug.com/ipa/json>': Internal Server Error >> > > >> > > [Mon Jun 07 13:25:06.198259 2021] [core:error] [pid 25020] >> > [client 10.1.24.48:47808 <http://10.1.24.48:47808> >> > <http://10.1.24.48:47808>] Script timed out before returning >> > headers: wsgi.py, referer: https://redacted.redacted.com/ipa/xml >> > <https://hauth0004.dug.com/ipa/xml> >> > > >> > > Different node, same time period: >> > > >> > > >> > > [Mon Jun 07 13:24:02.178092 2021] [:error] [pid 25725] ipa: >> > INFO: [xmlserver] mach_j...@redacted.com: join(u'redacted.node.com >> > <http://redacted.node.com> <http://redacted.node.com>', >> > nshardwareplatform=u'x86_64', >> > nsosversion=u'3.10.0-1062.18.1.1.el7.redacted.x86_64', >> > version=u'2.51'): TimeLimitExceeded >> > > >> > > I also saw this: >> > > >> > > [Mon Jun 07 13:25:07.103503 2021] [:error] [pid 25725] ipa: >> > ERROR: non-public: IOError: request data read error >> > > [Mon Jun 07 13:25:07.103529 2021] [:error] [pid 25725] >> > Traceback (most recent call last): >> > > [Mon Jun 07 13:25:07.103536 2021] [:error] [pid 25725] File >> > "/usr/lib/python2.7/site-packages/ipaserver/rpcserver.py", line 360, >> > in wsgi_execute >> > > [Mon Jun 07 13:25:07.103542 2021] [:error] [pid 25725] >> > data = read_input(environ) >> > > [Mon Jun 07 13:25:07.103548 2021] [:error] [pid 25725] File >> > "/usr/lib/python2.7/site-packages/ipaserver/rpcserver.py", line 200, >> > in read_input >> > > [Mon Jun 07 13:25:07.103553 2021] [:error] [pid 25725] >> > return environ['wsgi.input'].read(length).decode('utf-8') >> > > [Mon Jun 07 13:25:07.103559 2021] [:error] [pid 25725] >> > IOError: request data read error >> > > [Mon Jun 07 13:25:07.103826 2021] [:error] [pid 25725] ipa: >> > INFO: [xmlserver] mach_j...@redacted.com: None: InternalError >> > > [Mon Jun 07 13:25:07.149962 2021] [:error] [pid 25726] ipa: >> > ERROR: non-public: IOError: request data read error >> > > [Mon Jun 07 13:25:07.149984 2021] [:error] [pid 25726] >> > Traceback (most recent call last): >> > > [Mon Jun 07 13:25:07.149991 2021] [:error] [pid 25726] File >> > "/usr/lib/python2.7/site-packages/ipaserver/rpcserver.py", line 360, >> > in wsgi_execute >> > > [Mon Jun 07 13:25:07.149997 2021] [:error] [pid 25726] >> > data = read_input(environ) >> > > [Mon Jun 07 13:25:07.150002 2021] [:error] [pid 25726] File >> > "/usr/lib/python2.7/site-packages/ipaserver/rpcserver.py", line 200, >> > in read_input >> > > [Mon Jun 07 13:25:07.150008 2021] [:error] [pid 25726] >> > return environ['wsgi.input'].read(length).decode('utf-8') >> > > [Mon Jun 07 13:25:07.150013 2021] [:error] [pid 25726] >> > IOError: request data read error >> > > >> > > >> > > >> > > After setting the timeout to 600 and rebooting the remaining >> > 139 nodes from the initial set of 250, 83 joined of the 139 and we >> > still had ISE occurring. In some cases, it would ISE on the first >> > attempt, try another IPA system, and succeed. I'm not sure that even >> > such a long timeout as 600 has helped. >> > > >> > > Alfred >> > > >> > > >> > > >> > > >> > > On Thu, Jun 3, 2021 at 7:51 PM Rob Crittenden >> > <rcrit...@redhat.com <mailto:rcrit...@redhat.com> >> > > <mailto:rcrit...@redhat.com <mailto:rcrit...@redhat.com>>> >> wrote: >> > > >> > > Alfred Victor via FreeIPA-users wrote: >> > > > Hi FreeIPA list, >> > > > >> > > > We are having an issue with our IPA environment of 4 >> > > replicated FreeIPA >> > > > systems serving linux compute clients which join from a >> > command in >> > > > rc.local after boot. This worked in the past, but the >> system >> > > has been >> > > > rebuilt since and the join command changed >> > slightly. Unfortunately >> > > > booting a few dozen nodes at a time, though they each >> > talk to a >> > > > different IPA system by design, leads to problems such >> as >> > > these - though >> > > > 40-100 nodes can boot ok at a time there are always many >> > > stragglers, and >> > > > the more we attempt to boot at once the more fail to >> > join IPA >> > > (if we try >> > > > to boot 500 nodes, we are lucky if we get a fifth of >> that >> > > joining IPA). >> > > > Can you please advise on this output? Here is our join >> > command in >> > > > compute node rc.local: >> > > > >> > > > >> > > > >> > > > ipa-client-install -U -q -p mach_join \ >> > > > -w <redacted> \ >> > > > --force-join \ >> > > > --no-dns-sshfp \ >> > > > --automount-location=redacted-node >> > > > >> > > > >> > > > >> > > > >> > > > And here is some log output of the 500 error: >> > > > >> > > > ProtocolError: <ProtocolError for >> > > redacted.redacted.com/ipa/json >> > <http://redacted.redacted.com/ipa/json> >> > > <http://redacted.redacted.com/ipa/json> >> > > <http://redacted.redacted.com/ipa/json>: 500 Internal >> > Server Error> >> > > > Cannot connect to the server due to generic error: >> > cannot >> > > connect to 'https://redacted.redacted.com/ipa/json >> > > <https://hauth0003.dug.com/ipa/json>': Internal Server >> Error >> > > > >> > > > >> > > > As well as: >> > > > >> > > > 2021-06-02T21:39:11Z DEBUG Starting external process >> > > > 2021-06-02T21:39:11Z DEBUG args=/usr/sbin/ipa-join >> -s >> > > > redacted.redacted.com <http://redacted.redacted.com >> > >> > <http://redacted.redacted.com> >> > > <http://redacted.redacted.com> -b >> > > > dc=redacted,dc=com -h redactednode.redacted.com >> > <http://redactednode.redacted.com> >> > > <http://redactednode.redacted.com> >> > > > <http://redactednode.redacted.com> -f >> > > > 2021-06-02T21:40:13Z DEBUG Process finished, return >> > code=17 >> > > > 2021-06-02T21:40:13Z DEBUG stdout= >> > > > 2021-06-02T21:40:13Z DEBUG stderr=HTTP response >> code is >> > > 500, not 200 >> > > > 2021-06-02T21:40:13Z ERROR Joining realm failed: >> HTTP >> > > response code >> > > > is 500, not 200 >> > > > >> > > > And we also see timeouts happen: >> > > > >> > > > >> > > > 2021-06-02T22:08:50Z DEBUG args=/usr/sbin/ipa-join >> -s >> > > > redacted.redacted.com <http://redacted.redacted.com >> > >> > <http://redacted.redacted.com> >> > > <http://redacted.redacted.com> -b >> > > > dc=redacted,dc=com -h redactednode.redacted.com >> > <http://redactednode.redacted.com> >> > > <http://redactednode.redacted.com> >> > > > <http://redactednode.redacted.com> -f >> > > > 2021-06-02T22:09:01Z DEBUG Process finished, return >> > code=17 >> > > > 2021-06-02T22:09:01Z DEBUG stdout= >> > > > 2021-06-02T22:09:01Z DEBUG stderr=RPC failed at >> server. >> > > Configured >> > > > time limit exceeded >> > > > 2021-06-02T22:09:01Z ERROR Joining realm failed: RPC >> > failed at >> > > > server. Configured time limit exceeded >> > > > >> > > > >> > > > And we also see later timeouts near the end of the log >> > in some >> > > cases though are able to authenticate and it didn't back >> > out the >> > > install, but never got going healthy either: >> > > > >> > > > >> > > > >> > > > 2021-06-03T19:20:13Z DEBUG The ipa-client-install >> > command >> > > failed, >> > > > exception: TimeLimitExceeded: Configured time limit >> > exceeded >> > > >> > > When you see Internal Error look to the Apache error log >> > on the >> > > server >> > > for more information. >> > > >> > > In this case an LDAP search is failing because the server >> > is too >> > > busy. >> > > Look for queries failing with err=3 to get an idea of how >> long >> > > it is taking. >> > > >> > > To increase the timeout use: ipa config-mod >> > --searchtimelimit=INT >> > > >> > > The default is 2 seconds. >> > > >> > > You can pick a time at random but could see failures >> again. >> > > >> > > rob >> > > >> > >> >>
_______________________________________________ FreeIPA-users mailing list -- freeipa-users@lists.fedorahosted.org To unsubscribe send an email to freeipa-users-le...@lists.fedorahosted.org Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/ List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines List Archives: https://lists.fedorahosted.org/archives/list/freeipa-users@lists.fedorahosted.org Do not reply to spam on the list, report it: https://pagure.io/fedora-infrastructure