[Freeipa-users] Re: Join command 500 errors, timeouts

Alfred Victor via FreeIPA-users Thu, 10 Jun 2021 07:47:06 -0700

Hi all,

Just curious if anyone has suggestions about that please before I get going
in a couple of hours with conversions to IPA again? I did the math and
1,097,471 log messages in 5 hours is about 60 times per second, so I'm
gradually becoming more certain this is why we can only boot 20-30 nodes at
a time when we used to boot hundreds. However, this is still just a guess
as I don't know the mechanism behind why this interferes with IPA joins,
some bottleneck with the KDC?


Alfred

On Wed, Jun 9, 2021 at 3:49 PM Alfred Victor <alvic...@gmail.com> wrote:

> Hi Rob,
>
> I have reduced that timeout and will tune it further. Regarding ISE
> errors, I think we can make the assumption that this is entirely an issue
> of the web timeouts, I haven't seen any evidence otherwise and will have
> another attempt at converting nodes tomorrow, and with a keener eye of what
> to look for I can make a better determination then. I am most concerned
> over what the underlying cause might be causing it to take too long and hit
> the timeout, and don't want to engineer around this by changing Apache
> timeouts if we can instead address the root cause. I am suspicious of krb
> log messages flooding our IPA systems about a service principal like so,
> but not sure if this is gumming up the works (or even why this message has
> started appearing since the rebuild):
>
> ./krb5kdc.log:Jun 09 10:38:51 redacted.redacted.com krb5kdc[31187](info): 
> TGS_REQ (4 etypes {18 17 16 23}) 10.1.1.27: LOOKING_UP_SERVER: authtime 0,  
> host/redacted.redacted....@redacted.com for 
> nfs/nfsserver.redacted....@redacted.com, Server not found in Kerberos database
>
>
> Just in the last 5 hours alone, this log message and others like it (main 
> difference is just the nodename it references) has appeared 1,097,471 times. 
> Conceivably there is also some log write locking or something going on that 
> could be slowing IPA down and leading to our symptom here?
>
> Alfred
>
>
> On Wed, Jun 9, 2021 at 9:19 AM Rob Crittenden <rcrit...@redhat.com> wrote:
>
>> Alfred Victor wrote:
>> > Hi Rob,
>> >
>> > We did revert to 60s - I seem to remember some ldapsearch timing out
>> > previously but maybe we could still greatly reduce this with no ill
>> > effect. However, we saw no change in join success either way and I have
>> > not changed anything in Apache as I would need to find the exact values
>> > in question and I think these directive changes may get lost with an
>> > update? The timeout/ISE issues are new - we previously booted many nodes
>> > concurrently (400+) without problems but now it happens even booting as
>> > few as 50 results in 10-20 timeouts or ISE. This has been the case at
>> > least since we rebuilt and re-replicated the environment by swapping IPA
>> > members out of the old one, but maybe this is unrelated. Is it possible
>> > we landed on a newer version with different timeout values? Is it
>> > possible that IPA is under slightly more load from the higher number of
>> > nodes and has some bottleneck since we last did conversions? We have not
>> > been able to substantiate the latter by looking at CPU/memory/io trends.
>> > What might we investigate next to see if we may have missed some ongoing
>> > issue that could for instance cause locking problems or something else
>> > internally in IPA to explain our symptoms? Also, I have observed some
>> > nodes which report a successful installation of IPA client, but have in
>> > fact a lot of failures, for instance with mounts not working from
>> > automount setup. We will need to try to reproduce this to understand
>> > what happened I think as we have already sorted those nodes elsewhere
>> > and they have got going outside of IPA.
>> >
>> > I am interested that you suggested "at least some of the clients aren't
>> > connecting at all and increasing the timeout could make this worse" -
>> > this might indicate some sort of network problem at play, but as far as
>> > I am aware, everything is working absent IPA so I do not suspect it
>> > presently.
>>
>> A 60 second LDAP timeout is still way too big. The default is 2. Unless
>>  you are seeing timeouts I'd suggest lowering it.
>>
>> I'm only seeing hints to what you're seeing and the scope so it's hard
>> to make further suggestions. Are all the internal errors the same, for
>> example? Are some failing due to LDAP timeouts or are they all wsgi read
>> timeouts?
>>
>> The Apache request timeout can be configured in httpd.conf which is
>> independent of the config files that IPA currently writes so it should
>> survive upgrades.
>>
>> rob
>> >
>> > Alfred
>> >
>> > On Mon, Jun 7, 2021 at 4:42 PM Rob Crittenden <rcrit...@redhat.com
>> > <mailto:rcrit...@redhat.com>> wrote:
>> >
>> >     Alfred Victor wrote:
>> >     > Actually, no change happened from 300-> 600 timeout, the web
>> portal
>> >     > itself gave me an ISE I hadn't noticed when I tried clicking save!
>> >
>> >     I wasn't clear which log to look in. You'll see details about where
>> the
>> >      error is caught in IPA in the Apache log. To see LDAP timeouts you
>> look
>> >     for err=3 in the 389-ds access log.
>> >
>> >     But since you posted a traceback this time it's clear that this is
>> just
>> >     Apache waiting for client data to read so any tuning you do needs
>> to be
>> >     in Apache. You could try tuning Timeout which defaults to 60 but
>> this
>> >     doesn't seem likely to help since at least some of the clients
>> aren't
>> >     connecting at all and increasing the timeout could make this worse.
>> >
>> >     Please revert the searchtimeout. 300 seconds is orders of magnitude
>> >     too big.
>> >
>> >     rob
>> >
>> >     >
>> >     > Alfred
>> >     >
>> >     > On Mon, Jun 7, 2021 at 3:57 PM Alfred Victor <alvic...@gmail.com
>> >     <mailto:alvic...@gmail.com>
>> >     > <mailto:alvic...@gmail.com <mailto:alvic...@gmail.com>>> wrote:
>> >     >
>> >     >     Hi FreeIPA list,
>> >     >
>> >     >     I don't see any in error log that match `grep -i "err=3"
>> >     >     /var/log/httpd/error_log`. We have tried raising
>> >     searchtimelimit as
>> >     >     high as 120, then 300 (now are trying 600) but observed no
>> >     >     difference in the rate at which nodes succeeded or failed in
>> IPA
>> >     >     joins. We are somewhat puzzled by this, as none of the other
>> >     values
>> >     >     we are aware of might have changed, though it is possible
>> that the
>> >     >     IPA systems are under a little higher demand from client
>> >     systems, we
>> >     >     have tried to mitigate this by shutting down some workflows
>> and
>> >     >     aren't sure whether we've seen any improvement. Short of
>> adjusting
>> >     >     apache/resource/process timeouts it is difficult to say what
>> might
>> >     >     be wrong. To give an example, out of 250 nodes rebooted, only
>> 112
>> >     >     joined IPA successfully. Here is some output from the error
>> log,
>> >     >     following what this looks like in the ipa client install log
>> >     (error
>> >     >     log output will match the node attempt):
>> >     >
>> >     >
>> >     >     2021-06-07T18:25:30Z DEBUG The ipa-client-install command
>> >     failed, exception: NetworkError: cannot connect to
>> >     'https://redactednode.com/ipa/json
>> >     <https://hauth0004.dug.com/ipa/json>': Internal Server Error
>> >     >
>> >     >     [Mon Jun 07 13:25:06.198259 2021] [core:error] [pid 25020]
>> >     [client 10.1.24.48:47808 <http://10.1.24.48:47808>
>> >     <http://10.1.24.48:47808>] Script timed out before returning
>> >     headers: wsgi.py, referer: https://redacted.redacted.com/ipa/xml
>> >     <https://hauth0004.dug.com/ipa/xml>
>> >     >
>> >     >     Different node, same time period:
>> >     >
>> >     >
>> >     >     [Mon Jun 07 13:24:02.178092 2021] [:error] [pid 25725] ipa:
>> >     INFO: [xmlserver] mach_j...@redacted.com: join(u'redacted.node.com
>> >     <http://redacted.node.com> <http://redacted.node.com>',
>> >     nshardwareplatform=u'x86_64',
>> >     nsosversion=u'3.10.0-1062.18.1.1.el7.redacted.x86_64',
>> >     version=u'2.51'): TimeLimitExceeded
>> >     >
>> >     >     I also saw this:
>> >     >
>> >     >     [Mon Jun 07 13:25:07.103503 2021] [:error] [pid 25725] ipa:
>> >     ERROR: non-public: IOError: request data read error
>> >     >     [Mon Jun 07 13:25:07.103529 2021] [:error] [pid 25725]
>> >     Traceback (most recent call last):
>> >     >     [Mon Jun 07 13:25:07.103536 2021] [:error] [pid 25725]   File
>> >     "/usr/lib/python2.7/site-packages/ipaserver/rpcserver.py", line 360,
>> >     in wsgi_execute
>> >     >     [Mon Jun 07 13:25:07.103542 2021] [:error] [pid 25725]
>> >      data = read_input(environ)
>> >     >     [Mon Jun 07 13:25:07.103548 2021] [:error] [pid 25725]   File
>> >     "/usr/lib/python2.7/site-packages/ipaserver/rpcserver.py", line 200,
>> >     in read_input
>> >     >     [Mon Jun 07 13:25:07.103553 2021] [:error] [pid 25725]
>> >      return environ['wsgi.input'].read(length).decode('utf-8')
>> >     >     [Mon Jun 07 13:25:07.103559 2021] [:error] [pid 25725]
>> >     IOError: request data read error
>> >     >     [Mon Jun 07 13:25:07.103826 2021] [:error] [pid 25725] ipa:
>> >     INFO: [xmlserver] mach_j...@redacted.com: None: InternalError
>> >     >     [Mon Jun 07 13:25:07.149962 2021] [:error] [pid 25726] ipa:
>> >     ERROR: non-public: IOError: request data read error
>> >     >     [Mon Jun 07 13:25:07.149984 2021] [:error] [pid 25726]
>> >     Traceback (most recent call last):
>> >     >     [Mon Jun 07 13:25:07.149991 2021] [:error] [pid 25726]   File
>> >     "/usr/lib/python2.7/site-packages/ipaserver/rpcserver.py", line 360,
>> >     in wsgi_execute
>> >     >     [Mon Jun 07 13:25:07.149997 2021] [:error] [pid 25726]
>> >      data = read_input(environ)
>> >     >     [Mon Jun 07 13:25:07.150002 2021] [:error] [pid 25726]   File
>> >     "/usr/lib/python2.7/site-packages/ipaserver/rpcserver.py", line 200,
>> >     in read_input
>> >     >     [Mon Jun 07 13:25:07.150008 2021] [:error] [pid 25726]
>> >      return environ['wsgi.input'].read(length).decode('utf-8')
>> >     >     [Mon Jun 07 13:25:07.150013 2021] [:error] [pid 25726]
>> >     IOError: request data read error
>> >     >
>> >     >
>> >     >
>> >     >     After setting the timeout to 600 and rebooting the remaining
>> >     139 nodes from the initial set of 250, 83 joined of the 139 and we
>> >     still had ISE occurring. In some cases, it would ISE on the first
>> >     attempt, try another IPA system, and succeed. I'm not sure that even
>> >     such a long timeout as 600 has helped.
>> >     >
>> >     >     Alfred
>> >     >
>> >     >
>> >     >
>> >     >
>> >     >     On Thu, Jun 3, 2021 at 7:51 PM Rob Crittenden
>> >     <rcrit...@redhat.com <mailto:rcrit...@redhat.com>
>> >     >     <mailto:rcrit...@redhat.com <mailto:rcrit...@redhat.com>>>
>> wrote:
>> >     >
>> >     >         Alfred Victor via FreeIPA-users wrote:
>> >     >         > Hi FreeIPA list,
>> >     >         >
>> >     >         > We are having an issue with our IPA environment of 4
>> >     >         replicated FreeIPA
>> >     >         > systems serving linux compute clients which join from a
>> >     command in
>> >     >         > rc.local after boot. This worked in the past, but the
>> system
>> >     >         has been
>> >     >         > rebuilt since and the join command changed
>> >     slightly. Unfortunately
>> >     >         > booting a few dozen nodes at a time, though they each
>> >     talk to a
>> >     >         > different IPA system by design, leads to problems such
>> as
>> >     >         these - though
>> >     >         > 40-100 nodes can boot ok at a time there are always many
>> >     >         stragglers, and
>> >     >         > the more we attempt to boot at once the more fail to
>> >     join IPA
>> >     >         (if we try
>> >     >         > to boot 500 nodes, we are lucky if we get a fifth of
>> that
>> >     >         joining IPA).
>> >     >         > Can you please advise on this output? Here is our join
>> >     command in
>> >     >         > compute node rc.local:
>> >     >         >
>> >     >         >
>> >     >         >
>> >     >         >     ipa-client-install -U -q -p mach_join \
>> >     >         >     -w <redacted> \
>> >     >         >     --force-join \
>> >     >         >     --no-dns-sshfp \
>> >     >         >     --automount-location=redacted-node
>> >     >         >
>> >     >         >
>> >     >         >
>> >     >         >
>> >     >         > And here is some log output of the 500 error:
>> >     >         >
>> >     >         >     ProtocolError: <ProtocolError for
>> >     >         redacted.redacted.com/ipa/json
>> >     <http://redacted.redacted.com/ipa/json>
>> >     >         <http://redacted.redacted.com/ipa/json>
>> >     >         <http://redacted.redacted.com/ipa/json>: 500 Internal
>> >     Server Error>
>> >     >         >     Cannot connect to the server due to generic error:
>> >     cannot
>> >     >         connect to 'https://redacted.redacted.com/ipa/json
>> >     >         <https://hauth0003.dug.com/ipa/json>': Internal Server
>> Error
>> >     >         >
>> >     >         >
>> >     >         > As well as:
>> >     >         >
>> >     >         >     2021-06-02T21:39:11Z DEBUG Starting external process
>> >     >         >     2021-06-02T21:39:11Z DEBUG args=/usr/sbin/ipa-join
>> -s
>> >     >         >     redacted.redacted.com <http://redacted.redacted.com
>> >
>> >     <http://redacted.redacted.com>
>> >     >         <http://redacted.redacted.com> -b
>> >     >         >     dc=redacted,dc=com -h redactednode.redacted.com
>> >     <http://redactednode.redacted.com>
>> >     >         <http://redactednode.redacted.com>
>> >     >         >     <http://redactednode.redacted.com> -f
>> >     >         >     2021-06-02T21:40:13Z DEBUG Process finished, return
>> >     code=17
>> >     >         >     2021-06-02T21:40:13Z DEBUG stdout=
>> >     >         >     2021-06-02T21:40:13Z DEBUG stderr=HTTP response
>> code is
>> >     >         500, not 200
>> >     >         >     2021-06-02T21:40:13Z ERROR Joining realm failed:
>> HTTP
>> >     >         response code
>> >     >         >     is 500, not 200
>> >     >         >
>> >     >         > And we also see timeouts happen:
>> >     >         >
>> >     >         >
>> >     >         >     2021-06-02T22:08:50Z DEBUG args=/usr/sbin/ipa-join
>> -s
>> >     >         >     redacted.redacted.com <http://redacted.redacted.com
>> >
>> >     <http://redacted.redacted.com>
>> >     >         <http://redacted.redacted.com> -b
>> >     >         >     dc=redacted,dc=com -h redactednode.redacted.com
>> >     <http://redactednode.redacted.com>
>> >     >         <http://redactednode.redacted.com>
>> >     >         >     <http://redactednode.redacted.com> -f
>> >     >         >     2021-06-02T22:09:01Z DEBUG Process finished, return
>> >     code=17
>> >     >         >     2021-06-02T22:09:01Z DEBUG stdout=
>> >     >         >     2021-06-02T22:09:01Z DEBUG stderr=RPC failed at
>> server.
>> >     >         Configured
>> >     >         >     time limit exceeded
>> >     >         >     2021-06-02T22:09:01Z ERROR Joining realm failed: RPC
>> >     failed at
>> >     >         >     server. Configured time limit exceeded
>> >     >         >
>> >     >         >
>> >     >         > And we also see later timeouts near the end of the log
>> >     in some
>> >     >         cases though are able to authenticate and it didn't back
>> >     out the
>> >     >         install, but never got going healthy either:
>> >     >         >
>> >     >         >
>> >     >         >
>> >     >         >     2021-06-03T19:20:13Z DEBUG The ipa-client-install
>> >     command
>> >     >         failed,
>> >     >         >     exception: TimeLimitExceeded: Configured time limit
>> >     exceeded
>> >     >
>> >     >         When you see Internal Error look to the Apache error log
>> >     on the
>> >     >         server
>> >     >         for more information.
>> >     >
>> >     >         In this case an LDAP search is failing because the server
>> >     is too
>> >     >         busy.
>> >     >         Look for queries failing with err=3 to get an idea of how
>> long
>> >     >         it is taking.
>> >     >
>> >     >         To increase the timeout use: ipa config-mod
>> >     --searchtimelimit=INT
>> >     >
>> >     >         The default is 2 seconds.
>> >     >
>> >     >         You can pick a time at random but could see failures
>> again.
>> >     >
>> >     >         rob
>> >     >
>> >
>>
>>

_______________________________________________
FreeIPA-users mailing list -- freeipa-users@lists.fedorahosted.org
To unsubscribe send an email to freeipa-users-le...@lists.fedorahosted.org
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedorahosted.org/archives/list/freeipa-users@lists.fedorahosted.org
Do not reply to spam on the list, report it: 
https://pagure.io/fedora-infrastructure

[Freeipa-users] Re: Join command 500 errors, timeouts

Reply via email to