[Freeipa-users] Re: Join command 500 errors, timeouts

Rob Crittenden via FreeIPA-users Tue, 15 Jun 2021 13:19:30 -0700

Alfred Victor wrote:
> I don't see a directive equivalent of SECURE_NFS to add to nfs.conf (all
> documentation seems to still refer to the sysconfig path), or is it the
> same? Can I just disable rpcgssd? We have no nfs mounts which are
> kerberized yet, and disabling rpcgssd seems to solve our problem, and I
> can kinit after disabling rpcgssd. It also does not seem that disabling
> rpcgssd hurts running tasks or node, but would like to confirm it's
> limited to nfs in function. I still have to wonder if there's a better way.


I suspect that SECURE_NFS is a no-op these days. It was necessary in
RHEL 5/6 for sure.

Yes, I think you can safely disable the service. In some versions of
RHEL/Fedora IIRC this is symlinked to nfs-secure-server so mask that as
well.

rob

> 
> Alfred
> 
> On Tue, Jun 15, 2021 at 10:31 AM Alexander Bokovoy <aboko...@redhat.com> wrote:
> 
>     On Tue, 15 Jun 2021, Alfred Victor via FreeIPA-users wrote:
>     >Hi Rob,
>     >
>     >We attempted setting sec=sys on the mount, but to our surprise this
>     >didn't work. We then figured out that the IPA install adds this to
>     >/etc/sysconfig/nfs:
>     >
>     >SECURE_NFS=yes
>     >
>     >
>     >We tried removing this, to no avail, and restarted all the related
>     >systemd units (rpcgssd, nfs, etc). Any idea why sec=sys is being
>     >ignored? Do we need to set SECURE_NFS=no? On non-IPA nodes this
>     >directive does not exist at all. For now, I have also totally disabled
>     >rpcgssd, as I think this unit may be responsible (it seems that it does
>     >the upcall in https://access.redhat.com/solutions/225783), so I hope
>     >this solves it. I don't believe anything else depends on rpcgssd, but
>     >will soon find out. Any suggestions please? :)
> 
>     Depends on nfs-utils version? I remember there had been a change in
>     configuration in upstream nfs-utils in 2019:
> 
>     commit c69875c8afdd877baf7139c0cd5241f70105cbd4
>     Author: François Cami <fc...@redhat.com>
>     Date:   Tue Feb 26 13:59:06 2019 +0100
> 
>          ipa-client-automount: handle NFS configuration file changes
> 
>          nfs-utils in Fedora 30 and later switched its configuration
>          file from /etc/sysconfig/nfs to /etc/nfs.conf, providing a
>          conversion service (nfs-convert.service) for upgrades.
>          However, for new installs the original configuration file
>          is missing. This change:
>          * adds a tuple-based osinfo.version_number method to handle
>            more kinds of OS versioning schemes
>          * detects RHEL and Fedora versions with the new nfs-utils
>            behavior
>          * avoids backing up the new NFS configuration file as we do
>            not have to modify it.
> 
>          See: https://bugzilla.redhat.com/show_bug.cgi?id=1676981
> 
>          Fixes: https://pagure.io/freeipa/issue/7868
> 
> 
>     >
>     >Alfred
>     >
>     >On Thu, Jun 10, 2021 at 2:17 PM Rob Crittenden <rcrit...@redhat.com> wrote:
>     >
>     >> Alfred Victor wrote:
>     >> > Thanks very much Rob et al! I believe we have found our root cause
>     >> > and the fix. If you like I'll provide some more details after we're
>     >> > done with everything.
>     >>
>     >> Yes, knowing the cause would be great and could be helpful to others!
>     >>
>     >> cheers
>     >>
>     >> rob
>     >>
>     >> >
>     >> > Alfred
>     >> >
>     >> > On Thu, Jun 10, 2021 at 11:02 AM Rob Crittenden <rcrit...@redhat.com> wrote:
>     >> >
>     >> >     Alfred Victor wrote:
>     >> >     > Hi all,
>     >> >     >
>     >> >     > Just curious if anyone has suggestions about that, please, before
>     >> >     > I get going in a couple of hours with conversions to IPA again? I
>     >> >     > did the math, and 1,097,471 log messages in 5 hours is about 60 per
>     >> >     > second, so I'm gradually becoming more certain this is why we can
>     >> >     > only boot 20-30 nodes at a time when we used to boot hundreds.
>     >> >     > However, this is still just a guess, as I don't know the mechanism
>     >> >     > by which this would interfere with IPA joins. Some bottleneck with
>     >> >     > the KDC?
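The rate quoted above can be checked with a couple of lines of arithmetic. This is only a sketch using the two figures given in the message (1,097,471 messages over 5 hours); the variable names are made up for illustration.

```python
# Figures quoted in the message: message count and observation window.
messages = 1_097_471
hours = 5

# Convert the window to seconds and compute the average rate.
per_second = messages / (hours * 3600)
print(f"{per_second:.1f} KDC log lines per second")
```

This comes out at roughly 61 per second, in line with the "about 60" estimate.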
>     >> >
>     >> >     It sure seems like this is a kerberized NFS request for a host
>     >> >     that doesn't provide it, or doesn't have an nfs principal. I think
>     >> >     you'd need to monitor the offending client to see what it is doing.
>     >> >
>     >> >     If the KDC is being dramatically slowed down by this then yes, you
>     >> >     could see slower Apache performance because it needs to obtain
>     >> >     tickets on behalf of the user doing the join. Whether that would
>     >> >     present itself as a read timeout I don't know.
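One way to find the offending clients is to tally the failing TGS requests in the KDC log by client address and requested service. The following is a minimal sketch, assuming log lines shaped like the krb5kdc.log excerpt quoted in this thread; the function name, the sample hostnames and the EXAMPLE.COM realm are hypothetical.

```python
import re
from collections import Counter

# Matches TGS_REQ lines that failed because the requested service
# principal does not exist in the KDC database.
PATTERN = re.compile(
    r"TGS_REQ .*?\) (?P<ip>\d+\.\d+\.\d+\.\d+): LOOKING_UP_SERVER: "
    r"authtime \d+,\s+(?P<client>\S+) for (?P<service>\S+), "
    r"Server not found in Kerberos database"
)


def tally_missing_services(lines):
    """Count failed lookups per (client IP, requested service principal)."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[(match.group("ip"), match.group("service"))] += 1
    return counts


# Hypothetical sample lines modelled on the excerpt in this thread.
SAMPLE = [
    "Jun 09 10:38:51 ipa1.example.com krb5kdc[31187](info): TGS_REQ "
    "(4 etypes {18 17 16 23}) 10.1.1.27: LOOKING_UP_SERVER: authtime 0, "
    "host/node1.example.com@EXAMPLE.COM for "
    "nfs/nfsserver.example.com@EXAMPLE.COM, "
    "Server not found in Kerberos database",
    "Jun 09 10:38:51 ipa1.example.com krb5kdc[31187](info): TGS_REQ "
    "(4 etypes {18 17 16 23}) 10.1.1.27: LOOKING_UP_SERVER: authtime 0, "
    "host/node1.example.com@EXAMPLE.COM for "
    "nfs/nfsserver.example.com@EXAMPLE.COM, "
    "Server not found in Kerberos database",
    "Jun 09 10:38:52 ipa1.example.com krb5kdc[31187](info): TGS_REQ "
    "(4 etypes {18 17 16 23}) 10.1.1.28: ISSUE: authtime 1623253132, "
    "etypes {rep=18 tkt=18 ses=18}, host/node2.example.com@EXAMPLE.COM "
    "for ldap/ipa1.example.com@EXAMPLE.COM",
]

counts = tally_missing_services(SAMPLE)
for (ip, service), n in counts.most_common():
    print(f"{n:8d}  {ip}  {service}")
```

To run it against a real log, pass an open file object (for example the krb5kdc.log mentioned in this thread) instead of SAMPLE; the clients at the top of the list are the ones to inspect first.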
>     >> >
>     >> >     rob
>     >> >
>     >> >     >
>     >> >     > Alfred
>     >> >     >
>     >> >     > On Wed, Jun 9, 2021 at 3:49 PM Alfred Victor <alvic...@gmail.com> wrote:
>     >> >     >
>     >> >     >     Hi Rob,
>     >> >     >
>     >> >     >     I have reduced that timeout and will tune it further. Regarding
>     >> >     >     ISE errors, I think we can assume this is entirely an issue of
>     >> >     >     the web timeouts. I haven't seen any evidence otherwise. We
>     >> >     >     will make another attempt at converting nodes tomorrow, and with
>     >> >     >     a keener eye for what to look for I can make a better
>     >> >     >     determination then. I am most concerned about what underlying
>     >> >     >     cause is making requests take long enough to hit the timeout,
>     >> >     >     and I don't want to engineer around it by changing Apache
>     >> >     >     timeouts if we can instead address the root cause. I am
>     >> >     >     suspicious of krb log messages flooding our IPA systems about a
>     >> >     >     service principal, like so, but am not sure whether this is
>     >> >     >     gumming up the works (or even why this message has started
>     >> >     >     appearing since the rebuild):
>     >> >     >
>     >> >     >     ./krb5kdc.log:Jun 09 10:38:51 redacted.redacted.com krb5kdc[31187](info): TGS_REQ (4 etypes {18 17 16 23}) 10.1.1.27: LOOKING_UP_SERVER: authtime 0, host/redacted.redacted....@redacted.com for nfs/nfsserver.redacted....@redacted.com, Server not found in Kerberos database
>     >> >     >
>     >> >     >
>     >> >     >     In the last 5 hours alone, this log message and others like it
>     >> >     >     (the main difference is just the nodename it references) have
>     >> >     >     appeared 1,097,471 times. Conceivably there is also some log
>     >> >     >     write locking or something similar going on that could be
>     >> >     >     slowing IPA down and leading to our symptom here?
>     >> >     >
>     >> >     >     Alfred
>     >> >     >
>     >> >     >
>     >> >     >     On Wed, Jun 9, 2021 at 9:19 AM Rob Crittenden <rcrit...@redhat.com> wrote:
>     >> >     >
>     >> >     >         Alfred Victor wrote:
>     >> >     >         > Hi Rob,
>     >> >     >         >
>     >> >     >         > We did revert to 60s. I seem to remember some ldapsearch
>     >> >     >         > timing out previously, but maybe we could still greatly
>     >> >     >         > reduce this with no ill effect. However, we saw no change
>     >> >     >         > in join success either way, and I have not changed
>     >> >     >         > anything in Apache, as I would need to find the exact
>     >> >     >         > values in question and I think these directive changes may
>     >> >     >         > get lost with an update? The timeout/ISE issues are new: we
>     >> >     >         > previously booted many nodes concurrently (400+) without
>     >> >     >         > problems, but now booting as few as 50 results in 10-20
>     >> >     >         > timeouts or ISEs. This has been the case at least since we
>     >> >     >         > rebuilt and re-replicated the environment by swapping IPA
>     >> >     >         > members out of the old one, but maybe this is unrelated. Is
>     >> >     >         > it possible we landed on a newer version with different
>     >> >     >         > timeout values? Is it possible that IPA is under slightly
>     >> >     >         > more load from the higher number of nodes and has hit some
>     >> >     >         > bottleneck since we last did conversions? We have not been
>     >> >     >         > able to substantiate the latter by looking at CPU/memory/IO
>     >> >     >         > trends. What might we investigate next to see if we have
>     >> >     >         > missed some ongoing issue that could, for instance, cause
>     >> >     >         > locking problems or something else internal to IPA that
>     >> >     >         > would explain our symptoms? Also, I have observed some
>     >> >     >         > nodes which report a successful installation of the IPA
>     >> >     >         > client but in fact have a lot of failures, for instance
>     >> >     >         > mounts from the automount setup not working. I think we
>     >> >     >         > will need to reproduce this to understand what happened, as
>     >> >     >         > we have already sorted those nodes out elsewhere and they
>     >> >     >         > are now running outside of IPA.
>     >> >     >         >
>     >> >     >         > I am interested that you suggested "at least some of the
>     >> >     >         > clients aren't connecting at all and increasing the timeout
>     >> >     >         > could make this worse". This might indicate some sort of
>     >> >     >         > network problem at play, but as far as I am aware
>     >> >     >         > everything is working absent IPA, so I do not suspect it
>     >> >     >         > at present.
>     >> >     >
>     >> >     >         A 60 second LDAP timeout is still way too big. The default
>     >> >     >         is 2. Unless you are seeing timeouts I'd suggest lowering it.
>     >> >     >
>     >> >     >         I'm only seeing hints of what you're seeing and the scope,
>     >> >     >         so it's hard to make further suggestions. Are all the
>     >> >     >         internal errors the same, for example? Are some failing due
>     >> >     >         to LDAP timeouts or are they all wsgi read timeouts?
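To answer whether the internal errors are all the same, the Apache error_log can be bucketed by failure signature. This is a rough sketch, assuming the three message fragments seen in the error_log excerpts quoted in this thread; the label names and helper functions are invented for illustration, and other failure types would need their own entries.

```python
from collections import Counter

# Substrings taken from the error_log excerpts quoted in this thread.
# "non-public:" is included so each read error is counted once, not
# again for the final line of its traceback.
SIGNATURES = {
    "wsgi_script_timeout": "Script timed out before returning headers",
    "time_limit_exceeded": "TimeLimitExceeded",
    "request_read_error": "non-public: IOError: request data read error",
}


def classify(line):
    """Return the label of the first matching signature, or None."""
    for label, needle in SIGNATURES.items():
        if needle in line:
            return label
    return None


def summarize(lines):
    """Count how many lines fall into each failure category."""
    return Counter(label for label in map(classify, lines) if label)


# Sample lines abbreviated from the excerpts in this thread.
SAMPLE = [
    "[Mon Jun 07 13:25:06.198259 2021] [core:error] [pid 25020] "
    "[client 10.1.24.48:47808] Script timed out before returning "
    "headers: wsgi.py",
    "[Mon Jun 07 13:24:02.178092 2021] [:error] [pid 25725] ipa: INFO: "
    "[xmlserver] join(...): TimeLimitExceeded",
    "[Mon Jun 07 13:25:07.103503 2021] [:error] [pid 25725] ipa: ERROR: "
    "non-public: IOError: request data read error",
    "[Mon Jun 07 13:25:07.103559 2021] [:error] [pid 25725] "
    "IOError: request data read error",
    "[Mon Jun 07 13:25:07.149962 2021] [:error] [pid 25726] ipa: ERROR: "
    "non-public: IOError: request data read error",
]

summary = summarize(SAMPLE)
for label, n in summary.most_common():
    print(f"{label}: {n}")
```

If one category dominates, that points at where to tune; a mix of read errors and TimeLimitExceeded would suggest more than one bottleneck.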
>     >> >     >
>     >> >     >         The Apache request timeout can be configured in httpd.conf,
>     >> >     >         which is independent of the config files that IPA currently
>     >> >     >         writes, so it should survive upgrades.
>     >> >     >
>     >> >     >         rob
>     >> >     >         >
>     >> >     >         > Alfred
>     >> >     >         >
>     >> >     >         > On Mon, Jun 7, 2021 at 4:42 PM Rob Crittenden <rcrit...@redhat.com> wrote:
>     >> >     >         >
>     >> >     >         >     Alfred Victor wrote:
>     >> >     >         >     > Actually, no change happened from the 300 -> 600
>     >> >     >         >     > timeout; the web portal itself gave me an ISE I hadn't
>     >> >     >         >     > noticed when I tried clicking save!
>     >> >     >         >
>     >> >     >         >     I wasn't clear about which log to look in. You'll see
>     >> >     >         >     details about where the error is caught in IPA in the
>     >> >     >         >     Apache log. To see LDAP timeouts, look for err=3 in the
>     >> >     >         >     389-ds access log.
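LDAP result code 3 is timeLimitExceeded, and it is logged on RESULT lines of the 389-ds access log (typically under /var/log/dirsrv/slapd-INSTANCE/access), not in the Apache error_log. The sketch below assumes the usual `conn=N op=N RESULT err=N` layout; the sample lines are illustrative rather than taken from a real server, and the function name is hypothetical. Parsing the number also avoids a pitfall of a plain substring search for "err=3", which matches err=32 (noSuchObject) as well.

```python
import re

# Capture connection, operation and numeric result code from RESULT lines.
RESULT_RE = re.compile(
    r"conn=(?P<conn>\d+) op=(?P<op>\d+) RESULT err=(?P<err>\d+)"
)


def time_limit_results(lines):
    """Return (conn, op) pairs whose result code is exactly 3."""
    hits = []
    for line in lines:
        match = RESULT_RE.search(line)
        if match and match.group("err") == "3":
            hits.append((int(match.group("conn")), int(match.group("op"))))
    return hits


# Illustrative access-log lines (not from a real server).
SAMPLE = [
    "[07/Jun/2021:13:24:02.178092000 -0500] conn=1021 op=7 RESULT err=3 "
    "tag=101 nentries=0 etime=60.001200000",
    "[07/Jun/2021:13:24:03.000410000 -0500] conn=1022 op=2 RESULT err=0 "
    "tag=101 nentries=1 etime=0.000400000",
    "[07/Jun/2021:13:24:03.500000000 -0500] conn=1023 op=1 RESULT err=32 "
    "tag=101 nentries=0 etime=0.000300000",
]

hits = time_limit_results(SAMPLE)
print(hits)
```

The conn/op pairs can then be matched against the corresponding SRCH lines in the same log to see which searches ran out of time.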
>     >> >     >         >
>     >> >     >         >     But since you posted a traceback this time, it's clear
>     >> >     >         >     that this is just Apache waiting for client data to
>     >> >     >         >     read, so any tuning you do needs to be in Apache. You
>     >> >     >         >     could try tuning Timeout, which defaults to 60, but this
>     >> >     >         >     doesn't seem likely to help since at least some of the
>     >> >     >         >     clients aren't connecting at all and increasing the
>     >> >     >         >     timeout could make this worse.
>     >> >     >         >
>     >> >     >         >     Please revert the searchtimeout. 300 seconds is orders
>     >> >     >         >     of magnitude too big.
>     >> >     >         >
>     >> >     >         >     rob
>     >> >     >         >
>     >> >     >         >     >
>     >> >     >         >     > Alfred
>     >> >     >         >     >
>     >> >     >         >     > On Mon, Jun 7, 2021 at 3:57 PM Alfred Victor <alvic...@gmail.com> wrote:
>     >> >     >         >     >
>     >> >     >         >     >     Hi FreeIPA list,
>     >> >     >         >     >
>     >> >     >         >     >     I don't see anything in the error log matching
>     >> >     >         >     >     `grep -i "err=3" /var/log/httpd/error_log`. We have
>     >> >     >         >     >     tried raising searchtimelimit as high as 120, then
>     >> >     >         >     >     300 (and are now trying 600), but observed no
>     >> >     >         >     >     difference in the rate at which nodes succeeded or
>     >> >     >         >     >     failed in IPA joins. We are somewhat puzzled by
>     >> >     >         >     >     this, as we are not aware of any other values that
>     >> >     >         >     >     might have changed. It is possible that the IPA
>     >> >     >         >     >     systems are under slightly higher demand from
>     >> >     >         >     >     client systems; we have tried to mitigate this by
>     >> >     >         >     >     shutting down some workflows and aren't sure
>     >> >     >         >     >     whether we've seen any improvement. Short of
>     >> >     >         >     >     adjusting apache/resource/process timeouts it is
>     >> >     >         >     >     difficult to say what might be wrong. To give an
>     >> >     >         >     >     example, out of 250 nodes rebooted, only 112 joined
>     >> >     >         >     >     IPA successfully. Here is some output from the
>     >> >     >         >     >     error log, following what this looks like in the
>     >> >     >         >     >     ipa client install log (the error log output
>     >> >     >         >     >     matches the node attempt):
>     >> >     >         >     >
>     >> >     >         >     >
>     >> >     >         >     >     2021-06-07T18:25:30Z DEBUG The ipa-client-install command failed, exception: NetworkError: cannot connect to 'https://redactednode.com/ipa/json': Internal Server Error
>     >> >     >         >     >
>     >> >     >         >     >     [Mon Jun 07 13:25:06.198259 2021] [core:error] [pid 25020] [client 10.1.24.48:47808] Script timed out before returning headers: wsgi.py, referer: https://redacted.redacted.com/ipa/xml
>     >> >     >         >     >
>     >> >     >         >     >     Different node, same time period:
>     >> >     >         >     >
>     >> >     >         >     >
>     >> >     >         >     >     [Mon Jun 07 13:24:02.178092 2021] [:error] [pid 25725] ipa: INFO: [xmlserver] mach_j...@redacted.com: join(u'redacted.node.com', nshardwareplatform=u'x86_64', nsosversion=u'3.10.0-1062.18.1.1.el7.redacted.x86_64', version=u'2.51'): TimeLimitExceeded
>     >> >     >         >     >
>     >> >     >         >     >     I also saw this:
>     >> >     >         >     >
>     >> >     >         >     >     [Mon Jun 07 13:25:07.103503 2021] [:error] [pid 25725] ipa: ERROR: non-public: IOError: request data read error
>     >> >     >         >     >     [Mon Jun 07 13:25:07.103529 2021] [:error] [pid 25725] Traceback (most recent call last):
>     >> >     >         >     >     [Mon Jun 07 13:25:07.103536 2021] [:error] [pid 25725]   File "/usr/lib/python2.7/site-packages/ipaserver/rpcserver.py", line 360, in wsgi_execute
>     >> >     >         >     >     [Mon Jun 07 13:25:07.103542 2021] [:error] [pid 25725]     data = read_input(environ)
>     >> >     >         >     >     [Mon Jun 07 13:25:07.103548 2021] [:error] [pid 25725]   File "/usr/lib/python2.7/site-packages/ipaserver/rpcserver.py", line 200, in read_input
>     >> >     >         >     >     [Mon Jun 07 13:25:07.103553 2021] [:error] [pid 25725]     return environ['wsgi.input'].read(length).decode('utf-8')
>     >> >     >         >     >     [Mon Jun 07 13:25:07.103559 2021] [:error] [pid 25725] IOError: request data read error
>     >> >     >         >     >     [Mon Jun 07 13:25:07.103826 2021] [:error] [pid 25725] ipa: INFO: [xmlserver] mach_j...@redacted.com: None: InternalError
>     >> >     >         >     >     [Mon Jun 07 13:25:07.149962 2021] [:error] [pid 25726] ipa: ERROR: non-public: IOError: request data read error
>     >> >     >         >     >     [Mon Jun 07 13:25:07.149984 2021] [:error] [pid 25726] Traceback (most recent call last):
>     >> >     >         >     >     [Mon Jun 07 13:25:07.149991 2021] [:error] [pid 25726]   File "/usr/lib/python2.7/site-packages/ipaserver/rpcserver.py", line 360, in wsgi_execute
>     >> >     >         >     >     [Mon Jun 07 13:25:07.149997 2021] [:error] [pid 25726]     data = read_input(environ)
>     >> >     >         >     >     [Mon Jun 07 13:25:07.150002 2021] [:error] [pid 25726]   File "/usr/lib/python2.7/site-packages/ipaserver/rpcserver.py", line 200, in read_input
>     >> >     >         >     >     [Mon Jun 07 13:25:07.150008 2021] [:error] [pid 25726]     return environ['wsgi.input'].read(length).decode('utf-8')
>     >> >     >         >     >     [Mon Jun 07 13:25:07.150013 2021] [:error] [pid 25726] IOError: request data read error
>     >> >     >         >     >
>     >> >     >         >     >
>     >> >     >         >     >
>     >> >     >         >     >     After setting the timeout to 600 and rebooting the
>     >> >     >         >     >     remaining 139 nodes from the initial set of 250, 83
>     >> >     >         >     >     of the 139 joined and we still had ISEs occurring.
>     >> >     >         >     >     In some cases, a node would hit an ISE on the first
>     >> >     >         >     >     attempt, try another IPA system, and succeed. I'm
>     >> >     >         >     >     not sure that even a timeout as long as 600 has
>     >> >     >         >     >     helped.
>     >> >     >         >     >     Alfred
>     >> >     >         >     >
>     >> >     >         >     >
>     >> >     >         >     >
>     >> >     >         >     >
>     >> >     >         >     >     On Thu, Jun 3, 2021 at 7:51 PM Rob Crittenden <rcrit...@redhat.com> wrote:
>     >> >     >         >     >
>     >> >     >         >     >         Alfred Victor via FreeIPA-users wrote:
>     >> >     >         >     >         > Hi FreeIPA list,
>     >> >     >         >     >         >
>     >> >     >         >     >         > We are having an issue with our IPA environment of 4
>     >> >     >         >     >         > replicated FreeIPA systems serving Linux compute
>     >> >     >         >     >         > clients, which join via a command in rc.local after
>     >> >     >         >     >         > boot. This worked in the past, but the system has
>     >> >     >         >     >         > been rebuilt since and the join command changed
>     >> >     >         >     >         > slightly. Unfortunately, booting a few dozen nodes
>     >> >     >         >     >         > at a time, though they each talk to a different IPA
>     >> >     >         >     >         > system by design, leads to problems such as these.
>     >> >     >         >     >         > Though 40-100 nodes can boot OK at a time, there are
>     >> >     >         >     >         > always many stragglers, and the more we attempt to
>     >> >     >         >     >         > boot at once the more fail to join IPA (if we try
>     >> >     >         >     >         > to boot 500 nodes, we are lucky if a fifth of them
>     >> >     >         >     >         > join IPA). Can you please advise on this output?
>     >> >     >         >     >         > Here is our join command in the compute node
>     >> >     >         >     >         > rc.local:
>     >> >     >         >     >         >
>     >> >     >         >     >         >
>     >> >     >         >     >         >
>     >> >     >         >     >         >     ipa-client-install -U -q -p mach_join \
>     >> >     >         >     >         >     -w <redacted> \
>     >> >     >         >     >         >     --force-join \
>     >> >     >         >     >         >     --no-dns-sshfp \
>     >> >     >         >     >         >     --automount-location=redacted-node
>     >> >     >         >     >         >
>     >> >     >         >     >         >
>     >> >     >         >     >         >
>     >> >     >         >     >         >
>     >> >     >         >     >         > And here is some log output of the 500 error:
>     >> >     >         >     >         >
>     >> >     >         >     >         >     ProtocolError: <ProtocolError for redacted.redacted.com/ipa/json: 500 Internal Server Error>
>     >> >     >         >     >         >     Cannot connect to the server due to generic error: cannot connect to 'https://redacted.redacted.com/ipa/json': Internal Server Error
>     >> >     >         >     >         >
>     >> >     >         >     >         >
>     >> >     >         >     >         > As well as:
>     >> >     >         >     >         >
>     >> >     >         >     >         >     2021-06-02T21:39:11Z DEBUG Starting external process
>     >> >     >         >     >         >     2021-06-02T21:39:11Z DEBUG args=/usr/sbin/ipa-join -s redacted.redacted.com -b dc=redacted,dc=com -h redactednode.redacted.com -f
>     >> >     >         >     >         >     2021-06-02T21:40:13Z DEBUG Process finished, return code=17
>     >> >     >         >     >         >     2021-06-02T21:40:13Z DEBUG stdout=
>     >> >     >         >     >         >     2021-06-02T21:40:13Z DEBUG stderr=HTTP response code is 500, not 200
>     >> >     >         >     >         >     2021-06-02T21:40:13Z ERROR Joining realm failed: HTTP response code is 500, not 200
>     >> >     >         >     >         >
>     >> >     >         >     >         > And we also see timeouts happen:
>     >> >     >         >     >         >
>     >> >     >         >     >         >
>     >> >     >         >     >         >     2021-06-02T22:08:50Z DEBUG args=/usr/sbin/ipa-join -s redacted.redacted.com -b dc=redacted,dc=com -h redactednode.redacted.com -f
>     >> >     >         >     >         >     2021-06-02T22:09:01Z DEBUG Process finished, return code=17
>     >> >     >         >     >         >     2021-06-02T22:09:01Z DEBUG stdout=
>     >> >     >         >     >         >     2021-06-02T22:09:01Z DEBUG stderr=RPC failed at server. Configured time limit exceeded
>     >> >     >         >     >         >     2021-06-02T22:09:01Z ERROR Joining realm failed: RPC failed at server. Configured time limit exceeded
>     >> >     >         >     >         >
>     >> >     >         >     >         >
>     >> >     >         >     >         > In some cases we also see later timeouts near the end of the log.
>     >> >     >         >     >         > There we are able to authenticate and the install is not backed
>     >> >     >         >     >         > out, but the client never becomes healthy either:
>     >> >     >         >     >         >
>     >> >     >         >     >         >
>     >> >     >         >     >         >
>     >> >     >         >     >         >     2021-06-03T19:20:13Z DEBUG The ipa-client-install command failed, exception: TimeLimitExceeded: Configured time limit exceeded
>     >> >     >         >     >
>     >> >     >         >     >         When you see Internal Error look to the Apache error log on the
>     >> >     >         >     >         server for more information.
>     >> >     >         >     >
>     >> >     >         >     >         In this case an LDAP search is failing because the server is too
>     >> >     >         >     >         busy. Look for queries failing with err=3 to get an idea of how
>     >> >     >         >     >         long it is taking.
>     >> >     >         >     >
>     >> >     >         >     >         To increase the timeout use: ipa config-mod --searchtimelimit=INT
>     >> >     >         >     >
>     >> >     >         >     >         The default is 2 seconds.
>     >> >     >         >     >
>     >> >     >         >     >         You can pick a time at random but could see failures again.
>     >> >     >         >     >
>     >> >     >         >     >         rob
>     >> >     >         >     >
>     >> >     >         >
>     >> >     >
>     >> >
>     >>
>     >>
> 
> 
> 
> 
>     -- 
>     / Alexander Bokovoy
>     Sr. Principal Software Engineer
>     Security / Identity Management Engineering
>     Red Hat Limited, Finland
> 
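
For anyone following the earlier advice in this thread to look for queries
failing with err=3: below is a rough sketch of one way to pull those results
out of the directory server access log and see how long each took. It assumes
RESULT lines carry conn=, op=, err= and etime= fields in that order; the
sample lines, the function name and the regular expression are illustrative
and not taken from the original poster's logs. err=3 is the LDAP result code
for "time limit exceeded".

```python
import re

# Assumed shape of an access-log RESULT line: conn=, op=, err= and etime= fields.
RESULT_RE = re.compile(
    r"conn=(?P<conn>\d+) op=(?P<op>-?\d+) RESULT err=(?P<err>\d+)\b"
    r".*?\betime=(?P<etime>[\d.]+)"
)

def timed_out_results(lines, err_code=3):
    """Return (conn, op, etime_seconds) for each RESULT line with the given err code."""
    hits = []
    for line in lines:
        m = RESULT_RE.search(line)
        if m and int(m.group("err")) == err_code:
            hits.append((int(m.group("conn")), int(m.group("op")), float(m.group("etime"))))
    return hits

# Hypothetical sample lines, made up for illustration only.
SAMPLE = [
    '[02/Jun/2021:21:40:11.000000000 +0000] conn=101 op=5 SRCH base="dc=redacted,dc=com" scope=2 filter="(fqdn=redactednode.redacted.com)" attrs=ALL',
    "[02/Jun/2021:21:40:13.123456789 +0000] conn=101 op=5 RESULT err=3 tag=101 nentries=0 etime=2.004512345 notes=U",
    "[02/Jun/2021:21:40:14.000000000 +0000] conn=102 op=1 RESULT err=0 tag=101 nentries=1 etime=0.000812000",
    "[02/Jun/2021:22:09:01.500000000 +0000] conn=140 op=7 RESULT err=3 tag=101 nentries=0 wtime=0.000101000 optime=2.100000000 etime=2.100101000",
]

if __name__ == "__main__":
    for conn, op, etime in timed_out_results(SAMPLE):
        print(f"conn={conn} op={op} hit the time limit after {etime:.3f}s")
```

On a real server you would pass an open file instead of SAMPLE, for example
the access log under /var/log/dirsrv/slapd-INSTANCE/ (INSTANCE is a
placeholder). The conn and op numbers then point back to the matching SRCH
line, and the etime values give a starting point when choosing a value for
--searchtimelimit.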
_______________________________________________
FreeIPA-users mailing list -- freeipa-users@lists.fedorahosted.org
To unsubscribe send an email to freeipa-users-le...@lists.fedorahosted.org
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedorahosted.org/archives/list/freeipa-users@lists.fedorahosted.org
Do not reply to spam on the list, report it: 
https://pagure.io/fedora-infrastructure
