Alfred Victor wrote: > Hi Rob and FreeIPA list, > > I wish I was able to say we are seeing success but aside from no longer > having the krb5kdc.log log spam, we still experience a lot of join > failures. However rather than seeing a ton of timeouts (still a few but > much less), we get a lot of errors like below, which totally puzzle us > as the ipactl services all report up, and mach_join account should have > sufficient privilege. I will note we have our own non-IPA DNS solution > where we have added all the records IPA should need but I can't see why > this change would cause our errors (and indeed all the output of "ipa > *dns*-update-system-records --dry-run" looks ok, not that below errors > appear DNS related but its the only change I can think of aside from > having replaced our IPA hosts with re-replicated new builds.) > > This works fine on rerun of install, and I don't see any evidence of > service failure on the system this happened against, but neither am I > aware of any intermittent network problem (perhaps there is a network > problem on the very first connection to IPA we have not identified? but > I would expect different errors than the 401 under this for instance) > > 2021-06-21T19:12:05Z DEBUG Initializing principal > mach_j...@redac.com using password > 2021-06-21T19:12:05Z DEBUG Starting external process > 2021-06-21T19:12:05Z DEBUG args=/usr/bin/kinit mach_j...@redac.com > -c /tmp/krbcc_xp2E2/ccache > 2021-06-21T19:12:24Z DEBUG Process finished, return code=1 > 2021-06-21T19:12:24Z DEBUG stdout= > 2021-06-21T19:12:24Z DEBUG stderr=kinit: Cannot contact any KDC for > realm 'redac.COM' while getting initial credentials > 2021-06-21T19:12:24Z INFO Please make sure the following ports are > opened in the firewall settings: > TCP: 80, 88, 389 > UDP: 88 (at least one of TCP/UDP ports 88 has to be open) > Also note that following ports are necessary for ipa-client working > properly after enrollment: > TCP: 464 > UDP: 464, 123 (if NTP enabled) > 2021-06-21T19:12:24Z ERROR Installation failed. Rolling back changes. > > > > But then we also see things like this predominantly, in fact this seems > now to be the largest category of failure we observe at the highest rate > and makes no sense to me, because the mach_join account privileges > should not be changing: > > 2021-06-17T19:29:11Z DEBUG args=/usr/sbin/ipa-join -s > hipa0003.redac.com <http://hipa0003.redac.com> -b dc=redac,dc=com -h > hred2-4-18.redac.com <http://hred2-4-18.redac.com> -f > 2021-06-17T19:29:31Z DEBUG Process finished, return code=17 > 2021-06-17T19:29:31Z DEBUG stdout= > 2021-06-17T19:29:31Z DEBUG stderr=HTTP response code is 401, not 200 > 2021-06-17T19:29:31Z ERROR Joining realm failed: HTTP response code > is 401, not 200 > > > > Some more like this, but not quite the same error: > > > 2021-06-17T19:29:32Z DEBUG get_ca_certs_from_ldap() error: > Insufficient access: SASL(-1): generic failure: GSSAPI Error: > Unspecified GSS failure. Minor code may provide more information > (Cannot contact any KDC for realm 'redac.COM') > 2021-06-17T19:29:32Z DEBUG Insufficient access: SASL(-1): generic > failure: GSSAPI Error: Unspecified GSS failure. Minor code may > provide more information (Cannot contact any KDC for realm 'redac.COM') > > > And still another example: > > > > > 2021-06-17T22:25:37Z DEBUG args=/usr/sbin/ipa-join -s > hipa0001.redac.com <http://hipa0001.redac.com> -b dc=redac,dc=com -h > hred2-4-18.redac.com <http://hred2-4-18.redac.com> -f > 2021-06-17T22:25:41Z DEBUG Process finished, return code=17 > 2021-06-17T22:25:41Z DEBUG stdout= > 2021-06-17T22:25:41Z DEBUG stderr=No permission to join this host to > the IPA domain. > 2021-06-17T22:25:41Z ERROR Joining realm failed: No permission to > join this host to the IPA domain. > 2021-06-17T22:25:41Z ERROR Installation failed. Rolling back changes. > > Any thoughts or ideas please?
Some of these could be network issues. I'd run ipa-client-install with Kerberos tracing: KRB5_TRACE=/path/to/some/log ipa-client-install <options> So that on failure you can see what KDCs were available, whether one was contacted, etc. rob > > Alfred > > > On Tue, Jun 15, 2021 at 3:19 PM Rob Crittenden <rcrit...@redhat.com > <mailto:rcrit...@redhat.com>> wrote: > > Alfred Victor wrote: > > I don't see a directive equivalent of SECURE_NFS to add to > nfs.conf (all > > documentation seems to still refer to the sysconfig path), or is > it the > > same? Can I just disable rpcgssd? We have no nfs mounts which are > > kerberized yet, and disabling rpcgssd seems to solve our problem, > and I > > can kinit after disabling rpcgssd. It also does not seem that > disabling > > rpcgssd hurts running tasks or node, but would like to confirm it's > > limited to nfs in function. I still have to wonder if there's a > better way. > > I suspect that SECURE_NFS is a no-op these days. It was necessary in > RHEL 5/6 for sure. > > Yes, I think you can safely disable the service. In some versions of > RHEL/Fedora IIRC this is symlinked to nfs-secure-server so mask that as > well. > > rob > > > > > Alfred > > > > On Tue, Jun 15, 2021 at 10:31 AM Alexander Bokovoy > <aboko...@redhat.com <mailto:aboko...@redhat.com> > > <mailto:aboko...@redhat.com <mailto:aboko...@redhat.com>>> wrote: > > > > On ti, 15 kesä 2021, Alfred Victor via FreeIPA-users wrote: > > >Hi Rob, > > > > > >We attempted setting sec=sys on the mount, however to our > surprise > > found > > >this didn't work. We then figured out that IPA install is adding > > this to > > >/etc/sysconfig/nfs: > > > > > >SECURE_NFS=yes > > > > > > > > >We tried removing this to no avail and restarting all the related > > sytstemd > > >units (rpcgssd, nfs, etc). Any idea why sec=sys is being ignored? > > Should we > > >need to set SECURE_NFS=no? On non-IPA nodes this directive > does not > > exist > > >at all. For now, I have also totally disabled rpcgssd as I think > > this unit > > >may be responsible ( it seems that it does the upcall in > > >https://access.redhat.com/solutions/225783 ) so I will hope > that this > > >solves it, I don't believe anything else depends on rpcgssd but > > will soon > > >find out. Any suggestions please? :) > > > > Depends on nfs-utils version? I remember there had been a > change in > > configuration in upstream nfs-utils in 2019: > > > > commit c69875c8afdd877baf7139c0cd5241f70105cbd4 > > Author: François Cami <fc...@redhat.com > <mailto:fc...@redhat.com> <mailto:fc...@redhat.com > <mailto:fc...@redhat.com>>> > > Date: Tue Feb 26 13:59:06 2019 +0100 > > > > ipa-client-automount: handle NFS configuration file changes > > > > nfs-utils in Fedora 30 and later switched its configuration > > file from /etc/sysconfig/nfs to /etc/nfs.conf, providing a > > conversion service (nfs-convert.service) for upgrades. > > However, for new installs the original configuration file > > is missing. This change: > > * adds a tuple-based osinfo.version_number method to handle > > more kinds of OS versioning schemes > > * detects RHEL and Fedora versions with the the new nfs-utils > > behavior > > * avoids backing up the new NFS configuration file as we do > > not have to modify it. > > > > See: https://bugzilla.redhat.com/show_bug.cgi?id=1676981 > > > > Fixes: https://pagure.io/freeipa/issue/7868 > > > > > > > > > >Alfred > > > > > >On Thu, Jun 10, 2021 at 2:17 PM Rob Crittenden > <rcrit...@redhat.com <mailto:rcrit...@redhat.com> > > <mailto:rcrit...@redhat.com <mailto:rcrit...@redhat.com>>> wrote: > > > > > >> Alfred Victor wrote: > > >> > Thanks very much Rob et al! I believe we have found our root > > cause and > > >> > the fix. If you like I'll provide some more details after > we're > > done > > >> > with everything. > > >> > > >> Yes, knowing the cause would be great and could be helpful > to others! > > >> > > >> cheers > > >> > > >> rob > > >> > > >> > > > >> > Alfred > > >> > > > >> > On Thu, Jun 10, 2021 at 11:02 AM Rob Crittenden > > <rcrit...@redhat.com <mailto:rcrit...@redhat.com> > <mailto:rcrit...@redhat.com <mailto:rcrit...@redhat.com>> > > >> > <mailto:rcrit...@redhat.com <mailto:rcrit...@redhat.com> > <mailto:rcrit...@redhat.com <mailto:rcrit...@redhat.com>>>> wrote: > > >> > > > >> > Alfred Victor wrote: > > >> > > Hi all, > > >> > > > > >> > > Just curious if anyone has suggestions about that > please > > before I > > >> get > > >> > > going in a couple of hours with conversions to IPA > again? > > I did > > >> > the math > > >> > > and 1,097,471 log messages in 5 hours is about 60 > times per > > >> second, so > > >> > > I'm gradually becoming more certain this is why we can > > only boot > > >> 20-30 > > >> > > nodes at a time when we used to boot hundreds. However, > > this is > > >> still > > >> > > just a guess as I don't know the mechanism behind > why this > > >> interferes > > >> > > with IPA joins, some bottleneck with the KDC? > > >> > > > >> > It sure seems like this is a kerberized NFS request for a > > host that > > >> > doesn't provide it, or doesn't have an nfs principal. I > > think you'd > > >> need > > >> > to monitor the offending client to see what it is doing. > > >> > > > >> > If the KDC is being dramatically slowed down by this then > > yes, you > > >> could > > >> > see slower Apache performance because it needs to obtain > > tickets on > > >> > behalf of the user doing the join. Whether that would > represent > > >> itself > > >> > as a read timeout I don't know. > > >> > > > >> > rob > > >> > > > >> > > > > >> > > Alfred > > >> > > > > >> > > On Wed, Jun 9, 2021 at 3:49 PM Alfred Victor > > <alvic...@gmail.com <mailto:alvic...@gmail.com> > <mailto:alvic...@gmail.com <mailto:alvic...@gmail.com>> > > >> > <mailto:alvic...@gmail.com > <mailto:alvic...@gmail.com> <mailto:alvic...@gmail.com > <mailto:alvic...@gmail.com>>> > > >> > > <mailto:alvic...@gmail.com > <mailto:alvic...@gmail.com> <mailto:alvic...@gmail.com > <mailto:alvic...@gmail.com>> > > <mailto:alvic...@gmail.com <mailto:alvic...@gmail.com> > <mailto:alvic...@gmail.com <mailto:alvic...@gmail.com>>>>> wrote: > > >> > > > > >> > > Hi Rob, > > >> > > > > >> > > I have reduced that timeout and will tune it > further. > > >> > Regarding ISE > > >> > > errors, I think we can make the assumption that > this is > > >> > entirely an > > >> > > issue of the web timeouts, I haven't seen any > evidence > > >> > otherwise and > > >> > > will have another attempt at converting nodes > > tomorrow, and > > >> with a > > >> > > keener eye of what to look for I can make a better > > >> determination > > >> > > then. I am most concerned over what the underlying > > cause might > > >> be > > >> > > causing it to take too long and hit the > timeout, and > > don't > > >> want to > > >> > > engineer around this by changing Apache timeouts if > > we can > > >> instead > > >> > > address the root cause. I am suspicious of krb log > > messages > > >> > flooding > > >> > > our IPA systems about a service principal like so, > > but not > > >> sure if > > >> > > this is gumming up the works (or even why this > > message has > > >> started > > >> > > appearing since the rebuild): > > >> > > > > >> > > ./krb5kdc.log:Jun 09 10:38:51 > redacted.redacted.com <http://redacted.redacted.com> > > <http://redacted.redacted.com> > > >> > <http://redacted.redacted.com> > <http://redacted.redacted.com> > > >> > krb5kdc[31187](info): TGS_REQ (4 etypes {18 17 16 23}) > > 10.1.1.27 > > >> > <http://10.1.1.27>: LOOKING_UP_SERVER: authtime 0, > > >> > host/redacted.redacted....@redacted.com for > > >> > nfs/nfsserver.redacted....@redacted.com, Server not > found in > > >> > Kerberos database > > >> > > > > >> > > > > >> > > Just in the last 5 hours alone, this log > message and > > others > > >> > like it (main difference is just the nodename it > > references) has > > >> > appeared 1,097,471 times. Conceivably there is also some > > log write > > >> > locking or something going on that could be slowing IPA > > down and > > >> > leading to our symptom here? > > >> > > > > >> > > Alfred > > >> > > > > >> > > > > >> > > On Wed, Jun 9, 2021 at 9:19 AM Rob Crittenden > > >> > <rcrit...@redhat.com <mailto:rcrit...@redhat.com> > <mailto:rcrit...@redhat.com <mailto:rcrit...@redhat.com>> > > <mailto:rcrit...@redhat.com <mailto:rcrit...@redhat.com> > <mailto:rcrit...@redhat.com <mailto:rcrit...@redhat.com>>> > > >> > > <mailto:rcrit...@redhat.com > <mailto:rcrit...@redhat.com> > > <mailto:rcrit...@redhat.com <mailto:rcrit...@redhat.com>> > <mailto:rcrit...@redhat.com <mailto:rcrit...@redhat.com> > > <mailto:rcrit...@redhat.com <mailto:rcrit...@redhat.com>>>>> > > >> wrote: > > >> > > > > >> > > Alfred Victor wrote: > > >> > > > Hi Rob, > > >> > > > > > >> > > > We did revert to 60s - I seem to remember > some > > ldapsearch > > >> > > timing out > > >> > > > previously but maybe we could still greatly > > reduce this > > >> with > > >> > > no ill > > >> > > > effect. However, we saw no change in join > > success either > > >> way > > >> > > and I have > > >> > > > not changed anything in Apache as I would > need > > to find > > >> the > > >> > > exact values > > >> > > > in question and I think these directive > changes > > may get > > >> lost > > >> > > with an > > >> > > > update? The timeout/ISE issues are new - we > > previously > > >> > booted > > >> > > many nodes > > >> > > > concurrently (400+) without problems but > now it > > happens > > >> even > > >> > > booting as > > >> > > > few as 50 results in 10-20 timeouts or ISE. > > This has > > >> > been the > > >> > > case at > > >> > > > least since we rebuilt and re-replicated the > > environment > > >> by > > >> > > swapping IPA > > >> > > > members out of the old one, but maybe this is > > unrelated. > > >> > Is it > > >> > > possible > > >> > > > we landed on a newer version with > different timeout > > >> > values? Is it > > >> > > > possible that IPA is under slightly more load > > from the > > >> > higher > > >> > > number of > > >> > > > nodes and has some bottleneck since we > last did > > >> conversions? > > >> > > We have not > > >> > > > been able to substantiate the latter by > looking at > > >> > > CPU/memory/io trends. > > >> > > > What might we investigate next to see if > we may > > have > > >> missed > > >> > > some ongoing > > >> > > > issue that could for instance cause locking > > problems or > > >> > > something else > > >> > > > internally in IPA to explain our symptoms? > > Also, I have > > >> > > observed some > > >> > > > nodes which report a successful > installation of IPA > > >> client, > > >> > > but have in > > >> > > > fact a lot of failures, for instance with > > mounts not > > >> > working from > > >> > > > automount setup. We will need to try to > > reproduce this to > > >> > > understand > > >> > > > what happened I think as we have already > sorted > > those > > >> nodes > > >> > > elsewhere > > >> > > > and they have got going outside of IPA. > > >> > > > > > >> > > > I am interested that you suggested "at least > > some of the > > >> > > clients aren't > > >> > > > connecting at all and increasing the timeout > > could make > > >> this > > >> > > worse" - > > >> > > > this might indicate some sort of network > problem at > > >> > play, but > > >> > > as far as > > >> > > > I am aware, everything is working absent > IPA so > > I do not > > >> > > suspect it > > >> > > > presently. > > >> > > > > >> > > A 60 second LDAP timeout is still way too big. > > The default > > >> > is 2. > > >> > > Unless > > >> > > you are seeing timeouts I'd suggest > lowering it. > > >> > > > > >> > > I'm only seeing hints to what you're seeing and > > the scope > > >> so > > >> > > it's hard > > >> > > to make further suggestions. Are all the > internal > > errors > > >> the > > >> > > same, for > > >> > > example? Are some failing due to LDAP > timeouts or > > are they > > >> all > > >> > > wsgi read > > >> > > timeouts? > > >> > > > > >> > > The Apache request timeout can be configured in > > httpd.conf > > >> > which is > > >> > > independent of the config files that IPA > > currently writes > > >> > so it > > >> > > should > > >> > > survive upgrades. > > >> > > > > >> > > rob > > >> > > > > > >> > > > Alfred > > >> > > > > > >> > > > On Mon, Jun 7, 2021 at 4:42 PM Rob Crittenden > > >> > > <rcrit...@redhat.com > <mailto:rcrit...@redhat.com> <mailto:rcrit...@redhat.com > <mailto:rcrit...@redhat.com>> > > <mailto:rcrit...@redhat.com <mailto:rcrit...@redhat.com> > <mailto:rcrit...@redhat.com <mailto:rcrit...@redhat.com>>> > > >> > <mailto:rcrit...@redhat.com > <mailto:rcrit...@redhat.com> <mailto:rcrit...@redhat.com > <mailto:rcrit...@redhat.com>> > > <mailto:rcrit...@redhat.com <mailto:rcrit...@redhat.com> > <mailto:rcrit...@redhat.com <mailto:rcrit...@redhat.com>>>> > > >> > > > <mailto:rcrit...@redhat.com > <mailto:rcrit...@redhat.com> > > <mailto:rcrit...@redhat.com <mailto:rcrit...@redhat.com>> > <mailto:rcrit...@redhat.com <mailto:rcrit...@redhat.com> > > <mailto:rcrit...@redhat.com <mailto:rcrit...@redhat.com>>> > > >> > <mailto:rcrit...@redhat.com > <mailto:rcrit...@redhat.com> <mailto:rcrit...@redhat.com > <mailto:rcrit...@redhat.com>> > > <mailto:rcrit...@redhat.com <mailto:rcrit...@redhat.com> > <mailto:rcrit...@redhat.com <mailto:rcrit...@redhat.com>>>>>> wrote: > > >> > > > > > >> > > > Alfred Victor wrote: > > >> > > > > Actually, no change happened from > 300-> 600 > > >> > timeout, the > > >> > > web portal > > >> > > > > itself gave me an ISE I hadn't noticed > > when I tried > > >> > > clicking save! > > >> > > > > > >> > > > I wasn't clear which log to look in. > You'll see > > >> details > > >> > > about where the > > >> > > > error is caught in IPA in the Apache > log. > > To see > > >> LDAP > > >> > > timeouts you look > > >> > > > for err=3 in the 389-ds access log. > > >> > > > > > >> > > > But since you posted a traceback this > time it's > > >> > clear that > > >> > > this is just > > >> > > > Apache waiting for client data to read so > > any tuning > > >> you > > >> > > do needs to be > > >> > > > in Apache. You could try tuning > Timeout which > > >> > defaults to > > >> > > 60 but this > > >> > > > doesn't seem likely to help since at > least > > some of > > >> the > > >> > > clients aren't > > >> > > > connecting at all and increasing the > > timeout could > > >> make > > >> > > this worse. > > >> > > > > > >> > > > Please revert the searchtimeout. 300 > seconds is > > >> > orders of > > >> > > magnitude > > >> > > > too big. > > >> > > > > > >> > > > rob > > >> > > > > > >> > > > > > > >> > > > > Alfred > > >> > > > > > > >> > > > > On Mon, Jun 7, 2021 at 3:57 PM > Alfred Victor > > >> > > <alvic...@gmail.com > <mailto:alvic...@gmail.com> <mailto:alvic...@gmail.com > <mailto:alvic...@gmail.com>> > > <mailto:alvic...@gmail.com <mailto:alvic...@gmail.com> > <mailto:alvic...@gmail.com <mailto:alvic...@gmail.com>>> > > >> > <mailto:alvic...@gmail.com > <mailto:alvic...@gmail.com> <mailto:alvic...@gmail.com > <mailto:alvic...@gmail.com>> > > <mailto:alvic...@gmail.com <mailto:alvic...@gmail.com> > <mailto:alvic...@gmail.com <mailto:alvic...@gmail.com>>>> > > >> > > > <mailto:alvic...@gmail.com > <mailto:alvic...@gmail.com> > > <mailto:alvic...@gmail.com <mailto:alvic...@gmail.com>> > > >> > <mailto:alvic...@gmail.com > <mailto:alvic...@gmail.com> <mailto:alvic...@gmail.com > <mailto:alvic...@gmail.com>>> > > <mailto:alvic...@gmail.com <mailto:alvic...@gmail.com> > <mailto:alvic...@gmail.com <mailto:alvic...@gmail.com>> > > >> > <mailto:alvic...@gmail.com > <mailto:alvic...@gmail.com> <mailto:alvic...@gmail.com > <mailto:alvic...@gmail.com>>>>> > > >> > > > > <mailto:alvic...@gmail.com > <mailto:alvic...@gmail.com> > > <mailto:alvic...@gmail.com <mailto:alvic...@gmail.com>> > > >> > <mailto:alvic...@gmail.com > <mailto:alvic...@gmail.com> <mailto:alvic...@gmail.com > <mailto:alvic...@gmail.com>>> > > <mailto:alvic...@gmail.com <mailto:alvic...@gmail.com> > <mailto:alvic...@gmail.com <mailto:alvic...@gmail.com>> > > >> > <mailto:alvic...@gmail.com > <mailto:alvic...@gmail.com> <mailto:alvic...@gmail.com > <mailto:alvic...@gmail.com>>>> > > >> > > <mailto:alvic...@gmail.com > <mailto:alvic...@gmail.com> > > <mailto:alvic...@gmail.com <mailto:alvic...@gmail.com>> > <mailto:alvic...@gmail.com <mailto:alvic...@gmail.com> > > <mailto:alvic...@gmail.com <mailto:alvic...@gmail.com>>> > > >> > <mailto:alvic...@gmail.com > <mailto:alvic...@gmail.com> <mailto:alvic...@gmail.com > <mailto:alvic...@gmail.com>> > > <mailto:alvic...@gmail.com <mailto:alvic...@gmail.com> > <mailto:alvic...@gmail.com <mailto:alvic...@gmail.com>>>>>>> wrote: > > >> > > > > > > >> > > > > Hi FreeIPA list, > > >> > > > > > > >> > > > > I don't see any in error log that > > match `grep > > >> > -i "err=3" > > >> > > > > /var/log/httpd/error_log`. We > have tried > > >> raising > > >> > > > searchtimelimit as > > >> > > > > high as 120, then 300 (now are > trying > > 600) but > > >> > > observed no > > >> > > > > difference in the rate at which > nodes > > >> succeeded or > > >> > > failed in IPA > > >> > > > > joins. We are somewhat puzzled by > > this, as > > >> none of > > >> > > the other > > >> > > > values > > >> > > > > we are aware of might have changed, > > though it > > >> is > > >> > > possible that the > > >> > > > > IPA systems are under a little > higher > > demand > > >> > from client > > >> > > > systems, we > > >> > > > > have tried to mitigate this by > > shutting down > > >> some > > >> > > workflows and > > >> > > > > aren't sure whether we've seen any > > improvement. > > >> > > Short of adjusting > > >> > > > > apache/resource/process > timeouts it is > > >> > difficult to > > >> > > say what might > > >> > > > > be wrong. To give an example, > out of > > 250 nodes > > >> > > rebooted, only 112 > > >> > > > > joined IPA successfully. Here > is some > > output > > >> from > > >> > > the error log, > > >> > > > > following what this looks like > in the ipa > > >> client > > >> > > install log > > >> > > > (error > > >> > > > > log output will match the node > attempt): > > >> > > > > > > >> > > > > > > >> > > > > 2021-06-07T18:25:30Z DEBUG The > > >> ipa-client-install > > >> > > command > > >> > > > failed, exception: NetworkError: cannot > > connect to > > >> > > > 'https://redactednode.com/ipa/json > > >> > > > > <https://hauth0004.dug.com/ipa/json>': Internal > > >> > Server Error > > >> > > > > > > >> > > > > [Mon Jun 07 13:25:06.198259 2021] > > [core:error] > > >> > [pid > > >> > > 25020] > > >> > > > [client 10.1.24.48:47808 > <http://10.1.24.48:47808> > > <http://10.1.24.48:47808> <http://10.1.24.48:47808> > > >> > <http://10.1.24.48:47808> > > >> > > <http://10.1.24.48:47808> > > >> > > > <http://10.1.24.48:47808>] Script > timed out > > before > > >> > returning > > >> > > > headers: wsgi.py, referer: > > >> > > https://redacted.redacted.com/ipa/xml > > >> > > > <https://hauth0004.dug.com/ipa/xml> > > >> > > > > > > >> > > > > Different node, same time period: > > >> > > > > > > >> > > > > > > >> > > > > [Mon Jun 07 13:24:02.178092 2021] > > [:error] [pid > > >> > > 25725] ipa: > > >> > > > INFO: [xmlserver] mach_j...@redacted.com: > > >> > > join(u'redacted.node.com > <http://redacted.node.com> > > <http://redacted.node.com> <http://redacted.node.com> > > >> > <http://redacted.node.com> > > >> > > > <http://redacted.node.com> > > <http://redacted.node.com > > >> >', > > >> > > > nshardwareplatform=u'x86_64', > > >> > > > > > >> nsosversion=u'3.10.0-1062.18.1.1.el7.redacted.x86_64', > > >> > > > version=u'2.51'): TimeLimitExceeded > > >> > > > > > > >> > > > > I also saw this: > > >> > > > > > > >> > > > > [Mon Jun 07 13:25:07.103503 2021] > > [:error] [pid > > >> > > 25725] ipa: > > >> > > > ERROR: non-public: IOError: request data > > read error > > >> > > > > [Mon Jun 07 13:25:07.103529 2021] > > [:error] > > >> > [pid 25725] > > >> > > > Traceback (most recent call last): > > >> > > > > [Mon Jun 07 13:25:07.103536 2021] > > [:error] [pid > > >> > > 25725] File > > >> > > > > > >> > > "/usr/lib/python2.7/site-packages/ipaserver/rpcserver.py", > > >> > > line 360, > > >> > > > in wsgi_execute > > >> > > > > [Mon Jun 07 13:25:07.103542 2021] > > [:error] [pid > > >> > > 25725] > > >> > > > data = read_input(environ) > > >> > > > > [Mon Jun 07 13:25:07.103548 2021] > > [:error] [pid > > >> > > 25725] File > > >> > > > > > >> > > "/usr/lib/python2.7/site-packages/ipaserver/rpcserver.py", > > >> > > line 200, > > >> > > > in read_input > > >> > > > > [Mon Jun 07 13:25:07.103553 2021] > > [:error] [pid > > >> > > 25725] > > >> > > > return > > >> > environ['wsgi.input'].read(length).decode('utf-8') > > >> > > > > [Mon Jun 07 13:25:07.103559 2021] > > [:error] > > >> > [pid 25725] > > >> > > > IOError: request data read error > > >> > > > > [Mon Jun 07 13:25:07.103826 2021] > > [:error] [pid > > >> > > 25725] ipa: > > >> > > > INFO: [xmlserver] > mach_j...@redacted.com: None: > > >> > InternalError > > >> > > > > [Mon Jun 07 13:25:07.149962 2021] > > [:error] [pid > > >> > > 25726] ipa: > > >> > > > ERROR: non-public: IOError: request data > > read error > > >> > > > > [Mon Jun 07 13:25:07.149984 2021] > > [:error] > > >> > [pid 25726] > > >> > > > Traceback (most recent call last): > > >> > > > > [Mon Jun 07 13:25:07.149991 2021] > > [:error] [pid > > >> > > 25726] File > > >> > > > > > >> > > "/usr/lib/python2.7/site-packages/ipaserver/rpcserver.py", > > >> > > line 360, > > >> > > > in wsgi_execute > > >> > > > > [Mon Jun 07 13:25:07.149997 2021] > > [:error] [pid > > >> > > 25726] > > >> > > > data = read_input(environ) > > >> > > > > [Mon Jun 07 13:25:07.150002 2021] > > [:error] [pid > > >> > > 25726] File > > >> > > > > > >> > > "/usr/lib/python2.7/site-packages/ipaserver/rpcserver.py", > > >> > > line 200, > > >> > > > in read_input > > >> > > > > [Mon Jun 07 13:25:07.150008 2021] > > [:error] [pid > > >> > > 25726] > > >> > > > return > > >> > environ['wsgi.input'].read(length).decode('utf-8') > > >> > > > > [Mon Jun 07 13:25:07.150013 2021] > > [:error] > > >> > [pid 25726] > > >> > > > IOError: request data read error > > >> > > > > > > >> > > > > > > >> > > > > > > >> > > > > After setting the timeout to > 600 and > > rebooting > > >> the > > >> > > remaining > > >> > > > 139 nodes from the initial set of 250, 83 > > joined of > > >> the > > >> > > 139 and we > > >> > > > still had ISE occurring. In some > cases, it > > would ISE > > >> on > > >> > > the first > > >> > > > attempt, try another IPA system, and > > succeed. I'm > > >> > not sure > > >> > > that even > > >> > > > such a long timeout as 600 has helped. > > >> > > > > > > >> > > > > Alfred > > >> > > > > > > >> > > > > > > >> > > > > > > >> > > > > > > >> > > > > On Thu, Jun 3, 2021 at 7:51 PM Rob > > Crittenden > > >> > > > <rcrit...@redhat.com > <mailto:rcrit...@redhat.com> > > <mailto:rcrit...@redhat.com <mailto:rcrit...@redhat.com>> > <mailto:rcrit...@redhat.com <mailto:rcrit...@redhat.com> > > <mailto:rcrit...@redhat.com <mailto:rcrit...@redhat.com>>> > > >> > <mailto:rcrit...@redhat.com > <mailto:rcrit...@redhat.com> <mailto:rcrit...@redhat.com > <mailto:rcrit...@redhat.com>> > > <mailto:rcrit...@redhat.com <mailto:rcrit...@redhat.com> > <mailto:rcrit...@redhat.com <mailto:rcrit...@redhat.com>>>> > > >> > > <mailto:rcrit...@redhat.com > <mailto:rcrit...@redhat.com> > > <mailto:rcrit...@redhat.com <mailto:rcrit...@redhat.com>> > <mailto:rcrit...@redhat.com <mailto:rcrit...@redhat.com> > > <mailto:rcrit...@redhat.com <mailto:rcrit...@redhat.com>>> > > >> > <mailto:rcrit...@redhat.com > <mailto:rcrit...@redhat.com> <mailto:rcrit...@redhat.com > <mailto:rcrit...@redhat.com>> > > <mailto:rcrit...@redhat.com <mailto:rcrit...@redhat.com> > <mailto:rcrit...@redhat.com <mailto:rcrit...@redhat.com>>>>> > > >> > > > > <mailto:rcrit...@redhat.com > <mailto:rcrit...@redhat.com> > > <mailto:rcrit...@redhat.com <mailto:rcrit...@redhat.com>> > > >> > <mailto:rcrit...@redhat.com > <mailto:rcrit...@redhat.com> <mailto:rcrit...@redhat.com > <mailto:rcrit...@redhat.com>>> > > >> > > <mailto:rcrit...@redhat.com > <mailto:rcrit...@redhat.com> > > <mailto:rcrit...@redhat.com <mailto:rcrit...@redhat.com>> > <mailto:rcrit...@redhat.com <mailto:rcrit...@redhat.com> > > <mailto:rcrit...@redhat.com <mailto:rcrit...@redhat.com>>>> > > >> > <mailto:rcrit...@redhat.com > <mailto:rcrit...@redhat.com> <mailto:rcrit...@redhat.com > <mailto:rcrit...@redhat.com>> > > <mailto:rcrit...@redhat.com <mailto:rcrit...@redhat.com> > <mailto:rcrit...@redhat.com <mailto:rcrit...@redhat.com>>> > > >> > > <mailto:rcrit...@redhat.com > <mailto:rcrit...@redhat.com> > > <mailto:rcrit...@redhat.com <mailto:rcrit...@redhat.com>> > > >> > <mailto:rcrit...@redhat.com <mailto:rcrit...@redhat.com> > > <mailto:rcrit...@redhat.com <mailto:rcrit...@redhat.com>>>>>>> > wrote: > > >> > > > > > > >> > > > > Alfred Victor via FreeIPA-users > > wrote: > > >> > > > > > Hi FreeIPA list, > > >> > > > > > > > >> > > > > > We are having an issue > with our IPA > > >> > > environment of 4 > > >> > > > > replicated FreeIPA > > >> > > > > > systems serving linux compute > > clients > > >> > > which join from a > > >> > > > command in > > >> > > > > > rc.local after boot. This > > worked in the > > >> > past, > > >> > > but the system > > >> > > > > has been > > >> > > > > > rebuilt since and the > join command > > >> changed > > >> > > > slightly. Unfortunately > > >> > > > > > booting a few dozen nodes > at a > > time, > > >> though > > >> > > they each > > >> > > > talk to a > > >> > > > > > different IPA system by > design, > > leads to > > >> > > problems such as > > >> > > > > these - though > > >> > > > > > 40-100 nodes can boot ok at a > > time there > > >> are > > >> > > always many > > >> > > > > stragglers, and > > >> > > > > > the more we attempt to > boot at > > once the > > >> more > > >> > > fail to > > >> > > > join IPA > > >> > > > > (if we try > > >> > > > > > to boot 500 nodes, we are > lucky > > if we > > >> get a > > >> > > fifth of that > > >> > > > > joining IPA). > > >> > > > > > Can you please advise on this > > output? > > >> > Here is > > >> > > our join > > >> > > > command in > > >> > > > > > compute node rc.local: > > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > ipa-client-install -U > -q -p > > >> mach_join \ > > >> > > > > > -w <redacted> \ > > >> > > > > > --force-join \ > > >> > > > > > --no-dns-sshfp \ > > >> > > > > > > > --automount-location=redacted-node > > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > And here is some log > output of > > the 500 > > >> > error: > > >> > > > > > > > >> > > > > > ProtocolError: > > <ProtocolError for > > >> > > > > > redacted.redacted.com/ipa/json <http://redacted.redacted.com/ipa/json> > > <http://redacted.redacted.com/ipa/json> > > >> > <http://redacted.redacted.com/ipa/json> > > >> > > <http://redacted.redacted.com/ipa/json> > > >> > > > <http://redacted.redacted.com/ipa/json> > > >> > > > > > > <http://redacted.redacted.com/ipa/json> > > >> > > > > > > <http://redacted.redacted.com/ipa/json>: > > >> 500 > > >> > > Internal > > >> > > > Server Error> > > >> > > > > > Cannot connect to the > > server due to > > >> > > generic error: > > >> > > > cannot > > >> > > > > connect to > > >> > 'https://redacted.redacted.com/ipa/json > > >> > > > > > > <https://hauth0003.dug.com/ipa/json>': > > >> > Internal > > >> > > Server Error > > >> > > > > > > > >> > > > > > > > >> > > > > > As well as: > > >> > > > > > > > >> > > > > > 2021-06-02T21:39:11Z > DEBUG > > Starting > > >> > > external process > > >> > > > > > 2021-06-02T21:39:11Z > DEBUG > > >> > > args=/usr/sbin/ipa-join -s > > >> > > > > > redacted.redacted.com > <http://redacted.redacted.com> > > <http://redacted.redacted.com> > > >> > <http://redacted.redacted.com> > > >> > > <http://redacted.redacted.com> < > > >> http://redacted.redacted.com> > > >> > > > <http://redacted.redacted.com> > > >> > > > > > <http://redacted.redacted.com> -b > > >> > > > > > dc=redacted,dc=com -h > > >> > > redactednode.redacted.com > <http://redactednode.redacted.com> > > <http://redactednode.redacted.com> > > >> > <http://redactednode.redacted.com> > > <http://redactednode.redacted.com > > >> > > > >> > > > <http://redactednode.redacted.com> > > >> > > > > > <http://redactednode.redacted.com> > > >> > > > > > > > <http://redactednode.redacted.com> > > >> -f > > >> > > > > > 2021-06-02T21:40:13Z > DEBUG > > Process > > >> > > finished, return > > >> > > > code=17 > > >> > > > > > 2021-06-02T21:40:13Z > DEBUG > > stdout= > > >> > > > > > 2021-06-02T21:40:13Z > DEBUG > > >> stderr=HTTP > > >> > > response code is > > >> > > > > 500, not 200 > > >> > > > > > 2021-06-02T21:40:13Z > ERROR > > Joining > > >> realm > > >> > > failed: HTTP > > >> > > > > response code > > >> > > > > > is 500, not 200 > > >> > > > > > > > >> > > > > > And we also see timeouts > happen: > > >> > > > > > > > >> > > > > > > > >> > > > > > 2021-06-02T22:08:50Z > DEBUG > > >> > > args=/usr/sbin/ipa-join -s > > >> > > > > > redacted.redacted.com > <http://redacted.redacted.com> > > <http://redacted.redacted.com> > > >> > <http://redacted.redacted.com> > > >> > > <http://redacted.redacted.com> < > > >> http://redacted.redacted.com> > > >> > > > <http://redacted.redacted.com> > > >> > > > > > <http://redacted.redacted.com> -b > > >> > > > > > dc=redacted,dc=com -h > > >> > > redactednode.redacted.com > <http://redactednode.redacted.com> > > <http://redactednode.redacted.com> > > >> > <http://redactednode.redacted.com> > > <http://redactednode.redacted.com > > >> > > > >> > > > <http://redactednode.redacted.com> > > >> > > > > > <http://redactednode.redacted.com> > > >> > > > > > > > <http://redactednode.redacted.com> > > >> -f > > >> > > > > > 2021-06-02T22:09:01Z > DEBUG > > Process > > >> > > finished, return > > >> > > > code=17 > > >> > > > > > 2021-06-02T22:09:01Z > DEBUG > > stdout= > > >> > > > > > 2021-06-02T22:09:01Z > DEBUG > > stderr=RPC > > >> > > failed at server. > > >> > > > > Configured > > >> > > > > > time limit exceeded > > >> > > > > > 2021-06-02T22:09:01Z > ERROR > > Joining > > >> realm > > >> > > failed: RPC > > >> > > > failed at > > >> > > > > > server. Configured > time limit > > >> exceeded > > >> > > > > > > > >> > > > > > > > >> > > > > > And we also see later > timeouts > > near the > > >> > end of > > >> > > the log > > >> > > > in some > > >> > > > �� > cases though are able to > > authenticate and > > >> it > > >> > > didn't back > > >> > > > out the > > >> > > > > install, but never got > going healthy > > >> either: > > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > 2021-06-03T19:20:13Z > DEBUG The > > >> > > ipa-client-install > > >> > > > command > > >> > > > > failed, > > >> > > > > > exception: > TimeLimitExceeded: > > >> Configured > > >> > > time limit > > >> > > > exceeded > > >> > > > > > > >> > > > > When you see Internal Error > look > > to the > > >> Apache > > >> > > error log > > >> > > > on the > > >> > > > > server > > >> > > > > for more information. > > >> > > > > > > >> > > > > In this case an LDAP search is > > failing > > >> because > > >> > > the server > > >> > > > is too > > >> > > > > busy. > > >> > > > > Look for queries failing with > > err=3 to get > > >> an > > >> > > idea of how long > > >> > > > > it is taking. > > >> > > > > > > >> > > > > To increase the timeout > use: ipa > > config-mod > > >> > > > --searchtimelimit=INT > > >> > > > > > > >> > > > > The default is 2 seconds. > > >> > > > > > > >> > > > > You can pick a time at > random but > > could see > > >> > > failures again. > > >> > > > > > > >> > > > > rob > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > > > > > > > > > > -- > > / Alexander Bokovoy > > Sr. Principal Software Engineer > > Security / Identity Management Engineering > > Red Hat Limited, Finland > > > _______________________________________________ FreeIPA-users mailing list -- freeipa-users@lists.fedorahosted.org To unsubscribe send an email to freeipa-users-le...@lists.fedorahosted.org Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/ List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines List Archives: https://lists.fedorahosted.org/archives/list/freeipa-users@lists.fedorahosted.org Do not reply to spam on the list, report it: https://pagure.io/fedora-infrastructure