So the DNS overload was my own fault. I was using 'while' in Ansible and adding one entry at a time instead of just generating a playbook that adds multiple entries per task. I've tested with 100 entries and saw a single update per zone go out to the replicas, so I've sorted that. I shouldn't write Ansible on almost no sleep.
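For reference, the batched approach looks roughly like this. It's only a minimal sketch assuming the ansible-freeipa ipadnsrecord module and its 'records' list; the node names and addresses below are placeholders, and in practice the list is generated from our inventory data:

    ---
    - name: Add DNS records in batches
      hosts: ipaserver
      become: true
      tasks:
        - name: Ensure a batch of A records exists in example.com
          ipadnsrecord:
            ipaadmin_password: "{{ ipaadmin_password }}"
            zone_name: example.com
            records:
              # One task carries the whole batch instead of looping per entry.
              # Placeholder names/addresses; real data comes from inventory.
              - name: node0001
                record_type: A
                record_value: 10.201.10.1
              - name: node0002
                record_type: A
                record_value: 10.201.10.2

One task per zone like this is what gave me the single update per zone I saw in the 100-entry test.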
However, the question remains whether ~9k diskless nodes booting and all running 'ipa-client-install' with the '--force' option would overload the cluster in the same manner.

On Thu, Jan 28, 2021, 11:14 AM Mark Potter <ma...@dug.com> wrote:

> The docs say 2k to 3k hosts per FreeIPA machine. We currently have 1 server and 3 replicas for ~9k hosts. The issue is that the hosts in question are stateless, so ipa-client-install has to run on every boot. We've got that part handled, but something came up that's got me concerned.
>
> I was adding DNS records using ansible-freeipa. With DNS needed for all of our sites along with BMC and such, we have about ~38k valid DNS entries. I was running two playbooks to add entries in parallel because we need everything to resolve on example.com and example1.com. This is an artifact that can't be avoided, so we end up with ~76k entries across two zones.
>
> The example.com entries were being added with reverse records and the example1.com entries without. Based on the time it took for each entry to be added, this should have taken ~31 hours. At some point the three replicas stopped responding to any requests. For instance, ipa1.example.com (primary) would validate while adding a host but ipa2.example.com (a replica) would hang and never time out. Eventually both playbooks failed at ~33k DNS entries as ssh wasn't responding on the primary. I wasn't monitoring at that point, so I didn't get to see it happen. There is nothing from OOM in the logs, so it doesn't look like sshd got killed from memory usage; when I was monitoring, load never got over 2.
>
> The VMs have 16GiB memory, 6 cores, and a 10Gb connection. They are running CentOS 7 with FreeIPA 4.6.5. Logs on ipa1 show:
>
> Jan 27 11:39:14 ipa1 ns-slapd: [27/Jan/2021:11:39:14.363156372 -0600] - WARN - NSMMReplicationPlugin - acquire_replica - agmt="cn=meToipa2.example.com" (ipa2:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
>
> The same appears for both the left and right replicas (ipa2 and ipa3).
>
> The replicas show:
>
> Jan 27 16:38:02 ipa2 named-pkcs11[2516]: LDAP query timed out. Try to adjust "timeout" parameter
> Jan 27 16:38:02 ipa2 named-pkcs11[2516]: zone example.com/IN: serial (1611787052) write back to LDAP failed
> Jan 27 16:38:12 ipa2 named-pkcs11[2516]: LDAP query timed out. Try to adjust "timeout" parameter
> Jan 27 16:38:12 ipa2 named-pkcs11[2516]: zone 16.172.in-addr.arpa/IN: serial (1611787062) write back to LDAP failed
>
> Which eventually became:
>
> Jan 27 16:57:32 ipa2 named-pkcs11[2516]: zone example.com/IN: serial (1611788192) write back to LDAP failed
> Jan 27 16:58:22 ipa2 named-pkcs11[2516]: timeout in ldap_pool_getconnection(): try to raise 'connections' parameter; potential deadlock?
>
> This was happening in the krb5kdc.log on the replicas around the same time:
>
> Jan 27 15:20:41 ipa2.example.com krb5kdc[26712](info): AS_REQ (8 etypes {18 17 20 19 16 23 25 26}) 10.201.1.5: LOOKING_UP_CLIENT: ma...@example.com for krbtgt/example....@example.com, Server error
>
> And in dirsrv/slapd-EXAMPLE-COM/error in the same timeframe:
>
> [27/Jan/2021:15:58:04.885131721 -0600] - ERR - NSMMReplicationPlugin - bind_and_check_pwp - agmt="cn=ipa2.example.com-to-ipa3.example.com" (ipa3:389) - Replication bind with GSSAPI auth failed: LDAP error -1 (Can't contact LDAP server) ()
>
> Though less frequent, these appeared again until I rebooted the replicas this morning.
> Running `ipactl restart` would just hang; I let it sit for ten minutes at one point. `ipactl status` consistently showed everything running.
>
> My topology looks like this (both ca and domain are the same):
>
> ------------------
> 5 segments matched
> ------------------
> Segment name: ipa1.example.com-to-ipa2.example.com
> Left node: ipa1.example.com
> Right node: ipa2.example.com
> Connectivity: both
>
> Segment name: ipa1.example.com-to-ipa3.example.com
> Left node: ipa1.example.com
> Right node: ipa3.example.com
> Connectivity: both
>
> Segment name: ipa1.example.com-to-ipa4.example.com
> Left node: ipa1.example.com
> Right node: ipa4.example.com
> Connectivity: both
>
> Segment name: ipa3.example.com-to-ipa2.example.com
> Left node: ipa3.example.com
> Right node: ipa2.example.com
> Connectivity: both
>
> Segment name: ipa4.example.com-to-ipa3.example.com
> Left node: ipa4.example.com
> Right node: ipa3.example.com
> Connectivity: both
> ----------------------------
> Number of entries returned 5
> ----------------------------
>
> Since the play takes slightly more than 2 seconds to run when creating with reverse, and slightly under 2 seconds when creating without, I don't see why this should ever overload anything, but I will freely admit I am not all that familiar with the way DNS is handled. If FreeIPA is sending the entire zone file for every update and it all has to be written to the DB, then I can see why that would be an issue. I could kill the replication agreements, load the rest of the entries, then re-add the agreements so the zone only needs to be transferred once. But it's still a bit concerning due to the scenario I described above.
>
> If we have a power outage and need to boot ~9k machines, all of which will run:
>
> ipa-client-install -U -q -p <service account for adding hosts> \
>     -w <some really secure password> \
>     --domain=example.com \
>     --server=ipa1.example.com \
>     --server=ipa2.example.com \
>     --server=ipa3.example.com \
>     --server=ipa4.example.com \
>     --force-join \
>     --enable-dns-updates \
>     --ssh-trust-dns \
>     --automount-location=<appropriate map>
>
> are we going to see everything fail in a spectacular manner? And is there anything I can do to mitigate the failure while adding DNS entries? I still need to complete the addition and have ~5k entries per zone left for two zones.