On Пан, 07 жні 2023, Harry G Coin via FreeIPA-users wrote:
This January, Mark Potter first experienced what is now killing my new freeipa setup as well.   Once the primary server on which dns records are being programatically added sees named-pcks11 putting "writeback to ldap failed" in the log: 'systemctl' reports nothing amiss yet ns-slapd becomes entirely unresponsive and all that depend upon dirsrv hangs/times out.

There is nothing in any of the log entries detailing failure, even with ample memory and storage space and processor resources.   There are perhaps 10 hosts in the little lab sandbox, while the system does service DNS requests from the public internet.

I looked through the 'changelogs' and could find nothing that seems to address this, is there a 'work-around' known?  I'd like to avoid using 'bleeding edge' versions of freeipa.

The details of Mark's discussion are here: 
https://lists.fedoraproject.org/archives/list/freeipa-users@lists.fedorahosted.org/thread/WZKFEUAYYABTF5BIFQTNGPIYEJTXXJPY/

What's common to his and my case is that the problem arises during an 'initial programmatic load' of many DNS records in a few dozen zones (often ~200 / in a *.arpa domain),  even when slowed to not more than one every 8 seconds.  But only when at least one replication agreement exists.   It's possible the fact that dnssec is enabled and so 'adding a record' generates further signature set computing activity either provides ldap/bind9 server load that exposes the race-condition-feeling problem or (less likely) is directly involved in the cause.

I agree with Mark's quote:  "Are we going to see everything fail in a spectacular manner and is there anything I can do to mitigate the failure with adding DNS entries as I still need to complete the addition and have ~5k per zone left for two zones."   He 'worked around' the problem by adding a record at a time and slowly (without knowing what 'slowly' means in seconds). I'm unable to guess what 'slowly' means in my setup, but it appears to be more than 8 seconds / update, so fails for me.

Maybe some kind of 'flow control' is required, or a some way to determine whether all replication activity and triggered further 'signature computation' dns activity has concluded after writing a new dns record?

The only 'way out' is to restart ipa.  Of some interest, when that is attempted after these failures, ns-slapd uses 100% CPU, dirsrv hang for several minutes, often longer than the systemctl shutdown / reboot process allows before forcibly killing the process.   This is 100% repeatable.  After a few dozen records are loaded, ns-slapd just stops responding.   The program adding the records is using the python ldap2 interface native on the local host.

If ns-slapd uses 100% CPU, you might be able to get the stack trace out
of those processes. Install debuginfo packages and try to get a pstack.
This should help to narrow it down:

 # pstack $(pgrep ns-slapd)  > ns-slapd log

pstack is part of gdb package.

There are few bugs in 389-ds that were fixed relatively recently (not in
production packages yet) which had been preventing it to respond on
incoming LDAP connections.


Some further details:

Aug 06 15:58:58 registry1.1.somedomain.com named-pkcs11[1497]: zone 177.168.192.in-addr.arpa/IN: sending notifies (serial 1691355538) Aug 06 15:58:58 registry1.1.somedomain.com myprogram.py[35998]: rssmain: ipa command dnsrecord_show: Summary: None args: ('7.somedomain.com.', 'email19') Aug 06 15:59:01 registry1.1.somedomain.com named-pkcs11[1497]: checkhints: unable to get root NS rrset from cache: not found Aug 06 15:59:06 registry1.1.somedomain.com myprogram.py[35998]: rssmain: ipa command dnsrecord_add: Summary: None Aug 06 15:59:07 registry1.1.somedomain.com myprogram.py[35998]: rssmain: ipa command dnsrecord_find: Summary: None args: ('177.168.192.in-addr.arpa.',), 144 records total Aug 06 15:59:07 registry1.1.somedomain.com myprogram.py[35998]: rssmain: ipa command dnsrecord_show: Summary: None args: ('177.168.192.in-addr.arpa.', '147') Aug 06 15:59:07 registry1.1.somedomain.com myprogram.py[35998]: rssmain: ipa command dnsrecord_find: Summary: None args: ('6.0.c.0.0.0.0.0.0.0.0.1.0.0.c.f.ip6.arpa.',), 22 records total Aug 06 15:59:07 registry1.1.somedomain.com ns-slapd[1434]: [06/Aug/2023:15:59:07.848595593 -0500] - WARN - content-sync-plugin - sync_update_persist_betxn_pre_op - DB retried operation targets "changenumber=59921,cn=changelog" (op=0x7f6b6fc1f400 idx_pl=0) => op not c> Aug 06 15:59:16 registry1.1.somedomain.com myprogram.py[35998]: rssmain: ipa command dnsrecord_add: Summary: None Aug 06 15:59:16 registry1.1.somedomain.com myprogram.py[35998]: rssmain: ipa command dnsrecord_show: Summary: None args: ('7.somedomain.com.', 'email19') Aug 06 15:59:32 registry1.1.somedomain.com named-pkcs11[1497]: zone 7.somedomain.com/IN (unsigned): serial (1691355547) write back to LDAP failed Aug 06 15:59:33 registry1.1.somedomain.com named-pkcs11[1497]: zone 7.somedomain.com/IN (signed): serial 1691355547 (unsigned 1691355547) Aug 06 15:59:33 registry1.1.somedomain.com named-pkcs11[1497]: zone 7.somedomain.com/IN (signed): sending notifies (serial 1691355547) Aug 06 16:01:01 registry1.1.somedomain.com CROND[52537]: (root) CMD (run-parts /etc/cron.hourly) Aug 06 16:01:01 registry1.1.somedomain.com run-parts[52540]: (/etc/cron.hourly) starting 0anacron Aug 06 16:01:01 registry1.1.somedomain.com run-parts[52546]: (/etc/cron.hourly) finished 0anacron Aug 06 16:02:05 registry1.1.somedomain.com sssd_kcm[19097]: Shutting down (status = 0) Aug 06 16:02:05 registry1.1.somedomain.com systemd[1]: sssd-kcm.service: Succeeded. Aug 06 16:02:20 registry1.1.somedomain.com systemd[1]: loclibrary.mount: Succeeded. Aug 06 16:02:20 registry1.1.somedomain.com systemd[1]: qflibrary.mount: Succeeded. Aug 06 16:07:37 registry1.1.somedomain.com sssd_be[659]: GSSAPI client step 1 Aug 06 16:07:37 registry1.1.somedomain.com sssd_be[659]: GSSAPI client step 1 Aug 06 16:08:34 registry1.1.somedomain.com named-pkcs11[1497]: zone 9.somedomain.com/IN (signed): Key 9.somedomain.com/RSASHA256/60467 missing or inactive and has no replacement: retaining signatures. Aug 06 16:08:34 registry1.1.somedomain.com named-pkcs11[1497]: zone 9.somedomain.com/IN (signed): sending notifies (serial 1691350608) Aug 06 16:09:08 registry1.1.somedomain.com sssd_be[659]: GSSAPI client step 1 Aug 06 16:09:08 registry1.1.somedomain.com sssd_be[659]: GSSAPI client step 1 Aug 06 16:09:18 registry1.1.somedomain.com sssd_be[659]: GSSAPI client step 1 Aug 06 16:09:18 registry1.1.somedomain.com sssd_be[659]: GSSAPI client step 1 Aug 06 16:09:30 registry1.1.somedomain.com sssd_be[659]: GSSAPI client step 1 Aug 06 16:09:30 registry1.1.somedomain.com sssd_be[659]: GSSAPI client step 1
Aug 06 16:11:05 registry1.1.somedomain.com sssd_be[659]: Backend is offline
Aug 06 16:11:38 registry1.1.somedomain.com named-pkcs11[1497]: zone 3.somedomain.com/IN (signed): Key 3.somedomain.com/RSASHA256/18257 missing or inactive and has no replacement: retaining signatures. Aug 06 16:11:38 registry1.1.somedomain.com named-pkcs11[1497]: zone 3.somedomain.com/IN (signed): sending notifies (serial 1691350754) Aug 06 16:12:37 registry1.1.somedomain.com named-pkcs11[1497]: zone 7.somedomain.com/IN (signed): Key 7.somedomain.com/RSASHA256/40511 missing or inactive and has no replacement: retaining signatures. Aug 06 16:12:37 registry1.1.somedomain.com named-pkcs11[1497]: zone 7.somedomain.com/IN (signed): sending notifies (serial 1691355548) Aug 06 16:20:01 registry1.1.somedomain.com CROND[52573]: (root) CMD (test -f /var/lock/subsys/ods-enforcerd && kill -s SIGHUP `cat /var/run/opendnssec/enforcerd.pid` > /dev/null 2> /dev/null) Aug 06 16:29:21 registry1.1.somedomain.com systemd[1]: Stopping 389 Directory Server 1-somedomain-COM.... Aug 06 16:29:21 registry1.1.somedomain.com ns-slapd[1434]: [06/Aug/2023:16:29:21.659348276 -0500] - INFO - op_thread_cleanup - slapd shutting down - signaling operation threads - op stack size 0 max work q size 600 max work q stack size 4 Aug 06 16:29:21 registry1.1.somedomain.com ns-slapd[1434]: [06/Aug/2023:16:29:21.735442997 -0500] - INFO - slapd_daemon - slapd shutting down - waiting for 16 threads to terminate Aug 06 16:29:22 registry1.1.somedomain.com ns-slapd[1434]: [06/Aug/2023:16:29:22.278851959 -0500] - ERR - NSMMReplicationPlugin - release_replica - agmt="cn=meToregistry2.1.somedomain.com" (registry2:389): Attempting to release replica, but unable to receive endRe> Aug 06 16:29:22 registry1.1.somedomain.com ns-slapd[1434]: [06/Aug/2023:16:29:22.307800436 -0500] - WARN - NSMMReplicationPlugin - acquire_replica - agmt="cn=meToregistry2.1.somedomain.com" (registry2:389): Unable to receive the response for a startReplication ext> Aug 06 16:34:03 registry1.1.somedomain.com ods-enforcerd[2124]: [enforcer] update zone: _domainkey.anotherdomain.com Aug 06 16:34:04 registry1.1.somedomain.com ods-enforcerd[2124]: [signconf_cmd] performing signconf for zone _domainkey.anotherdomain.com Aug 06 16:34:04 registry1.1.somedomain.com ods-enforcerd[2124]: [signconf_cmd] signconf done for zone _domainkey.anotherdomain.com, notifying signer Aug 06 16:34:04 registry1.1.somedomain.com systemd[1]: Started IPA OpenDNSSEC Signer replacement. Aug 06 16:34:07 registry1.1.somedomain.com ipa-ods-exporter[52623]: ipa-ods-exporter: INFO     To increase debugging set debug=True in dns.conf See default.conf(5) for details Aug 06 16:34:11 registry1.1.somedomain.com ipa-ods-exporter[52623]: Traceback (most recent call last): Aug 06 16:34:11 registry1.1.somedomain.com ipa-ods-exporter[52623]:   File "/usr/lib/python3.6/site-packages/ipapython/ipaldap.py", line 1085, in error_handler Aug 06 16:34:11 registry1.1.somedomain.com ipa-ods-exporter[52623]:     yield Aug 06 16:34:11 registry1.1.somedomain.com ipa-ods-exporter[52623]:   File "/usr/lib/python3.6/site-packages/ipapython/ipaldap.py", line 1257, in gssapi_bind Aug 06 16:34:11 registry1.1.somedomain.com ipa-ods-exporter[52623]:     '', auth_tokens, server_controls, client_controls) Aug 06 16:34:11 registry1.1.somedomain.com ipa-ods-exporter[52623]:   File "/usr/lib64/python3.6/site-packages/ldap/ldapobject.py", line 476, in sasl_interactive_bind_s Aug 06 16:34:11 registry1.1.somedomain.com ipa-ods-exporter[52623]:     return self._ldap_call(self._l.sasl_interactive_bind_s,who,auth,RequestControlTuples(serverctrls),RequestControlTuples(clientctrls),sasl_flags) Aug 06 16:34:11 registry1.1.somedomain.com ipa-ods-exporter[52623]:   File "/usr/lib64/python3.6/site-packages/ldap/ldapobject.py", line 340, in _ldap_call Aug 06 16:34:11 registry1.1.somedomain.com ipa-ods-exporter[52623]:     reraise(exc_type, exc_value, exc_traceback) Aug 06 16:34:11 registry1.1.somedomain.com ipa-ods-exporter[52623]:   File "/usr/lib64/python3.6/site-packages/ldap/compat.py", line 46, in reraise Aug 06 16:34:11 registry1.1.somedomain.com ipa-ods-exporter[52623]:     raise exc_value Aug 06 16:34:11 registry1.1.somedomain.com ipa-ods-exporter[52623]:   File "/usr/lib64/python3.6/site-packages/ldap/ldapobject.py", line 324, in _ldap_call Aug 06 16:34:11 registry1.1.somedomain.com ipa-ods-exporter[52623]:     result = func(*args,**kwargs) Aug 06 16:34:11 registry1.1.somedomain.com ipa-ods-exporter[52623]: ldap.SERVER_DOWN: {'result': -1, 'desc': "Can't contact LDAP server", 'errno': 111, 'ctrls': [], 'info': 'Connection refused'}



Name         : ipa-server-common
Version      : 4.9.11
Release      : 6.module_el8.8.0+3593+1210bde8.alma.1
Architecture : noarch
Size         : 621 k
Source       : ipa-4.9.11-6.module_el8.8.0+3593+1210bde8.alma.1.src.rpm
Repository   : appstream
Summary      : Common files used by IPA server
...

Name         : ipa-server-dns
Version      : 4.9.11
Release      : 6.module_el8.8.0+3593+1210bde8.alma.1
Architecture : noarch
Size         : 197 k
Source       : ipa-4.9.11-6.module_el8.8.0+3593+1210bde8.alma.1.src.rpm
Repository   : appstream

[root@registry2 ~]# uname -r
4.18.0-477.15.1.el8_8.x86_64





_______________________________________________
FreeIPA-users mailing list -- freeipa-users@lists.fedorahosted.org
To unsubscribe send an email to freeipa-users-le...@lists.fedorahosted.org
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedorahosted.org/archives/list/freeipa-users@lists.fedorahosted.org
Do not reply to spam, report it: 
https://pagure.io/fedora-infrastructure/new_issue



--
/ Alexander Bokovoy
Sr. Principal Software Engineer
Security / Identity Management Engineering
Red Hat Limited, Finland
_______________________________________________
FreeIPA-users mailing list -- freeipa-users@lists.fedorahosted.org
To unsubscribe send an email to freeipa-users-le...@lists.fedorahosted.org
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedorahosted.org/archives/list/freeipa-users@lists.fedorahosted.org
Do not reply to spam, report it: 
https://pagure.io/fedora-infrastructure/new_issue

Reply via email to