Hi Brian, here are some hints on what you can do to get more information out of the running `named` process:
https://kb.isc.org/docs/aa-00341

Basically pstack / eu-stack and/or gcore (or attaching gdb). It's important to have debugging symbols present; without them it's virtually impossible to debug the issue.

I would suggest you file an issue in our GitLab (gitlab.isc.org) and we can continue there. Also, please include information about the previous BIND 9 version.

Ondrej
--
Ondřej Surý (He/Him)
ond...@isc.org

My working hours and your working hours may be different. Please do not feel obligated to reply outside your normal working hours.

> On 25. 7. 2024, at 12:02, Sebby, Brian A. via bind-users <bind-users@lists.isc.org> wrote:
>
> I upgraded our DNS servers when the 9.18.28 release came out, and ran into a problem today that I wanted to know if anyone else had seen or had any suggestions about how to debug.
>
> We have our DNS configured in a hidden primary configuration, where the primary has internal and external views and serves an internal and external copy of one of our domains. The external version is fairly small, while the internal version is significantly larger. We use the same DNSSEC keys to sign both versions of the domain. Every once in a while, we have encountered an issue where the unsigned and signed versions of the domain get out of sync, which causes this message to appear in our logs (note that I have modified all of the following log entries to replace our domain with example.org):
>
> 25-Jul-2024 10:12:32.202 general: error: zone example.org/IN/internal (signed): receive_secure_serial: not exact
>
> The solution I’ve always been able to follow previously is to comment out the DNSSEC config options in named.conf, restart named with the zone unsigned, retransfer the unsigned zone to our secondaries, and then put back the DNSSEC config options, restart named, and let it re-sign the zone. It takes a little bit, but normally everything has then gotten back to normal.
>
> Today, however, when I tried to do that, it started to sign the zone – and then named just hung. It stopped updating any of the log files, stopped sending any notifies, and stopped returning DNS data of any sort. When I tried to restart named via systemctl it had to kill the process because named would not respond. I was able to undo the DNSSEC changes, restart named, and it continued to work. I tried it again, and named hung once again in the middle of signing the zone. Throughout all of these restarts, the signed version of the external zone continued to work normally.
>
> This is frustrating because when named hangs, there are no error messages in the logs that I can see, and no indication of why it has failed. If I try running rndc commands locally I get this error:
>
> rndc: recv failed: timed out
>
> Remote servers show a timeout and then I saw this in some of their transfer logs:
>
> 25-Jul-2024 10:27:01.827 general: info: zone example.org/IN: refresh: skipping zone transfer as primary A.B.C.D#53 (source E.F.G.H#0) is unreachable (cached)
>
> I was able to solve that one by sending notifies from the primary after restarting it without DNSSEC, but I really need to get DNSSEC working again.
>
> The configuration for the zone in named.conf is (and yes, I know I need to update to dnssec-policy):
>
> view "internal" {
>     ...
>     zone "example.org" {
>         type primary;
>         file "/path/to/internal/example.org";
>         key-directory "/path/to/keys";
>         auto-dnssec maintain;
>         inline-signing yes;
>     };
>     ...
> };
>
> Does anyone have any suggestions for putting named into a debug mode to try to get more data if it hangs again? I was thinking of turning the DNSSEC options back on but setting “notify no” so it didn’t try to notify the secondaries in case all of the notifies and zone transfers going on while it was signing was part of the problem.
>
> The memory and CPU resources of the system should be sufficient – it’s got 2 virtual CPUs and 8GB of memory, but it’s not close to using up the memory, and since it doesn’t have clients, the CPU has never been an issue before. I tried replicating this issue on our test server but it managed to sign the zone with no problems – though it doesn’t have as many clients.
>
> I don’t think the new max-records-per-type or max-types-per-name options are involved as we don’t have any cases where we have that many records with the same name.
>
> Thanks,
>
> Brian
>
> --
> Brian Sebby (he/him/his)  |  Lead Systems Engineer
> Email: se...@anl.gov      |  Information Technology Infrastructure
> Phone: +1 630.252.9935    |  Business Information Services
> Cell: +1 630.921.4305     |  Argonne National Laboratory
>
> --
> Visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from this list
>
> ISC funds the development of this software with paid support subscriptions. Contact us at https://www.isc.org/contact/ for more information.
>
> bind-users mailing list
> bind-users@lists.isc.org
> https://lists.isc.org/mailman/listinfo/bind-users
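
For reference, the stack-capture advice at the top of this reply (attaching gdb, or taking a core with gcore) could be scripted roughly as below, so it is quick to run the next time `named` hangs. This is only a sketch under assumptions: the helper names and output path are made up for illustration, gdb and gcore must be installed, the caller needs permission to attach to the process, and debugging symbols for BIND 9 must be present for the output to be useful.

```python
import shutil
import subprocess


def backtrace_command(pid: int) -> list[str]:
    """Build a gdb call that prints a backtrace of every thread, then exits.

    -batch makes gdb quit after running the -ex command, which detaches
    from the process again.
    """
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def core_command(pid: int, prefix: str) -> list[str]:
    """Build a gcore call; gcore names the core file <prefix>.<pid>."""
    return ["gcore", "-o", prefix, str(pid)]


def capture_backtraces(pid: int, out_path: str) -> int:
    """Run gdb against the process and save all thread backtraces to out_path.

    Returns gdb's exit status. The file name is chosen by the caller; nothing
    here is specific to BIND beyond being pointed at the named PID.
    """
    if shutil.which("gdb") is None:
        raise RuntimeError("gdb not found in PATH")
    with open(out_path, "w") as out:
        result = subprocess.run(
            backtrace_command(pid), stdout=out, stderr=subprocess.STDOUT
        )
    return result.returncode
```

Calling `capture_backtraces(pid_of_named, "named-stacks.txt")` while the process is hung would give a file suitable for attaching to a GitLab issue; `core_command` only builds the argument list, since a full core of a large zone can be big and is better taken deliberately.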