On Wed, Oct 20, 2021 at 07:47:30AM +0200, Mischa wrote:
> On 2021-10-20 07:30, Otto Moerbeek wrote:
> > On Tue, Oct 19, 2021 at 09:47:22PM +0200, Martijn van Duren wrote:
> > > On Tue, 2021-10-19 at 19:56 +0200, Otto Moerbeek wrote:
> > > > On Tue, Oct 19, 2021 at 07:49:15PM +0200, Mischa wrote:
> > > > > On 2021-10-15 20:05, Otto Moerbeek wrote:
> > > > > > On Fri, Oct 15, 2021 at 07:47:22PM +0200, Mischa wrote:
> > > > > > > On 2021-10-15 19:42, Otto Moerbeek wrote:
> > > > > > > > On Fri, Oct 15, 2021 at 07:16:55PM +0200, Mischa wrote:
> > > > > > > >
> > > > > > > > > On 2021-10-15 18:27, Otto Moerbeek wrote:
> > > > > > > > > >
> > > > > > > > > > The actual problem (SIGSEGV) happens in the child
> > > > > > > > > > processes: ktrace the children as well: ktrace -di ...
> > > > > > > > > >
> > > > > > > > > > -Otto
> > > > > > > > >
> > > > > > > > > Thanx Otto.
> > > > > > > > > Below is the kdump with ktrace -di
> > > > > > > > > It's quite a lot of data but I didn't want to remove
> > > > > > > > > something that could potentially be useful.
> > > > > > > > >
> > > > > > > > > Mischa
> > > > > > > > >
> > > > > > > >
> > > > > > > > The pattern below happens multiple times:
> > > > > > > >
> > > > > > > > A recvfrom of 101 bytes and after that a SIGSEGV.
> > > > > > > >
> > > > > > > > Now we do not know for sure if those two lines are related.
> > > > > > > >
> > > > > > > > I suspect that it is no coincidence that the 101 is one larger
> > > > > > > > than 100...
> > > > > > > >
> > > > > > > > No other clue yet.
> > > > > > >
> > > > > > > Anything else I can collect?
> > > > > >
> > > > > > You might want to compile and install nsd with debug symbol info:
> > > > > >
> > > > > > cd /usr/src/usr.sbin/nsd
> > > > > > make -f Makefile.bsd-wrapper obj
> > > > > > make -f Makefile.bsd-wrapper clean
> > > > > > DEBUG=-g make -f Makefile.bsd-wrapper
> > > > > > make -f Makefile.bsd-wrapper install
> > > > > >
> > > > > >
> > > > > > Then: collect a gdb trace from a running process: install gdb from
> > > > > > ports, run
> > > > > > egdb --pid=pidofnsdchild /usr/sbin/nsd
> > > > > >
> > > > > > and wait for the crash.
> > > > > >
> > > > > > But I'm mostly unfamiliar with the nsd code and what has been
> > > > > > changed recently. I'd say make sure sthen@ and florian@ see this:
> > > > > > move to bugs@ as I do not know if they read misc@.
> > > > >
> > > > > Thanx Otto.
> > > > >
> > > > > As this is my first time using gdb, I need some assistance.
> > > > >
> > > > > root@name2:~ # ps -aux | grep nsd
> > > > > _nsd 79188 0.0 1.0 101704 86400 ?? Ip 7:31PM 0:00.20 nsd: xfrd (nsd)
> > > > > _nsd 24002 0.0 0.4 37188 37388 ?? Ip 7:31PM 0:00.29 nsd: main (nsd)
> > > > > _nsd 44937 0.0 0.2 37544 18308 ?? Sp 7:45PM 0:00.11 nsd: server 1 (nsd)
> > > > >
> > > > > root@name2:~ # egdb --pid=44937 /usr/sbin/nsd
> > > > > GNU gdb (GDB) 7.12.1
> > > > > Copyright (C) 2017 Free Software Foundation, Inc.
> > > > > License GPLv3+: GNU GPL version 3 or later
> > > > > <http://gnu.org/licenses/gpl.html>
> > > > > This is free software: you are free to change and redistribute it.
> > > > > There is NO WARRANTY, to the extent permitted by law. Type "show copying"
> > > > > and "show warranty" for details.
> > > > > This GDB was configured as "x86_64-unknown-openbsd7.0".
> > > > > Type "show configuration" for configuration details.
> > > > > For bug reporting instructions, please see:
> > > > > <http://www.gnu.org/software/gdb/bugs/>.
> > > > > Find the GDB manual and other documentation resources online at:
> > > > > <http://www.gnu.org/software/gdb/documentation/>.
> > > > > For help, type "help".
> > > > > Type "apropos word" to search for commands related to "word"...
> > > > > Reading symbols from /usr/sbin/nsd...(no debugging symbols found)...done.
> > > > > Attaching to program: /usr/sbin/nsd, process 44937
> > > > > Reading symbols from /usr/lib/libssl.so.50.0...done.
> > > > > Reading symbols from /usr/lib/libcrypto.so.47.0...done.
> > > > > Reading symbols from /usr/lib/libevent.so.4.1...done.
> > > > > Reading symbols from /usr/lib/libc.so.96.1...done.
> > > > > Reading symbols from /usr/libexec/ld.so...done.
> > > > > [Switching to thread 563101]
> > > > > kevent () at /tmp/-:3
> > > > > 3 /tmp/-: No such file or directory.
> > > > >
> > > > > Anything I am missing?
> > > > >
> > > > > Mischa
> > > > >
> > > >
> > > > Do you see a gdb prompt? If so
> > > >
> > > > continue
> > > >
> > > > should do it (and then wait for the crash).
> > > >
> > > > If you still see the crashes, a tcpdump of the traffic to nsd might
> > > > help as well; I can replay that locally against nsd. I would also
> > > > need your nsd config for that.
> > > >
> > > > -Otto
> > > >
> > > I did some debugging with Mischa.
> > >
> > > Unfortunately I misclicked and deleted the backtrace. However, the
> > > problem was that query.c calls add_rrset (query.c:736) from
> > > answer_delegation (query.c:917), where rrset is NULL.
> > >
> > > When looking in the original query it was always a PTR request to
> > > an IPv6 record. When looking through the file we tried to remove
> > > some likely suspect entries to see if we could pinpoint the root
> > > cause, but after re-adding everything it wouldn't crash anymore.
> > >
> > > Adding a simple comment to the zonefile of the second NS server
> > > yielded the same result: the server won't crash anymore.
> > >
> > > Mischa is going to monitor the situation to see if the issues
> > > return, but my current best guess is that some weird state got
> > > cached somewhere somehow and got flushed when saving the
> > > zonefile.
> > >
> > > martijn@
> > >
> >
> > Maybe some form of corruption in the zonefile that was removed when
> > saving? Who knows.... Anyway, thanks for taking care.
>
> Unfortunately our joy was short lived. This morning I noticed a lot of
> Oct 20 07:44:15 name1 nsd[80814]: server 76410 died unexpectedly with status 11, restarting
>
> It looks like there is a potential fix in version 4.3.8.
>
> https://github.com/NLnetLabs/nsd/issues/195
> https://github.com/NLnetLabs/nsd/issues/189
>
> https://github.com/NLnetLabs/nsd/blob/NSD_4_3_8_REL/doc/ChangeLog
> 23 August 2021: Wouter
> - Fix #189: nsd 4.3.7 crash answer_delegation: Assertion
> `query->delegation_rrset' failed.
>
> (Thanx Roger!)
>
> As far as I can tell from the things Martijn found it might be the case.
>
> Will give that a try and report back.
>
> Mischa

Are you going to try just the one-line fix or the whole of 4.3.8?

I suppose if we want to backport to -stable the one-line fix is
preferred.

	-Otto