On Wed, Oct 20, 2021 at 07:47:30AM +0200, Mischa wrote:

> On 2021-10-20 07:30, Otto Moerbeek wrote:
> > On Tue, Oct 19, 2021 at 09:47:22PM +0200, Martijn van Duren wrote:
> > > On Tue, 2021-10-19 at 19:56 +0200, Otto Moerbeek wrote:
> > > > On Tue, Oct 19, 2021 at 07:49:15PM +0200, Mischa wrote:
> > > > > On 2021-10-15 20:05, Otto Moerbeek wrote:
> > > > > > On Fri, Oct 15, 2021 at 07:47:22PM +0200, Mischa wrote:
> > > > > > > On 2021-10-15 19:42, Otto Moerbeek wrote:
> > > > > > > > On Fri, Oct 15, 2021 at 07:16:55PM +0200, Mischa wrote:
> > > > > > > >
> > > > > > > > > On 2021-10-15 18:27, Otto Moerbeek wrote:
> > > > > > > > > >
> > > > > > > > > > The actual problem (SIGSEGV) happens in the child processes:
> > > > > > > > > > ktrace the children as well: ktrace -di ...
> > > > > > > > > >
> > > > > > > > > >     -Otto
> > > > > > > > >
> > > > > > > > > Thanx Otto.
> > > > > > > > > Below is the kdump with ktrace -di.
> > > > > > > > > It's quite a lot of data but I didn't want to remove something
> > > > > > > > > that could potentially be useful.
> > > > > > > > >
> > > > > > > > > Mischa
> > > > > > > > >
> > > > > > > >
> > > > > > > > The pattern below happens multiple times:
> > > > > > > >
> > > > > > > > A recvfrom of 101 bytes and after that a SIGSEGV.
> > > > > > > >
> > > > > > > > Now we do not know for sure if those two lines are related.
> > > > > > > >
> > > > > > > > I suspect that it is no coincidence that the 101 is one larger
> > > > > > > > than 100...
> > > > > > > >
> > > > > > > > No other clue yet.
> > > > > > >
> > > > > > > Anything else I can collect?
> > > > > >
> > > > > > You might want to compile and install nsd with debug symbols:
> > > > > >
> > > > > >     cd /usr/src/usr.sbin/nsd
> > > > > >     make -f Makefile.bsd-wrapper obj
> > > > > >     make -f Makefile.bsd-wrapper clean
> > > > > >     DEBUG=-g make -f Makefile.bsd-wrapper
> > > > > >     make -f Makefile.bsd-wrapper install
> > > > > >
> > > > > >
> > > > > > Then collect a gdb trace from a running process: install gdb from
> > > > > > ports and run
> > > > > >     egdb --pid=pidofnsdchild /usr/sbin/nsd
> > > > > >
> > > > > > and wait for the crash.
> > > > > >
> > > > > > But I'm mostly unfamiliar with the nsd code and what has been changed
> > > > > > recently.  I'd say make sure sthen@ and florian@ see this: move to
> > > > > > bugs@ as I do not know if they read misc@.
> > > > >
> > > > > Thanx Otto.
> > > > >
> > > > > As this is my first time using gdb, I need some assistance.
> > > > >
> > > > > root@name2:~ # ps -aux | grep nsd
> > > > > _nsd     79188  0.0  1.0 101704 86400 ??  Ip      7:31PM    0:00.20 nsd: xfrd (nsd)
> > > > > _nsd     24002  0.0  0.4 37188 37388 ??  Ip      7:31PM    0:00.29 nsd: main (nsd)
> > > > > _nsd     44937  0.0  0.2 37544 18308 ??  Sp      7:45PM    0:00.11 nsd: server 1 (nsd)
> > > > >
> > > > > root@name2:~ # egdb --pid=44937 /usr/sbin/nsd
> > > > > GNU gdb (GDB) 7.12.1
> > > > > Copyright (C) 2017 Free Software Foundation, Inc.
> > > > > License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> > > > > This is free software: you are free to change and redistribute it.
> > > > > There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> > > > > and "show warranty" for details.
> > > > > This GDB was configured as "x86_64-unknown-openbsd7.0".
> > > > > Type "show configuration" for configuration details.
> > > > > For bug reporting instructions, please see:
> > > > > <http://www.gnu.org/software/gdb/bugs/>.
> > > > > Find the GDB manual and other documentation resources online at:
> > > > > <http://www.gnu.org/software/gdb/documentation/>.
> > > > > For help, type "help".
> > > > > Type "apropos word" to search for commands related to "word"...
> > > > > Reading symbols from /usr/sbin/nsd...(no debugging symbols found)...done.
> > > > > Attaching to program: /usr/sbin/nsd, process 44937
> > > > > Reading symbols from /usr/lib/libssl.so.50.0...done.
> > > > > Reading symbols from /usr/lib/libcrypto.so.47.0...done.
> > > > > Reading symbols from /usr/lib/libevent.so.4.1...done.
> > > > > Reading symbols from /usr/lib/libc.so.96.1...done.
> > > > > Reading symbols from /usr/libexec/ld.so...done.
> > > > > [Switching to thread 563101]
> > > > > kevent () at /tmp/-:3
> > > > > 3       /tmp/-: No such file or directory.
> > > > >
> > > > > Anything I am missing?
> > > > >
> > > > > Mischa
> > > > >
> > > >
> > > > Do you see a gdb prompt? If so
> > > >
> > > >   continue
> > > >
> > > > should do it (and then wait for the crash).
> > > >
> > > > If you still see the crashes, a tcpdump of the traffic to nsd might
> > > > help as well; I can replay that locally against nsd. I would also
> > > > need your nsd config for that.
> > > >
> > > >         -Otto
> > > >
> > > I did some debugging with Mischa.
> > > 
> > > Unfortunately I misclicked and deleted the backtrace. However, the
> > > problem was that query.c calls add_rrset (query.c:736) from
> > > answer_delegation (query.c:917), where rrset is NULL.
> > > 
> > > When looking in the original query it was always a PTR request to
> > > an IPv6 record. When looking through the file we tried to remove
> > > some likely suspect entries to see if we could pinpoint the root
> > > cause, but after re-adding everything it wouldn't crash anymore.
> > > 
> > > Adding a simple comment to the zonefile of the second NS server
> > > yielded the same result: the server won't crash anymore.
> > > 
> > > Mischa is going to monitor the situation to see if the issues
> > > return, but my current best guess is that some weird state got
> > > cached somewhere somehow and got flushed when saving the
> > > zonefile.
> > > 
> > > martijn@
> > > 
> > 
> > Maybe some form of corruption in the zonefile that was removed when
> > saving? Who knows.... Anyway, thanks for taking care.
> 
> Unfortunately our joy was short-lived. This morning I noticed a lot of these:
> Oct 20 07:44:15 name1 nsd[80814]: server 76410 died unexpectedly with status 11, restarting
> 
> It looks like there is a potential fix in version 4.3.8.
> 
> https://github.com/NLnetLabs/nsd/issues/195
> https://github.com/NLnetLabs/nsd/issues/189
> 
> https://github.com/NLnetLabs/nsd/blob/NSD_4_3_8_REL/doc/ChangeLog
> 23 August 2021: Wouter
> - Fix #189: nsd 4.3.7 crash answer_delegation: Assertion
> `query->delegation_rrset' failed.
> 
> (Thanx Roger!)
> 
> As far as I can tell from the things Martijn found it might be the case.
> 
> Will give that a try and report back.
> 
> Mischa

Are you going to try just the one line fix or the whole of 4.3.8?
I suppose if we want to backport to -stable the one-line fix is preferred.

        -Otto
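
For reference, the "status 11" in the quoted log line is consistent with
the SIGSEGV seen earlier in the ktrace output. Below is a minimal sketch
of decoding such a value, assuming nsd logs the raw wait(2) status
unmodified (that is an assumption, not something confirmed in this
thread). decode_status is a hypothetical helper, not part of nsd.

```shell
#!/bin/sh
# Hypothetical helper (not part of nsd): decode a raw wait(2) status,
# assuming the number in the log line is that status printed as-is.
# Stopped/continued states are not handled in this sketch.
decode_status() {
    status=$1
    # Low 7 bits hold the terminating signal; 0 means a normal exit.
    sig=$((status & 127))
    if [ "$sig" -eq 0 ]; then
        # For a normal exit, the exit code sits in the next 8 bits.
        echo "exited with code $(( (status >> 8) & 255 ))"
    else
        # kill -l <number> prints the signal name without the SIG prefix.
        echo "killed by SIG$(kill -l "$sig")"
    fi
}

decode_status 11
```

With signal 11 being SIGSEGV (as on OpenBSD), the call above prints
"killed by SIGSEGV", which matches the crash in the child processes.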
