On Mon, Apr 13, 2020 at 12:05:10PM +0100, Richard Chivers wrote:
> Thanks. Please see my comments below.
>
> On Mon, 13 Apr 2020, 10:18 Remi Locherer, <remi.loche...@relo.ch> wrote:
>
> > Hi Richard,
> >
> > On Mon, Apr 13, 2020 at 08:38:31AM +0100, Richard Chivers wrote:
> > > We have been having a strange issue, whereby OSPF stops updating
> > > properly.
> > >
> > > We can see an entry for an ip route in the database but it is not in the
> > > kernel routing table, and when it is the DR, other routers then do not
> > > have the route at all.
> > >
> > > We are seeing this across multiple boxes. We have 10+ ospf speakers, and
> > > seem to see the issue at different times.
> > >
> > > The problem starts with:
> > >
> > > ospfd[6960]: recv_db_description: neighbor ID x.x.x.x: seq num mismatch,
> > > bad flags
> >
> > The neighbor sent a db desc with the master flag set differently than what
> > this ospfd instance recorded before for that particular neighbor.
> >
> > See 2nd last item on page 100 of RFC 2328:
> > https://tools.ietf.org/html/rfc2328#page-100
> >
> Thanks, should the routers just recover then from this scenario even if it
> was happening due to lost packets, CPU pause etc.

I think so. But it may take quite a while. It might also be a bug in ospfd
or in another implementation.

> > >
> > > ospfd[30114]: lsa_check: bad age
> > >
> > > And these just then continue, until we restart ospfd, and the problem
> > > appears to go away.
> >
> > Is the neighbor also OpenBSD ospfd or something else?
> >
>
> The neighbors complaining are openbsd, we do however have a couple of
> different neighbors that are not openbsd.

What software (incl. version please) are these neighbors that make ospfd log
these messages? It should be easy to identify those neighbors since you see
the neighbor ID in the log message.

> > >
> > > We are running some old routers on 5.8 and some new on 6.4. We appreciate
> > > that we need to upgrade the 5.8 routers but we are keen to stabilise
> > > things first.
> > >
> > > Having looked at the source, we can see the line generating the message:
> > >
> > > case NBR_STA_FULL:
> > >     if (dd_hdr.bits & OSPF_DBD_I ||
> > >         !(dd_hdr.bits & OSPF_DBD_MS) == !nbr->dd_master) {
> > >         log_warnx("recv_db_description: neighbor ID %s: "......
> > >
> > > Could anyone explain the scenario in which this would be expected, so we
> > > can see how to resolve the issue.
> >
> > Please share a pcap file with the OSPF packets. With this we can better
> > understand how this happens and where to look for a bug.
> >
> We will take a look at this. May be difficult to catch due to regularity.
> Every 30 mins or so.

Then just let tcpdump run for a few hours. If you filter for ospf it should
not need too much disk space. And don't forget to raise the snaplen. E.g.:

tcpdump -i em1 -s 1500 -w /tmp/ospf.pcap proto ospf

> > >
> > > We run some of our routers under VMware, could some sort of OS pause
> > > cause this?
> >
> > Maybe if the router is not getting all OSPF packets because of this.
> >
> > Remi
> >
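
For reference, the flag test from the NBR_STA_FULL snippet quoted in this
thread can be pulled out into a small standalone function. This is only a
sketch and not ospfd code: dd_flags_mismatch() and the DBD_* names are
hypothetical, the bit values follow the DD packet layout in RFC 2328
appendix A.3.3, and we_are_master is assumed to mirror nbr->dd_master (true
when the local router is master for this adjacency, which is how the quoted
condition reads).

```c
#include <stdbool.h>
#include <stdint.h>

/* DD packet flag bits, per RFC 2328 appendix A.3.3. */
#define DBD_MS 0x01 /* sender claims to be master */
#define DBD_M  0x02 /* more DD packets follow */
#define DBD_I  0x04 /* initial DD packet */

/*
 * Hypothetical helper, not part of ospfd.
 *
 * bits:          flag byte of a DD packet received in state Full
 * we_are_master: assumed equivalent of nbr->dd_master
 *
 * Returns true when the flags alone would lead to the
 * "seq num mismatch, bad flags" warning.
 */
static bool
dd_flags_mismatch(uint8_t bits, bool we_are_master)
{
    bool nbr_claims_master = (bits & DBD_MS) != 0;

    /* An Init bit is not expected once the exchange has finished. */
    if (bits & DBD_I)
        return true;

    /* Exactly one side may be master; both or neither is inconsistent. */
    return nbr_claims_master == we_are_master;
}
```

With this, dd_flags_mismatch(DBD_MS, true) is true (both sides claim to be
master), and so is dd_flags_mismatch(0, false) (both sides think they are
slave). A packet with only the MS bit set while the local router is slave
passes the flag test.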