On 6/13/17 2:16 PM, Ben Greear wrote: > On 06/09/2017 02:25 PM, Eric Dumazet wrote: >> On Fri, 2017-06-09 at 07:27 -0600, David Ahern wrote: >>> On 6/8/17 11:55 PM, Cong Wang wrote: >>>> On Thu, Jun 8, 2017 at 2:27 PM, Ben Greear <gree...@candelatech.com> >>>> wrote: >>>>> >>>>> As far as I can tell, the patch did not help, or at least we still >>>>> reproduce >>>>> the >>>>> crash easily. >>>> >>>> netlink dump is serialized by nlk->cb_mutex so I don't think that >>>> patch makes any sense w.r.t race condition. >>> >>> From what I can see fn_sernum should be accessed under table lock, so >>> when saving and checking it during a walk make sure it the lock is held. >>> That has nothing to do with the netlink dump, but the table changing >>> during a walk. >> >> >> Yes, your patch makes total sense, of course. > > I guess someone should go ahead and make an official patch and > submit it, even if it doesn't fix my problem.
I can do that; was hoping to root cause the problem first. > >>>>> (gdb) l *(fib6_walk_continue+0x76) >>>>> 0x188c6 is in fib6_walk_continue >>>>> (/home/greearb/git/linux-2.6/net/ipv6/ip6_fib.c:1593). >>>>> 1588 if (fn == w->root) >>>>> 1589 return 0; >>>>> 1590 pn = fn->parent; >>>>> 1591 w->node = pn; >>>>> 1592 #ifdef CONFIG_IPV6_SUBTREES >>>>> 1593 if (FIB6_SUBTREE(pn) == fn) { >>>> >>>> Apparently fn->parent is NULL here for some reason, but >>>> I don't know if that is expected or not. If a simple NULL check >>>> is not enough here, we have to trace why it is NULL. >>> >>> From my understanding, parent should not be null hence the attempts to >>> fix access to table nodes under a lock. ie., figuring out why it is null >>> here. > > If someone has more suggestions, I'll be happy to test. I have looked at the code again and nothing is jumping out. Will look again later today.