On Thu, Jul 06, 2023 at 02:14:09PM +0000, Valdrin MUJA wrote:
> I've applied your patch but crashed again. Here it is:
> ddb{1}> show panic
> *cpu1: kernel diagnostic assertion "refcnt_read(&rt->rt_refcnt) >= 2" failed:
> file "/usr/src/sys/net/rtable.c", line 828

The KASSERT I added seems to be wrong.  I copied it from above
without thinking enough.  Just remove it; updated diff below.

I compared your crash 3 and 4 output:

TEST1>
uvm_fault(0xfffffd826717bcc0, 0x8, 0, 1) -> e
kernel: page fault trap, code=0
Stopped at      srp_get_locked+0x11:    movq    0(%rdi),%rax
    TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
*225335  47125      0           0          0    1  bgpd
 231752  78299     73   0x1100010          0    3  syslogd
 344909   6421      0     0x14000      0x200    2  wg_handshake
 361415  98860      0     0x14000      0x200    0  reaper

SPOKE1>
uvm_fault(0xfffffd81d5995878, 0x8, 0, 1) -> e
kernel: page fault trap, code=0
Stopped at      srp_get_locked+0x11:    movq    0(%rdi),%rax
    TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
 448769  98731      0    0x100002          0    3  sh
 350289  69698     73   0x1100010          0    0  syslogd
*114462  84824      0           0          0    1  bgpd
 256495  50081      0     0x14000      0x200    2  wg_handshake

It is interesting that bgpd and WireGuard are running in both cases
when it crashes.  Unfortunately your mail does not include this
output for crashes 1 and 2.  It is printed immediately when the
machine crashes.  Do you have it in some console history?

I see many different workloads on your machine.  That makes it
harder to identify the subsystem that has the bug.  I see bgpd(8)
and wg(4) doing things with network and routing.  Is there
something else?

What has changed to make these crashes happen?  New workload?  New
machine?  Upgrade to 7.3?  Was it stable with 7.2?

...

Thanks for testing.
bluhm

Index: net/rtable.c
===================================================================
RCS file: /data/mirror/openbsd/cvs/src/sys/net/rtable.c,v
retrieving revision 1.82
diff -u -p -r1.82 rtable.c
--- net/rtable.c	19 Apr 2023 17:42:47 -0000	1.82
+++ net/rtable.c	6 Jul 2023 15:56:04 -0000
@@ -604,6 +604,11 @@ rtable_insert(unsigned int rtableid, str
 	SRPL_INSERT_HEAD_LOCKED(&rt_rc, &an->an_rtlist, rt, rt_next);
 
 	prev = art_insert(ar, an, addr, plen);
+	if (prev == an) {
+		rw_exit_write(&ar->ar_lock);
+		/* keep the refcount for rt while it is in an_rtlist */
+		return (0);
+	}
 	if (prev != an) {
 		SRPL_REMOVE_LOCKED(&rt_rc, &an->an_rtlist, rt, rtentry,
 		    rt_next);
@@ -689,9 +694,10 @@ rtable_delete(unsigned int rtableid, str
 		npaths++;
 
 	if (npaths > 1) {
-		KASSERT(refcnt_read(&rt->rt_refcnt) >= 1);
+		KASSERT(refcnt_read(&rt->rt_refcnt) >= 2);
 		SRPL_REMOVE_LOCKED(&rt_rc, &an->an_rtlist, rt, rtentry,
 		    rt_next);
+		rtfree(rt);
 
 		mrt = SRPL_FIRST_LOCKED(&an->an_rtlist);
 		if (npaths == 2)
@@ -703,8 +709,9 @@ rtable_delete(unsigned int rtableid, str
 	if (art_delete(ar, an, addr, plen) == NULL)
 		panic("art_delete failed to find node %p", an);
 
-	KASSERT(refcnt_read(&rt->rt_refcnt) >= 1);
+	KASSERT(refcnt_read(&rt->rt_refcnt) >= 2);
 	SRPL_REMOVE_LOCKED(&rt_rc, &an->an_rtlist, rt, rtentry, rt_next);
+	rtfree(rt);
 	art_put(an);
 
 leave:
@@ -821,12 +828,10 @@ rtable_mpath_reprio(unsigned int rtablei
 		 */
 		rt->rt_priority = prio;
 	} else {
-		rtref(rt); /* keep rt alive in between remove and insert */
 		SRPL_REMOVE_LOCKED(&rt_rc, &an->an_rtlist,
 		    rt, rtentry, rt_next);
 		rt->rt_priority = prio;
 		rtable_mpath_insert(an, rt);
-		rtfree(rt);
 		error = EAGAIN;
 	}
 	rw_exit_write(&ar->ar_lock);
@@ -839,6 +844,9 @@ rtable_mpath_insert(struct art_node *an,
 {
 	struct rtentry			*mrt, *prt = NULL;
 	uint8_t				 prio = rt->rt_priority;
+
+	/* increment the refcount for rt while it is in an_rtlist */
+	rtref(rt);
 
 	if ((mrt = SRPL_FIRST_LOCKED(&an->an_rtlist)) == NULL) {
 		SRPL_INSERT_HEAD_LOCKED(&rt_rc, &an->an_rtlist, rt, rt_next);