Hello,
On Fri, May 27, 2022 at 10:33:06AM +0200, Hrvoje Popovski wrote:
> Hi all,
>
> I'm running firewall in production with NET_TASKQ=6 with claudio@ "use
> timeout for rttimer" and bluhm@ "kernel lock in arp" diffs.
> After week or so of running smoothly I've got panic.
thank you for being brave enough to run those bits in production.
</snip>
> bcbnfw1# uvm_fault(0xffffffff823c6ac0, 0x10, 0, 1) -> e
> kernel: page fault trap, code=0
> Stopped at pf_state_export+0x4e: movq 0x10(%rax),%rcx
according to the registers below, rax is 0, so we die because
of a NULL pointer dereference.
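
the fault address in the uvm_fault() line (0x10) fits that reading: the faulting instruction loads from 0x10(%rax), so a zero base gives exactly 0x10. A minimal sketch of that arithmetic follows. The struct layout and both helper names are made up for illustration; this is not the real struct pf_state_key.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical layout, NOT the real struct pf_state_key: two 8-byte
 * members come first so that the third one lands at offset 0x10.
 */
struct demo_key {
	uint64_t first;		/* offset 0x00 */
	uint64_t second;	/* offset 0x08 */
	uint64_t third;		/* offset 0x10 */
};

/* Address touched by a load of the form "movq disp(%rax),%rcx". */
static uintptr_t
effective_address(uintptr_t base, size_t disp)
{
	return base + disp;
}

/* Recover the base register value from a fault address and displacement. */
static uintptr_t
base_from_fault(uintptr_t fault_va, size_t disp)
{
	return fault_va - disp;
}
```

With a fault address of 0x10 and a displacement of 0x10, base_from_fault() yields 0, i.e. the pointer held in rax was NULL.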
> TID PID UID PRFLAGS PFLAGS CPU COMMAND
> *414231 37466 0 0x14000 0x200 3 softnet
> 180795 96693 0 0x14000 0x200 2 softnet
> 39487 54182 0 0x14000 0x200 0 softnet
> 221352 95757 0 0x14000 0x200 4 softnet
> 252845 32137 0 0x14000 0x200 1 softnet
> 294301 63695 0 0x14000 0x200 5 softnet
> pf_state_export(fffffd80611313c8,fffffd8877492ac0) at pf_state_export+0x4e
> pfsync_sendout() at pfsync_sendout+0x5e4
> pfsync_update_state(fffffd887df852b8) at pfsync_update_state+0x15b
> pf_test(2,1,ffff800000d48000,ffff800020b23a08) at pf_test+0xd53
> ip_input_if(ffff800020b23a08,ffff800020b23a14,4,0,ffff800000d48000) at
> ip_input_if+0xcd
> ipv4_input(ffff800000d48000,fffffd80774a4000) at ipv4_input+0x39
> ether_input(ffff800000d48000,fffffd80774a4000) at ether_input+0x3ad
> carp_input(ffff800000d64000,fffffd80774a4000,5e000115) at carp_input+0x196
> ether_input(ffff800000d64000,fffffd80774a4000) at ether_input+0x1d9
> vlan_input(ffff800000b9f000,fffffd80774a4000,ffff800020b23c3c) at
> vlan_input+0x23d
> ether_input(ffff800000b9f000,fffffd80774a4000) at ether_input+0x85
> if_input_process(ffff800000493048,ffff800020b23cd8) at if_input_process+0x6f
> ifiq_process(ffff800000491b00) at ifiq_process+0x69
> taskq_thread(ffff800000036500) at taskq_thread+0x11a
> end trace frame: 0x0, count: 1
> https://www.openbsd.org/ddb.html describes the minimum info required in bug
> reports. Insufficient info makes it difficult to find and fix bugs.
> ddb{3}>
>
according to the call stack we die somewhere here (source lines 1192-1201):

	memset(sp, 0, sizeof(struct pfsync_state));

	/* copy from state key */
	sp->key[PF_SK_WIRE].addr[0] = st->key[PF_SK_WIRE]->addr[0];
	sp->key[PF_SK_WIRE].addr[1] = st->key[PF_SK_WIRE]->addr[1];
	sp->key[PF_SK_WIRE].port[0] = st->key[PF_SK_WIRE]->port[0];
	sp->key[PF_SK_WIRE].port[1] = st->key[PF_SK_WIRE]->port[1];
	sp->key[PF_SK_WIRE].rdomain =
	    htons(st->key[PF_SK_WIRE]->rdomain);
	sp->key[PF_SK_WIRE].af = st->key[PF_SK_WIRE]->af;

it looks like the state key bound to st might be gone (st->key[] == NULL).
I'll take a closer look later today.
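
to illustrate the suspected failure mode (not a proposed fix), here is a small self-contained model of an export routine that copies fields through per-state key pointers. All demo_* names and the struct layouts are hypothetical stand-ins, not the pf(4) definitions, and the htons() conversion is left out for brevity. The NULL check only marks the spot where the real code would dereference a key that has already been detached.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for PF_SK_WIRE / PF_SK_STACK. */
enum { DEMO_SK_WIRE = 0, DEMO_SK_STACK = 1, DEMO_SK_MAX = 2 };

/* Simplified state key; the real one carries more fields. */
struct demo_state_key {
	uint16_t port[2];
	uint16_t rdomain;
	uint8_t af;
};

/* A state refers to its keys through pointers, which may be cleared. */
struct demo_state {
	struct demo_state_key *key[DEMO_SK_MAX];
};

/* Flat copy of the keys, as an export format would hold them. */
struct demo_export {
	struct demo_state_key key[DEMO_SK_MAX];
};

/*
 * Copy both keys into *sp. Returns 0 on success and -1 if a key
 * pointer is NULL, which is the case that would otherwise fault.
 */
static int
demo_state_export(struct demo_export *sp, const struct demo_state *st)
{
	int i;

	memset(sp, 0, sizeof(*sp));

	for (i = 0; i < DEMO_SK_MAX; i++) {
		const struct demo_state_key *sk = st->key[i];

		if (sk == NULL)
			return -1;	/* key already detached */

		sp->key[i].port[0] = sk->port[0];
		sp->key[i].port[1] = sk->port[1];
		sp->key[i].rdomain = sk->rdomain;
		sp->key[i].af = sk->af;
	}
	return 0;
}
```

a check like this would only hide the symptom. If another thread can detach the key while the state is being exported, the underlying problem is one of locking or reference counting, and that is what needs a closer look.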
thanks and
regards
sashan