On Fri 09/06/2017 12:07, Martin Pieuchot wrote: > On 08/06/17(Thu) 20:38, Björn Ketelaars wrote: > > On Thu 08/06/2017 16:55, Martin Pieuchot wrote: > > > On 07/06/17(Wed) 09:43, Björn Ketelaars wrote: > > > > On Sat 03/06/2017 08:44, Björn Ketelaars wrote: > > > > > > > > > > Reverting back to the previous kernel fixed the issue above. > > > > > Question: can > > > > > someone give a hint on how to track this issue? > > > > > > > > After a bit of experimenting I'm able to reproduce the problem. Summary > > > > is > > > > that queueing in pf and use of a current (after May 30), multi processor > > > > kernel (bsd.mp from snapshots) causes these specific watchdog timeouts > > > > followed by a system freeze. > > > > > > > > Issue is 'gone' when: > > > > 1.) using an older kernel (before May 30); > > > > 2.) removal of queueing statements from pf.conf. Included below the > > > > specific > > > > snippet; > > > > 3.) switch from MP kernel to SP kernel. > > > > > > > > New observation is that while queueing, using a MP kernel, the download > > > > bandwidth is only a fraction of what is expected. Exchanging the MP > > > > kernel > > > > with a SP kernel restores the download bandwidth to expected level. > > > > > > > > I'm guessing that this issue is related to recent work on PF? > > > > > > It's certainly a problem in, or exposed by, re(4) with the recent MP work > > > in the network stack. > > > > > > It would help if you could build a kernel with MP_LOCKDEBUG defined and > > > see if the resulting kernel enters ddb(4) instead of freezing. > > > > > > Thanks, > > > Martin > > > > Thanks for the hint! It helped in entering ddb. I collected a bit of output, > > which you can find below. If I read the trace correctly the crash is related > > to line 1750 of sys/dev/ic/re.c: > > > > d->rl_cmdstat |= htole32(RL_TDESC_CMD_EOF); > > Could you test the diff below, always with a MP_LOCKDEBUG kernel and > tell us if you can reproduce the freeze or if the kernel enters ddb(4)? > > Another question, how often do you see "watchdog timeout" messages? > > Index: re.c > =================================================================== > RCS file: /cvs/src/sys/dev/ic/re.c,v > retrieving revision 1.201 > diff -u -p -r1.201 re.c > --- re.c 24 Jan 2017 03:57:34 -0000 1.201 > +++ re.c 9 Jun 2017 10:04:43 -0000 > @@ -2074,9 +2074,6 @@ re_watchdog(struct ifnet *ifp) > s = splnet(); > printf("%s: watchdog timeout\n", sc->sc_dev.dv_xname); > > - re_txeof(sc); > - re_rxeof(sc); > - > re_init(ifp); > > splx(s);
The diff (with a MP_LOCKDEBUG kernel) resulted in similar traces as before. ddb Output is included below. With your diff the number of timeout messages decreased from 9 to 2 before entering ddb. ddb{1}> show panic the kernel did not panic ddb{1}> machine ddbcpu 0 Stopped at db_enter+0x9: leave ddb{0}> trace db_enter(ffffffff8196b640,200,ffff8000004d1600,10,ffff8000220e9938,282) at db_e nter+0x9 x86_ipi_handler(ffff800000074010,ffffffff819e7160,102860,c,4,1813a9522) at x86_ ipi_handler+0x85 Xresume_lapic_ipi() at Xresume_lapic_ipi+0x1c --- interrupt --- ___mp_lock(ffffffff819e7160,ffffffff8187acf0,1,1,ffff8000220a38c0,ffff8000220a7 ab0) at ___mp_lock+0x4a ___mp_acquire_count(ffffffff819e7160,1,ffff8000220a38c0,ffff8000220a7ab0,ffffff 011da1e1c8,ffffffff81019942) at ___mp_acquire_count+0x33 mi_switch(ffffff011b4a93c0,ffffffff818bf7e7,9cd,ffff8000220e9bb0,ffff8000220a7a b0,ffff8000220e9ba0) at mi_switch+0x22b sleep_finish(ffff8000220e9bb0,1,118,ffff8000220e9bb0,ffffff011b4a93c0,ffffffff8 18bf7e7) at sleep_finish+0xc2 tsleep(ffffff011b4a93c0,118,ffffffff818bf7e7,9cd,0,40) at tsleep+0x164 kqueue_scan(ffffff011b4a93c0,40,143d48714800,ffff8000220e9dd0,ffff8000220a7ab0, ffff8000220e9de4) at kqueue_scan+0x15b sys_kevent(ffff8000220a7ab0,ffff8000220e9e60,ffff8000220e9eb0,ffff8000220a7ab0, 48,1) at sys_kevent+0x2f6 syscall() at syscall+0x29f --- syscall (number 72) --- end of kernel end trace frame: 0x143d48714800, count: -11 0x143c524f280a: ddb{0}> machine ddbcpu 1 Stopped at re_encap+0x24d: movl 0(%r15),%eax ddb{1}> trace re_encap(ffff800000084000,dd,ffffff00d6e03f00,ffff8000000842c0,400,ffff80000008 4000) at re_encap+0x24d re_start(ffff8000000842c0,7,ffffff00d6dd7200,ffff800000084338,ffff8000000842c0, ffffffff81091d70) at re_start+0x8c ifq_serialize(ffff8000000842c0,ffff800000084360,ffff800000084090,ffffff00d6a515 00,ffff8000000842c0,0) at ifq_serialize+0xf8 if_enqueue(ffff800000084090,ffffff00d6a51500,ffff80000001dc80,ffffff00d6a51500, 37,2) at if_enqueue+0x90 ether_output(ffff800000084090,ffffff00d6dd7200,ffff80000001dc80,ffffff011b3f49a 0,ffff800000933400,ffff80000001dc80) at ether_output+0x1d6 ip_output(ffffff00d6dd7200,0,ffff800022032cc0,1,0,0) at ip_output+0x8a1 ip_forward(ffffff00d6dd7200,ffff8000002a4090,0,0,8244c78a,8244c78a) at ip_forwa rd+0x1e7 ipv4_input(ffff8000002a4090,ffffff00d6dd7200,800,800,ffffff0009d00200,ffff80000 02a4090) at ipv4_input+0x4f7 ether_input(ffff8000002a4090,ffffff00d6dd7200,0,ffffffff81221640,ffff8000002a42 88,ffff8000002a4090) at ether_input+0xbd if_input_process(2,ffff800022032eb0,0,0,ffff800022032eb0,ffff800000019080) at i f_input_process+0xfa taskq_thread(ffff800000019080,2a2,ffff800000019080,ffffffff81524ec0,0,ffff80002 2032f10) at taskq_thread+0x79 end trace frame: 0x0, count: -11 ddb{1}> boot reboot panic: rw_enter: netlock locking against myself Stopped at db_enter+0x9: leave TID PID UID PRFLAGS PFLAGS CPU COMMAND 18657 98624 73 0x100010 0x80 0 syslogd *294902 94762 0 0x14000 0x200 1 softnet db_enter(ffffffff8173ec8e,ffff800022032130,10,ffff800022032110,282,8) at db_ent er+0x9 panic(ffffffff8172c108,14,ffff8000ffffed50,ffffffff8197d750,ffff8000ffffed54,ff ff800022032240) at panic+0x102 _rw_enter(ffffffff8197d750,1,ffff80002202e000,0,ffffffff819685a0,ffff8000220324 18) at _rw_enter+0x20f if_downall(ffffffff819685a0,ffff800022032418,4804,ffff8000220323b8,82fd78f8e65d 236c,0) at if_downall+0x2d boot(4804,ffffffff81945ea0,4804,ffffffff819685a0,ffff800022032418,4804) at boot +0xeb reboot(4804,5,0,ffff8000220323f0,ffff8000220323b8,d) at reboot+0x37 db_boot_reboot_cmd(ffffffff811c1c2d,0,ffffffffffffffff,ffff800022032420,0,82fd7 8f8e65d236c) at db_boot_reboot_cmd+0x22 db_command(ffffffff819685a0,ffffffff819681e0,ffff800022032690,6,0,78) at db_com mand+0x13a db_command_loop(82fd78f8e65d236c,0,ffffffff8143247f,ffff80002202e000,0,ffff8000 22032690) at db_command_loop+0xb0 db_trap(6,0,7,82fd78f8e65d236c,81186c47,ffffffff8166e3a4) at db_trap+0xf4 db_ktrap(6,0,ffff800022032690,1,ffff800022032690,ffff8000ffffed50) at db_ktrap+ 0x10f trap() at trap+0x16d --- trap (number 6) --- re_encap(ffff800000084000,dd,ffffff00d6e03f00,ffff8000000842c0,400,ffff80000008 4000) at re_encap+0x24d re_start(ffff8000000842c0,7,ffffff00d6dd7200,ffff800000084338,ffff8000000842c0, ffffffff81091d70) at re_start+0x8c end trace frame: 0xffff8000220329c0, count: 0 https://www.openbsd.org/ddb.html describes the minimum info required in bug reports. Insufficient info makes it difficult to find and fix bugs. ddb{1}> dmesg OpenBSD 6.1-current (GENERIC.MP) #9: Fri Jun 9 13:42:45 CEST 2017 b...@gateway.lan:/home/code/src/sys/arch/amd64/compile/GENERIC.MP real mem = 4245995520 (4049MB) avail mem = 4111519744 (3921MB) mpath0 at root scsibus0 at mpath0: 256 targets mainbus0 at root bios0 at mainbus0: SMBIOS rev. 2.7 @ 0xdf16d820 (6 entries) bios0: vendor coreboot version "SageBios_PCEngines_APU-45" date 04/05/2014 bios0: PC Engines APU acpi0 at bios0: rev 0 acpi0: sleep states S0 S1 S3 S4 S5 acpi0: tables DSDT FACP SPCR HPET APIC HEST SSDT SSDT SSDT acpi0: wakeup devices AGPB(S4) HDMI(S4) PBR4(S4) PBR5(S4) PBR6(S4) PBR7(S4) PE2 0(S4) PE21(S4) PE22(S4) PE23(S4) PIBR(S4) UOH1(S3) UOH2(S3) UOH3(S3) UOH4(S3) U OH5(S3) [...] acpitimer0 at acpi0: 3579545 Hz, 32 bits acpihpet0 at acpi0: 14318180 Hz acpimadt0 at acpi0 addr 0xfee00000: PC-AT compat cpu0 at mainbus0: apid 0 (boot processor) cpu0: AMD G-T40E Processor, 1000.16 MHz cpu0: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,C FLUSH,MMX,FXSR,SSE,SSE2,HTT,SSE3,MWAIT,SSSE3,CX16,POPCNT,NXE,MMXX,FFXSR,PAGE1GB ,RDTSCP,LONG,LAHF,CMPLEG,SVM,EAPICSP,AMCR8,ABM,SSE4A,MASSE,3DNOWP,IBS,SKINIT,IT SC cpu0: 32KB 64b/line 2-way I-cache, 32KB 64b/line 8-way D-cache, 512KB 64b/line 1 6-way L2 cache cpu0: 8 4MB entries fully associative cpu0: DTLB 40 4KB entries fully associative, 8 4MB entries fully associative cpu0: TSC frequency 1000155810 Hz cpu0: smt 0, core 0, package 0 mtrr: Pentium Pro MTRR support, 8 var ranges, 88 fixed ranges cpu0: apic clock running at 199MHz cpu0: mwait min=64, max=64, IBE cpu1 at mainbus0: apid 1 (application processor) cpu1: AMD G-T40E Processor, 1000.00 MHz cpu1: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,C FLUSH,MMX,FXSR,SSE,SSE2,HTT,SSE3,MWAIT,SSSE3,CX16,POPCNT,NXE,MMXX,FFXSR,PAGE1GB ,RDTSCP,LONG,LAHF,CMPLEG,SVM,EAPICSP,AMCR8,ABM,SSE4A,MASSE,3DNOWP,IBS,SKINIT,IT SC cpu1: 32KB 64b/line 2-way I-cache, 32KB 64b/line 8-way D-cache, 512KB 64b/line 1 6-way L2 cache cpu1: 8 4MB entries fully associative cpu1: DTLB 40 4KB entries fully associative, 8 4MB entries fully associative cpu1: smt 0, core 1, package 0 ioapic0 at mainbus0: apid 2 pa 0xfec00000, version 21, 24 pins acpiprt0 at acpi0: bus -1 (AGPB) acpiprt1 at acpi0: bus -1 (HDMI) acpiprt2 at acpi0: bus 1 (PBR4) acpiprt3 at acpi0: bus 2 (PBR5) acpiprt4 at acpi0: bus 3 (PBR6) acpiprt5 at acpi0: bus -1 (PBR7) acpiprt6 at acpi0: bus 5 (PE20) acpiprt7 at acpi0: bus -1 (PE21) acpiprt8 at acpi0: bus -1 (PE22) acpiprt9 at acpi0: bus -1 (PE23) acpiprt10 at acpi0: bus 0 (PCI0) acpiprt11 at acpi0: bus 4 (PIBR) acpicpu0 at acpi0: C2(0@100 io@0x841), C1(@1 halt!), PSS acpicpu1 at acpi0: C2(0@100 io@0x841), C1(@1 halt!), PSS acpibtn0 at acpi0: PWRB cpu0: 1000 MHz: speeds: 1000 800 MHz pci0 at mainbus0 bus 0 pchb0 at pci0 dev 0 function 0 "AMD AMD64 14h Host" rev 0x00 ppb0 at pci0 dev 4 function 0 "AMD AMD64 14h PCIE" rev 0x00: msi pci1 at ppb0 bus 1 re0 at pci1 dev 0 function 0 "Realtek 8168" rev 0x06: RTL8168E/8111E (0x2c00), m si, address 00:0d:b9:33:8e:a4 rgephy0 at re0 phy 7: RTL8169S/8110S/8211 PHY, rev. 4 ppb1 at pci0 dev 5 function 0 "AMD AMD64 14h PCIE" rev 0x00: msi pci2 at ppb1 bus 2 re1 at pci2 dev 0 function 0 "Realtek 8168" rev 0x06: RTL8168E/8111E (0x2c00), m si, address 00:0d:b9:33:8e:a5 rgephy1 at re1 phy 7: RTL8169S/8110S/8211 PHY, rev. 4 ppb2 at pci0 dev 6 function 0 "AMD AMD64 14h PCIE" rev 0x00: msi pci3 at ppb2 bus 3 re2 at pci3 dev 0 function 0 "Realtek 8168" rev 0x06: RTL8168E/8111E (0x2c00), m si, address 00:0d:b9:33:8e:a6 rgephy2 at re2 phy 7: RTL8169S/8110S/8211 PHY, rev. 4 ahci0 at pci0 dev 17 function 0 "ATI SBx00 SATA" rev 0x40: apic 2 int 19, AHCI 1 .2 ahci0: port 0: 6.0Gb/s scsibus1 at ahci0: 32 targets sd0 at scsibus1 targ 0 lun 0: <ATA, SAMSUNG SSD PM83, CXM1> SCSI3 0/direct fixe d naa.5002538043584d30 sd0: 30533MB, 512 bytes/sector, 62533296 sectors, thin ohci0 at pci0 dev 18 function 0 "ATI SB700 USB" rev 0x00: apic 2 int 18, versio n 1.0, legacy support ehci0 at pci0 dev 18 function 2 "ATI SB700 USB2" rev 0x00: apic 2 int 17 usb0 at ehci0: USB revision 2.0 uhub0 at usb0 configuration 1 interface 0 "ATI EHCI root hub" rev 2.00/1.00 add r 1 ohci1 at pci0 dev 19 function 0 "ATI SB700 USB" rev 0x00: apic 2 int 18, versio n 1.0, legacy support ehci1 at pci0 dev 19 function 2 "ATI SB700 USB2" rev 0x00: apic 2 int 17 usb1 at ehci1: USB revision 2.0 uhub1 at usb1 configuration 1 interface 0 "ATI EHCI root hub" rev 2.00/1.00 add r 1 piixpm0 at pci0 dev 20 function 0 "ATI SBx00 SMBus" rev 0x42: polling iic0 at piixpm0 pcib0 at pci0 dev 20 function 3 "ATI SB700 ISA" rev 0x40 ppb3 at pci0 dev 20 function 4 "ATI SB600 PCI" rev 0x40 pci4 at ppb3 bus 4 ohci2 at pci0 dev 20 function 5 "ATI SB700 USB" rev 0x00: apic 2 int 18, versio n 1.0, legacy support ppb4 at pci0 dev 21 function 0 "ATI SB800 PCIE" rev 0x00 pci5 at ppb4 bus 5 ohci3 at pci0 dev 22 function 0 "ATI SB700 USB" rev 0x00: apic 2 int 18, versio n 1.0, legacy support ehci2 at pci0 dev 22 function 2 "ATI SB700 USB2" rev 0x00: apic 2 int 17 usb2 at ehci2: USB revision 2.0 uhub2 at usb2 configuration 1 interface 0 "ATI EHCI root hub" rev 2.00/1.00 add r 1 pchb1 at pci0 dev 24 function 0 "AMD AMD64 14h Link Cfg" rev 0x43 pchb2 at pci0 dev 24 function 1 "AMD AMD64 14h Address Map" rev 0x00 pchb3 at pci0 dev 24 function 2 "AMD AMD64 14h DRAM Cfg" rev 0x00 km0 at pci0 dev 24 function 3 "AMD AMD64 14h Misc Cfg" rev 0x00 pchb4 at pci0 dev 24 function 4 "AMD AMD64 14h CPU Power" rev 0x00 pchb5 at pci0 dev 24 function 5 "AMD AMD64 14h Reserved" rev 0x00 pchb6 at pci0 dev 24 function 6 "AMD AMD64 14h NB Power" rev 0x00 pchb7 at pci0 dev 24 function 7 "AMD AMD64 14h Reserved" rev 0x00 usb3 at ohci0: USB revision 1.0 uhub3 at usb3 configuration 1 interface 0 "ATI OHCI root hub" rev 1.00/1.00 add r 1 usb4 at ohci1: USB revision 1.0 uhub4 at usb4 configuration 1 interface 0 "ATI OHCI root hub" rev 1.00/1.00 add r 1 isa0 at pcib0 isadma0 at isa0 com0 at isa0 port 0x3f8/8 irq 4: ns16550a, 16 byte fifo com0: console com1 at isa0 port 0x2f8/8 irq 3: ns16550a, 16 byte fifo pcppi0 at isa0 port 0x61 spkr0 at pcppi0 lpt0 at isa0 port 0x378/4 irq 7 wbsio0 at isa0 port 0x2e/2: NCT5104D rev 0x52 usb5 at ohci2: USB revision 1.0 uhub5 at usb5 configuration 1 interface 0 "ATI OHCI root hub" rev 1.00/1.00 add r 1 usb6 at ohci3: USB revision 1.0 uhub6 at usb6 configuration 1 interface 0 "ATI OHCI root hub" rev 1.00/1.00 add r 1 vmm0 at mainbus0: SVM/RVI umass0 at uhub2 port 1 configuration 1 interface 0 "Generic Flash Card Reader/W riter" rev 2.01/1.00 addr 2 umass0: using SCSI over Bulk-Only scsibus2 at umass0: 2 targets, initiator 0 sd1 at scsibus2 targ 1 lun 0: <Multiple, Card Reader, 1.00> SCSI2 0/direct remo vable serial.058f6366058F63666485 sd1: 7600MB, 512 bytes/sector, 15564800 sectors vscsi0 at root scsibus3 at vscsi0: 256 targets softraid0 at root scsibus4 at softraid0: 256 targets root on sd0a (28fcdc10008babff.a) swap on sd0b dump on sd0b re1: watchdog timeout re1: watchdog timeout uvm_fault(0xffffffff819f14a0, 0x400, 0, 1) -> e kernel: page fault trap, code=0 Stopped at re_encap+0x24d: movl 0(%r15),%eax