On Aug 16, 2011, at 20:15, Julian Elischer wrote: > from your description it doesn't sound like a vr problem. > I suggest you hook up teh serial console (I am guessing you already have) > and set the config options to allow break-to-debugger or > alt-break-to-debugger on it > when it happens next, drop into the debugger.. > > in fact, drop in, and do a ps to see what processes are runnable, > 'tr [pid] (or thread id)' to get a stack trace of anything that looks > interesting, and then cont and do it again a few times to get a feel > for where the processor is hanging out (a straight 'tr' will give you > the interrupt of the com port which is not intresting..)
Alright, I thought I had locked this down to ZERO_COPY_SOCKETS being enabled in the kernel (as unlikely as it seems that it'd break things without being explicitly used -- I wanted to believe!). A couple days ago I figured out how to consistently make it happen (or not). For about a week we'd been running with no trouble; until I realized that a hostname added to our pf rules made them not load on startup. When running without our pf rules loaded, everything is fine. WIth the pf rules loaded the system will hang within 2-4 hours. Our ruleset was about 240 rules. We cut it down to ~140 rules today to see if that made a difference; it didn't. We also turned off as much IPv6 traffic as we could in case that was what was disturbing things (adding IPv6 traffic is relatively new). Flipping the backup system to be the master for the active IP addresses doesn't (as far as I can tell) make the old master recover. I tried looking for interesting information with the kernel debugger; but honestly nothing springs out as interesting -- though that may be due to my lack of skills more than anything else. I am including a snapshot below. I ran ps and tr on 100024, 100025 (vr0 and vr1 interrupt threads) and 100022 (thread taskq) a few more times but it seemed to me to give basically the same thread trace information. For what it's worth, if I let 'cont' run for long enough for the system to send out carp packets and get it's IPs back, 'ps' seemed more likely to have 100024 or 100025 (vr0/vr1) in 'Run' state. If I interrupted the system again relatively quickly after continuing, it was more likely to have 'thread tasks' be the busy process. Any tips for what I can do to extract something more useful would be greatly appreciated. As a last note -- earlier today I noticed that "Searches" in 'pfctl -si' seemed to go up significantly just before the console locks up (which is 15-30 minutes before the box stops routing packets). My theory was that we were getting some sort of tiny DoS attack (doesn't take much to take out a computer the size of a CD case); but that doesn't really make sense as the system 1) never recovers and 2) the backup is fine as soon as the master has been turned off / changed priorities. - ask db> ps pid ppid pgrp uid state wmesg wchan cmd 3300 2371 2371 0 R sshd 3299 3298 3298 0 S+ sbwait 0xc2ddb58c ssh 3298 3100 3298 0 S+ piperd 0xc2624620 scp 3100 3098 3100 0 Ss+ pause 0xc2e95850 csh 3098 2371 3098 0 Rs sshd 2904 2371 2904 0 Ss select 0xc379a924 sshd 2476 2474 2476 0 Ss+ ttyin 0xc2e44070 csh 2474 2371 2474 0 Rs sshd 2440 1 2440 0 Rs+ getty 2433 1 2433 0 Rs cron 2426 1 2426 62 Rs ftp-proxy 2397 1 2397 0 Ss select 0xc2d71964 inetd 2371 1 2371 0 Rs sshd 2349 1 2349 0 Rs bird 2302 1 2302 0 Rs radvd 2298 2292 2285 65534 S piperd 0xc2624188 multilog 2297 2291 2285 0 R openvpn 2292 2287 2285 0 S select 0xc2d713a4 supervise 2291 2287 2285 0 S select 0xc2d71a64 supervise 2288 1 2285 0 S piperd 0xc2624c40 readproctitle 2287 1 2285 0 R svscan 2244 1 2244 0 Rs ntpd 2129 1 2129 0 Rs syslogd 50 0 0 0 SL mdwait 0xc2717000 [md1] 40 0 0 0 SL mdwait 0xc24ef000 [md0] 22 0 0 0 RL [flowcleaner] 21 0 0 0 SL sdflush 0xc0b24d80 [softdepflush] 20 0 0 0 SL vlruwt 0xc25f52a8 [vnlru] 19 0 0 0 SL syncer 0xc0b14774 [syncer] 18 0 0 0 RL [bufdaemon] 17 0 0 0 RL [pagezero] 16 0 0 0 SL psleep 0xc0b2567c [vmdaemon] 15 0 0 0 SL psleep 0xc0b25644 [pagedaemon] 9 0 0 0 SL ccb_scan 0xc0b006d4 [xpt_thrd] 8 0 0 0 RL [pfpurge] 7 0 0 0 SL waiting_ 0xc0b181d8 [sctp_iterator] 14 0 0 0 SL (threaded) usb 100036 D - 0xc250ed34 [usbus1] 100035 D - 0xc250ed04 [usbus1] 100034 D - 0xc250ecd4 [usbus1] 100033 D - 0xc250eca4 [usbus1] 100032 D - 0xc2504b5c [usbus0] 100031 D - 0xc2504b2c [usbus0] 100030 D - 0xc2504afc [usbus0] 100029 D - 0xc2504acc [usbus0] 13 0 0 0 SL - 0xc0b0df64 [yarrow] 6 0 0 0 SL crypto_r 0xc0b2432c [crypto returns] 5 0 0 0 SL crypto_w 0xc0b24308 [crypto] 4 0 0 0 SL - 0xc0b0bc44 [g_down] 3 0 0 0 SL - 0xc0b0bc40 [g_up] 2 0 0 0 SL - 0xc0b0bc38 [g_event] 12 0 0 0 RL (threaded) intr 100037 I [swi0: uart uart] 100028 I [irq12: ohci0 ehci0] 100027 I [irq14: ata0] 100026 I [irq15: vr2 ata1] 100025 Run CPU 0 [irq11: vr1] 100024 RunQ [irq10: vr0] 100021 I [swi5: +] 100016 I [swi2: cambio] 100015 I [swi6: task queue] 100014 I [swi6: Giant taskq] 100006 I [swi1: netisr 0] 100005 I [swi4: clock] 100004 I [swi3: vm] 11 0 0 0 RL [idle] 1 0 1 0 SLs wait 0xc2433d48 [init] 10 0 0 0 SL audit_wo 0xc0b246a0 [audit] 0 0 0 0 RLs (threaded) kernel 100040 D - 0xc2527880 [dummynet] 100023 D - 0xc24d6580 [glxsb0 taskq] 100022 RunQ [thread taskq] 100020 D - 0xc24afd40 [acpi_task_2] 100019 D - 0xc24afd40 [acpi_task_1] 100018 D - 0xc24afd40 [acpi_task_0] 100017 D - 0xc24afd80 [kqueue taskq] 100010 D - 0xc24512c0 [firmware taskq] 100000 D sched 0xc0b0bd20 [swapper] db> tr 100025 Tracing pid 12 tid 100025 td 0xc24d42d0 kdb_enter(c0a48fbe,c0a5d00a,1,c2522280,0,...) at 0xc0733b3a = kdb_enter+0x3a uart_intr(c2522200,c24d42d0,c0afdc90,c2430980,4,...) at 0xc0632206 = uart_intr+0x126 intr_event_handle(c2430980,cc918ac4,cc918aa4,cc918b40,4,...) at 0xc06dff2c = intr_event_handle+0x5c intr_execute_handlers(c0afdc90,cc918ac4,cc918b40,c2ef8a00,cc918b68,...) at 0xc09d4929 = intr_execute_handlers+0x49 atpic_handle_intr(4,cc918ac4) at 0xc09ee047 = atpic_handle_intr+0x67 Xatpic_intr4() at 0xc09cfa52 = Xatpic_intr4+0x22 --- interrupt, eip = 0xc09e0f58, esp = 0xcc918b04, ebp = 0xcc918b68 --- bzero(c2ef8a00,0,12,0,0,...) at 0xc09e0f58 = bzero ip_input(c2ef8a00,246,c24a77c0,cc918bd0,c0622c21,...) at 0xc083ff8d = ip_input+0x6cd netisr_dispatch_src(1,0,c2ef8a00,cc918c08,c07bb331,...) at 0xc07c9109 = netisr_dispatch_src+0x89 netisr_dispatch(1,c2ef8a00,c24d8c00,c24d8c00,c2f16008,...) at 0xc07c93a0 = netisr_dispatch+0x20 ether_demux(c24d8c00,c2ef8a00,3,0,3,...) at 0xc07bb331 = ether_demux+0x161 ether_input(c24d8c00,c2ef8a00,c09d6ad0,c0b11780,c24d42d0,...) at 0xc07bb8af = ether_input+0x33f vr_intr(c2499000,c24d42d0,0,109,dc13a3e2,...) at 0xc0690034 = vr_intr+0x4d4 intr_event_execute_handlers(c24337f8,c2430600,c0a6b180,52d,c2430670,...) at 0xc06def8b = intr_event_execute_handlers+0x13b ithread_loop(c24de360,cc918d28,0,0,0,...) at 0xc06e05ea = ithread_loop+0x6a fork_exit(c06e0580,c24de360,cc918d28) at 0xc06dc830 = fork_exit+0x90 fork_trampoline() at 0xc09cf964 = fork_trampoline+0x8 --- trap 0, eip = 0, esp = 0xcc918d60, ebp = 0 --- db> tr 100024 Tracing pid 12 tid 100024 td 0xc24d45a0 sched_switch(c24d45a0,0,109,dc1435c0,660,...) at 0xc07270bf = sched_switch+0x1df mi_switch(109,0,c0a6b180,52d,c24306f0,...) at 0xc071086a = mi_switch+0x11a ithread_loop(c24de8c0,cc894d28,0,0,0,...) at 0xc06e06df = ithread_loop+0x15f fork_exit(c06e0580,c24de8c0,cc894d28) at 0xc06dc830 = fork_exit+0x90 fork_trampoline() at 0xc09cf964 = fork_trampoline+0x8 --- trap 0, eip = 0, esp = 0xcc894d60, ebp = 0 --- db> cont KDB: enter: Line break on console [thread pid 0 tid 100022 ] Stopped at 0xc0733b3a = kdb_enter+0x3a: movl $0,0xc0b124b4 = kdb_why db> ps pid ppid pgrp uid state wmesg wchan cmd 3300 2371 2371 0 R sshd 3299 3298 3298 0 S+ sbwait 0xc2ddb58c ssh 3298 3100 3298 0 S+ piperd 0xc2624620 scp 3100 3098 3100 0 Ss+ pause 0xc2e95850 csh 3098 2371 3098 0 Rs sshd 2904 2371 2904 0 Ss select 0xc379a924 sshd 2476 2474 2476 0 Ss+ ttyin 0xc2e44070 csh 2474 2371 2474 0 Rs sshd 2440 1 2440 0 Rs+ getty 2433 1 2433 0 Rs cron 2426 1 2426 62 Rs ftp-proxy 2397 1 2397 0 Ss select 0xc2d71964 inetd 2371 1 2371 0 Rs sshd 2349 1 2349 0 Rs bird 2302 1 2302 0 Rs radvd 2298 2292 2285 65534 S piperd 0xc2624188 multilog 2297 2291 2285 0 R openvpn 2292 2287 2285 0 S select 0xc2d713a4 supervise 2291 2287 2285 0 S select 0xc2d71a64 supervise 2288 1 2285 0 S piperd 0xc2624c40 readproctitle 2287 1 2285 0 R svscan 2244 1 2244 0 Rs ntpd 2129 1 2129 0 Rs syslogd 50 0 0 0 SL mdwait 0xc2717000 [md1] 40 0 0 0 SL mdwait 0xc24ef000 [md0] 22 0 0 0 RL [flowcleaner] 21 0 0 0 SL sdflush 0xc0b24d80 [softdepflush] 20 0 0 0 SL vlruwt 0xc25f52a8 [vnlru] 19 0 0 0 SL syncer 0xc0b14774 [syncer] 18 0 0 0 RL [bufdaemon] 17 0 0 0 RL [pagezero] 16 0 0 0 SL psleep 0xc0b2567c [vmdaemon] 15 0 0 0 SL psleep 0xc0b25644 [pagedaemon] 9 0 0 0 SL ccb_scan 0xc0b006d4 [xpt_thrd] 8 0 0 0 RL [pfpurge] 7 0 0 0 SL waiting_ 0xc0b181d8 [sctp_iterator] 14 0 0 0 SL (threaded) usb 100036 D - 0xc250ed34 [usbus1] 100035 D - 0xc250ed04 [usbus1] 100034 D - 0xc250ecd4 [usbus1] 100033 D - 0xc250eca4 [usbus1] 100032 D - 0xc2504b5c [usbus0] 100031 D - 0xc2504b2c [usbus0] 100030 D - 0xc2504afc [usbus0] 100029 D - 0xc2504acc [usbus0] 13 0 0 0 SL - 0xc0b0df64 [yarrow] 6 0 0 0 SL crypto_r 0xc0b2432c [crypto returns] 5 0 0 0 SL crypto_w 0xc0b24308 [crypto] 4 0 0 0 SL - 0xc0b0bc44 [g_down] 3 0 0 0 SL - 0xc0b0bc40 [g_up] 2 0 0 0 SL - 0xc0b0bc38 [g_event] 12 0 0 0 WL (threaded) intr 100037 I [swi0: uart uart] 100028 I [irq12: ohci0 ehci0] 100027 I [irq14: ata0] 100026 I [irq15: vr2 ata1] 100025 I [irq11: vr1] 100024 I [irq10: vr0] 100021 I [swi5: +] 100016 I [swi2: cambio] 100015 I [swi6: task queue] 100014 I [swi6: Giant taskq] 100006 I [swi1: netisr 0] 100005 I [swi4: clock] 100004 I [swi3: vm] 11 0 0 0 RL [idle] 1 0 1 0 SLs wait 0xc2433d48 [init] 10 0 0 0 SL audit_wo 0xc0b246a0 [audit] 0 0 0 0 RLs (threaded) kernel 100040 D - 0xc2527880 [dummynet] 100023 D - 0xc24d6580 [glxsb0 taskq] 100022 Run CPU 0 [thread taskq] 100020 D - 0xc24afd40 [acpi_task_2] 100019 D - 0xc24afd40 [acpi_task_1] 100018 D - 0xc24afd40 [acpi_task_0] 100017 D - 0xc24afd80 [kqueue taskq] 100010 D - 0xc24512c0 [firmware taskq] 100000 D sched 0xc0b0bd20 [swapper] db> tr 100024 Tracing pid 12 tid 100024 td 0xc24d45a0 sched_switch(c24d45a0,0,109,b67db121,673,...) at 0xc07270bf = sched_switch+0x1df mi_switch(109,0,c0a6b180,52d,c24306f0,...) at 0xc071086a = mi_switch+0x11a ithread_loop(c24de8c0,cc894d28,0,0,0,...) at 0xc06e06df = ithread_loop+0x15f fork_exit(c06e0580,c24de8c0,cc894d28) at 0xc06dc830 = fork_exit+0x90 fork_trampoline() at 0xc09cf964 = fork_trampoline+0x8 --- trap 0, eip = 0, esp = 0xcc894d60, ebp = 0 --- db> tr 100022 Tracing pid 0 tid 100022 td 0xc24d4b40 kdb_enter(c0a48fbe,c0a5d00a,0,c2522280,0,...) at 0xc0733b3a = kdb_enter+0x3a uart_intr(c2522200,c24d4b40,c0afdc90,c2430980,4,...) at 0xc0632206 = uart_intr+0x126 intr_event_handle(c2430980,c23f5c20,0,c25db2bc,4,...) at 0xc06dff2c = intr_event_handle+0x5c intr_execute_handlers(c0afdc90,c23f5c20,c25db2bc,0,c23f5c60,...) at 0xc09d4929 = intr_execute_handlers+0x49 atpic_handle_intr(4,c23f5c20) at 0xc09ee047 = atpic_handle_intr+0x67 Xatpic_intr4() at 0xc09cfa52 = Xatpic_intr4+0x22 --- interrupt, eip = 0xc073cf01, esp = 0xc23f5c60, ebp = 0xc23f5c60 --- sleepq_release(c25db2bc,0,0,0,c24afc80,...) at 0xc073cf01 = sleepq_release+0x31 wakeup(c25db2bc,0,0,0,0,...) at 0xc07109f1 = wakeup+0x41 taskqueue_run_locked(c24afc80,c24afc9c,0,c0a66f59,0,...) at 0xc073e8e2 = taskqueue_run_locked+0x142 taskqueue_thread_loop(c0b134ec,c23f5d28,0,0,0,...) at 0xc073ef27 = taskqueue_thread_loop+0x87 fork_exit(c073eea0,c0b134ec,c23f5d28) at 0xc06dc830 = fork_exit+0x90 fork_trampoline() at 0xc09cf964 = fork_trampoline+0x8 --- trap 0, eip = 0, esp = 0xc23f5d60, ebp = 0 --- ... cont dance again ... db> tr 100022 Tracing pid 0 tid 100022 td 0xc24d4b40 kdb_enter(c0a48fbe,c0a5d00a,0,c2522280,0,...) at 0xc0733b3a = kdb_enter+0x3a uart_intr(c2522200,c24d4b40,c0afdc90,c2430980,4,...) at 0xc0632206 = uart_intr+0x126 intr_event_handle(c2430980,c23f5c48,c23f5c14,c24afc80,4,...) at 0xc06dff2c = intr_event_handle+0x5c intr_execute_handlers(c0afdc90,c23f5c48,c24afc80,c25db2bc,c23f5cbc,...) at 0xc09d4929 = intr_execute_handlers+0x49 atpic_handle_intr(4,c23f5c48) at 0xc09ee047 = atpic_handle_intr+0x67 Xatpic_intr4() at 0xc09cfa52 = Xatpic_intr4+0x22 --- interrupt, eip = 0xc073e7e3, esp = 0xc23f5c88, ebp = 0xc23f5cbc --- taskqueue_run_locked(c24afc80,c24afc9c,0,c0a66f59,0,...) at 0xc073e7e3 = taskqueue_run_locked+0x43 taskqueue_thread_loop(c0b134ec,c23f5d28,0,0,0,...) at 0xc073ef27 = taskqueue_thread_loop+0x87 fork_exit(c073eea0,c0b134ec,c23f5d28) at 0xc06dc830 = fork_exit+0x90 fork_trampoline() at 0xc09cf964 = fork_trampoline+0x8 --- trap 0, eip = 0, esp = 0xc23f5d60, ebp = 0 --- db> tr 100024 Tracing pid 12 tid 100024 td 0xc24d45a0 sched_switch(c24d45a0,0,109,4b19c082,684,...) at 0xc07270bf = sched_switch+0x1df mi_switch(109,0,c0a6b180,52d,c24306f0,...) at 0xc071086a = mi_switch+0x11a ithread_loop(c24de8c0,cc894d28,0,0,0,...) at 0xc06e06df = ithread_loop+0x15f fork_exit(c06e0580,c24de8c0,cc894d28) at 0xc06dc830 = fork_exit+0x90 fork_trampoline() at 0xc09cf964 = fork_trampoline+0x8 --- trap 0, eip = 0, esp = 0xcc894d60, ebp = 0 --- db> tr 100025 Tracing pid 12 tid 100025 td 0xc24d42d0 kdb_enter(c0a48fbe,c0a5d00a,58,c2522280,0,...) at 0xc0733b3a = kdb_enter+0x3a uart_intr(c2522200,c24d42d0,c0afdc90,c2430980,4,...) at 0xc0632206 = uart_intr+0x126 intr_event_handle(c2430980,cc918ac8,0,0,4,...) at 0xc06dff2c = intr_event_handle+0x5c intr_execute_handlers(c0afdc90,cc918ac8,0,c2efa300,cc918b68,...) at 0xc09d4929 = intr_execute_handlers+0x49 atpic_handle_intr(4,cc918ac8) at 0xc09ee047 = atpic_handle_intr+0x67 Xatpic_intr4() at 0xc09cfa52 = Xatpic_intr4+0x22 --- interrupt, eip = 0xc083e822, esp = 0xcc918b08, ebp = 0xcc918b68 --- ip_forward(c2efa300,0,12,0,0,...) at 0xc083e822 = ip_forward+0x3b2 ip_input(c2efa300,246,c24a1580,cc918bd0,c0622c21,...) at 0xc083ff8d = ip_input+0x6cd netisr_dispatch_src(1,0,c2efa300,cc918c08,c07bb331,...) at 0xc07c9109 = netisr_dispatch_src+0x89 netisr_dispatch(1,c2efa300,c24d8c00,c24d8c00,c2f1a808,...) at 0xc07c93a0 = netisr_dispatch+0x20 ether_demux(c24d8c00,c2efa300,3,0,3,...) at 0xc07bb331 = ether_demux+0x161 ether_input(c24d8c00,c2efa300,c09d6ad0,c0b11780,c24d42d0,...) at 0xc07bb8af = ether_input+0x33f vr_intr(c2499000,c24d42d0,0,109,48d02b08,...) at 0xc0690034 = vr_intr+0x4d4 intr_event_execute_handlers(c24337f8,c2430600,c0a6b180,52d,c2430670,...) at 0xc06def8b = intr_event_execute_handlers+0x13b ithread_loop(c24de360,cc918d28,0,0,0,...) at 0xc06e05ea = ithread_loop+0x6a fork_exit(c06e0580,c24de360,cc918d28) at 0xc06dc830 = fork_exit+0x90 fork_trampoline() at 0xc09cf964 = fork_trampoline+0x8 --- trap 0, eip = 0, esp = 0xcc918d60, ebp = 0 --- _______________________________________________ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"