On Aug 16, 2011, at 20:15, Julian Elischer wrote:

> from your description it doesn't sound like a vr problem.
> I suggest you hook up teh serial console (I am guessing you already have)
> and set the config options to allow break-to-debugger or 
> alt-break-to-debugger on it
> when it happens next, drop into the debugger..
> 
> in fact, drop in, and do a ps to see what processes are runnable,
> 'tr [pid] (or thread id)' to get a stack trace of anything that looks
> interesting, and then cont and do it again a few times to get a feel
> for where the processor is hanging out (a straight 'tr' will give you
> the interrupt of the com port which is not intresting..)

Alright, I thought I had locked this down to ZERO_COPY_SOCKETS being enabled in 
the kernel (as unlikely as it seems that it'd break things without being 
explicitly used -- I wanted to believe!).

A couple days ago I figured out how to consistently make it happen (or not).  
For about a week we'd been running with no trouble; until I realized that a 
hostname added to our pf rules made them not load on startup.  When running 
without our pf rules loaded, everything is fine.  WIth the pf rules loaded the 
system will hang within 2-4 hours.

Our ruleset was about 240 rules.  We cut it down to ~140 rules today to see if 
that made a difference; it didn't.  We also turned off as much IPv6 traffic as 
we could in case that was what was disturbing things (adding IPv6 traffic is 
relatively new).

Flipping the backup system to be the master for the active IP addresses doesn't 
(as far as I can tell) make the old master recover.

I tried looking for interesting information with the kernel debugger; but 
honestly nothing springs out as interesting -- though that may be due to my 
lack of skills more than anything else.   I am including a snapshot below.  I 
ran ps and tr on 100024, 100025 (vr0 and vr1 interrupt threads) and 100022 
(thread taskq) a few more times but it seemed to me to give basically the same 
thread trace information.

For what it's worth, if I let 'cont' run for long enough for the system to send 
out carp packets and get it's IPs back, 'ps' seemed more likely to have 100024 
or 100025 (vr0/vr1) in 'Run' state.  If I interrupted the system again 
relatively quickly after continuing, it was more likely to have 'thread tasks' 
be the busy process.

Any tips for what I can do to extract something more useful would be greatly 
appreciated.

As a last note -- earlier today I noticed that "Searches" in 'pfctl -si' seemed 
to go up significantly just before the console locks up (which is 15-30 minutes 
before the box stops routing packets).   My theory was that we were getting 
some sort of tiny DoS attack (doesn't take much to take out a computer the size 
of a CD case); but that doesn't really make sense as the system 1) never 
recovers and 2) the backup is fine as soon as the master has been turned off / 
changed priorities.


 - ask


db> ps
  pid  ppid  pgrp   uid   state   wmesg     wchan    cmd
 3300  2371  2371     0  R                           sshd
 3299  3298  3298     0  S+      sbwait   0xc2ddb58c ssh
 3298  3100  3298     0  S+      piperd   0xc2624620 scp
 3100  3098  3100     0  Ss+     pause    0xc2e95850 csh
 3098  2371  3098     0  Rs                          sshd
 2904  2371  2904     0  Ss      select   0xc379a924 sshd
 2476  2474  2476     0  Ss+     ttyin    0xc2e44070 csh
 2474  2371  2474     0  Rs                          sshd
 2440     1  2440     0  Rs+                         getty
 2433     1  2433     0  Rs                          cron
 2426     1  2426    62  Rs                          ftp-proxy
 2397     1  2397     0  Ss      select   0xc2d71964 inetd
 2371     1  2371     0  Rs                          sshd
 2349     1  2349     0  Rs                          bird
 2302     1  2302     0  Rs                          radvd
 2298  2292  2285 65534  S       piperd   0xc2624188 multilog
 2297  2291  2285     0  R                           openvpn
 2292  2287  2285     0  S       select   0xc2d713a4 supervise
 2291  2287  2285     0  S       select   0xc2d71a64 supervise
 2288     1  2285     0  S       piperd   0xc2624c40 readproctitle
 2287     1  2285     0  R                           svscan
 2244     1  2244     0  Rs                          ntpd
 2129     1  2129     0  Rs                          syslogd
   50     0     0     0  SL      mdwait   0xc2717000 [md1]
   40     0     0     0  SL      mdwait   0xc24ef000 [md0]
   22     0     0     0  RL                          [flowcleaner]
   21     0     0     0  SL      sdflush  0xc0b24d80 [softdepflush]
   20     0     0     0  SL      vlruwt   0xc25f52a8 [vnlru]
   19     0     0     0  SL      syncer   0xc0b14774 [syncer]
   18     0     0     0  RL                          [bufdaemon]
   17     0     0     0  RL                          [pagezero]
   16     0     0     0  SL      psleep   0xc0b2567c [vmdaemon]
   15     0     0     0  SL      psleep   0xc0b25644 [pagedaemon]
    9     0     0     0  SL      ccb_scan 0xc0b006d4 [xpt_thrd]
    8     0     0     0  RL                          [pfpurge]
    7     0     0     0  SL      waiting_ 0xc0b181d8 [sctp_iterator]
   14     0     0     0  SL      (threaded)          usb
100036                   D       -        0xc250ed34 [usbus1]
100035                   D       -        0xc250ed04 [usbus1]
100034                   D       -        0xc250ecd4 [usbus1]
100033                   D       -        0xc250eca4 [usbus1]
100032                   D       -        0xc2504b5c [usbus0]
100031                   D       -        0xc2504b2c [usbus0]
100030                   D       -        0xc2504afc [usbus0]
100029                   D       -        0xc2504acc [usbus0]
   13     0     0     0  SL      -        0xc0b0df64 [yarrow]
    6     0     0     0  SL      crypto_r 0xc0b2432c [crypto returns]
    5     0     0     0  SL      crypto_w 0xc0b24308 [crypto]
    4     0     0     0  SL      -        0xc0b0bc44 [g_down]
    3     0     0     0  SL      -        0xc0b0bc40 [g_up]
    2     0     0     0  SL      -        0xc0b0bc38 [g_event]
   12     0     0     0  RL      (threaded)          intr
100037                   I                           [swi0: uart uart]
100028                   I                           [irq12: ohci0 ehci0]
100027                   I                           [irq14: ata0]
100026                   I                           [irq15: vr2 ata1]
100025                   Run     CPU 0               [irq11: vr1]
100024                   RunQ                        [irq10: vr0]
100021                   I                           [swi5: +]
100016                   I                           [swi2: cambio]
100015                   I                           [swi6: task queue]
100014                   I                           [swi6: Giant taskq]
100006                   I                           [swi1: netisr 0]
100005                   I                           [swi4: clock]
100004                   I                           [swi3: vm]
   11     0     0     0  RL                          [idle]
    1     0     1     0  SLs     wait     0xc2433d48 [init]
   10     0     0     0  SL      audit_wo 0xc0b246a0 [audit]
    0     0     0     0  RLs     (threaded)          kernel
100040                   D       -        0xc2527880 [dummynet]
100023                   D       -        0xc24d6580 [glxsb0 taskq]
100022                   RunQ                        [thread taskq]
100020                   D       -        0xc24afd40 [acpi_task_2]
100019                   D       -        0xc24afd40 [acpi_task_1]
100018                   D       -        0xc24afd40 [acpi_task_0]
100017                   D       -        0xc24afd80 [kqueue taskq]
100010                   D       -        0xc24512c0 [firmware taskq]
100000                   D       sched    0xc0b0bd20 [swapper]


db>  tr 100025
Tracing pid 12 tid 100025 td 0xc24d42d0
kdb_enter(c0a48fbe,c0a5d00a,1,c2522280,0,...) at 0xc0733b3a = kdb_enter+0x3a
uart_intr(c2522200,c24d42d0,c0afdc90,c2430980,4,...) at 0xc0632206 = 
uart_intr+0x126
intr_event_handle(c2430980,cc918ac4,cc918aa4,cc918b40,4,...) at 0xc06dff2c = 
intr_event_handle+0x5c
intr_execute_handlers(c0afdc90,cc918ac4,cc918b40,c2ef8a00,cc918b68,...) at 
0xc09d4929 = intr_execute_handlers+0x49
atpic_handle_intr(4,cc918ac4) at 0xc09ee047 = atpic_handle_intr+0x67
Xatpic_intr4() at 0xc09cfa52 = Xatpic_intr4+0x22
--- interrupt, eip = 0xc09e0f58, esp = 0xcc918b04, ebp = 0xcc918b68 ---
bzero(c2ef8a00,0,12,0,0,...) at 0xc09e0f58 = bzero
ip_input(c2ef8a00,246,c24a77c0,cc918bd0,c0622c21,...) at 0xc083ff8d = 
ip_input+0x6cd
netisr_dispatch_src(1,0,c2ef8a00,cc918c08,c07bb331,...) at 0xc07c9109 = 
netisr_dispatch_src+0x89
netisr_dispatch(1,c2ef8a00,c24d8c00,c24d8c00,c2f16008,...) at 0xc07c93a0 = 
netisr_dispatch+0x20
ether_demux(c24d8c00,c2ef8a00,3,0,3,...) at 0xc07bb331 = ether_demux+0x161
ether_input(c24d8c00,c2ef8a00,c09d6ad0,c0b11780,c24d42d0,...) at 0xc07bb8af = 
ether_input+0x33f
vr_intr(c2499000,c24d42d0,0,109,dc13a3e2,...) at 0xc0690034 = vr_intr+0x4d4
intr_event_execute_handlers(c24337f8,c2430600,c0a6b180,52d,c2430670,...) at 
0xc06def8b = intr_event_execute_handlers+0x13b
ithread_loop(c24de360,cc918d28,0,0,0,...) at 0xc06e05ea = ithread_loop+0x6a
fork_exit(c06e0580,c24de360,cc918d28) at 0xc06dc830 = fork_exit+0x90
fork_trampoline() at 0xc09cf964 = fork_trampoline+0x8
--- trap 0, eip = 0, esp = 0xcc918d60, ebp = 0 ---

db> tr 100024
Tracing pid 12 tid 100024 td 0xc24d45a0
sched_switch(c24d45a0,0,109,dc1435c0,660,...) at 0xc07270bf = sched_switch+0x1df
mi_switch(109,0,c0a6b180,52d,c24306f0,...) at 0xc071086a = mi_switch+0x11a
ithread_loop(c24de8c0,cc894d28,0,0,0,...) at 0xc06e06df = ithread_loop+0x15f
fork_exit(c06e0580,c24de8c0,cc894d28) at 0xc06dc830 = fork_exit+0x90
fork_trampoline() at 0xc09cf964 = fork_trampoline+0x8
--- trap 0, eip = 0, esp = 0xcc894d60, ebp = 0 ---


db> cont
KDB: enter: Line break on console
[thread pid 0 tid 100022 ]
Stopped at      0xc0733b3a = kdb_enter+0x3a:    movl    $0,0xc0b124b4 = kdb_why
db> ps 
  pid  ppid  pgrp   uid   state   wmesg     wchan    cmd
 3300  2371  2371     0  R                           sshd
 3299  3298  3298     0  S+      sbwait   0xc2ddb58c ssh
 3298  3100  3298     0  S+      piperd   0xc2624620 scp
 3100  3098  3100     0  Ss+     pause    0xc2e95850 csh
 3098  2371  3098     0  Rs                          sshd
 2904  2371  2904     0  Ss      select   0xc379a924 sshd
 2476  2474  2476     0  Ss+     ttyin    0xc2e44070 csh
 2474  2371  2474     0  Rs                          sshd
 2440     1  2440     0  Rs+                         getty
 2433     1  2433     0  Rs                          cron
 2426     1  2426    62  Rs                          ftp-proxy
 2397     1  2397     0  Ss      select   0xc2d71964 inetd
 2371     1  2371     0  Rs                          sshd
 2349     1  2349     0  Rs                          bird
 2302     1  2302     0  Rs                          radvd
 2298  2292  2285 65534  S       piperd   0xc2624188 multilog
 2297  2291  2285     0  R                           openvpn
 2292  2287  2285     0  S       select   0xc2d713a4 supervise
 2291  2287  2285     0  S       select   0xc2d71a64 supervise
 2288     1  2285     0  S       piperd   0xc2624c40 readproctitle
 2287     1  2285     0  R                           svscan
 2244     1  2244     0  Rs                          ntpd
 2129     1  2129     0  Rs                          syslogd
   50     0     0     0  SL      mdwait   0xc2717000 [md1]
   40     0     0     0  SL      mdwait   0xc24ef000 [md0]
   22     0     0     0  RL                          [flowcleaner]
   21     0     0     0  SL      sdflush  0xc0b24d80 [softdepflush]
   20     0     0     0  SL      vlruwt   0xc25f52a8 [vnlru]
   19     0     0     0  SL      syncer   0xc0b14774 [syncer]
   18     0     0     0  RL                          [bufdaemon]
   17     0     0     0  RL                          [pagezero]
   16     0     0     0  SL      psleep   0xc0b2567c [vmdaemon]
   15     0     0     0  SL      psleep   0xc0b25644 [pagedaemon]
    9     0     0     0  SL      ccb_scan 0xc0b006d4 [xpt_thrd]
    8     0     0     0  RL                          [pfpurge]
    7     0     0     0  SL      waiting_ 0xc0b181d8 [sctp_iterator]
   14     0     0     0  SL      (threaded)          usb
100036                   D       -        0xc250ed34 [usbus1]
100035                   D       -        0xc250ed04 [usbus1]
100034                   D       -        0xc250ecd4 [usbus1]
100033                   D       -        0xc250eca4 [usbus1]
100032                   D       -        0xc2504b5c [usbus0]
100031                   D       -        0xc2504b2c [usbus0]
100030                   D       -        0xc2504afc [usbus0]
100029                   D       -        0xc2504acc [usbus0]
   13     0     0     0  SL      -        0xc0b0df64 [yarrow]
    6     0     0     0  SL      crypto_r 0xc0b2432c [crypto returns]
    5     0     0     0  SL      crypto_w 0xc0b24308 [crypto]
    4     0     0     0  SL      -        0xc0b0bc44 [g_down]
    3     0     0     0  SL      -        0xc0b0bc40 [g_up]
    2     0     0     0  SL      -        0xc0b0bc38 [g_event]
   12     0     0     0  WL      (threaded)          intr
100037                   I                           [swi0: uart uart]
100028                   I                           [irq12: ohci0 ehci0]
100027                   I                           [irq14: ata0]
100026                   I                           [irq15: vr2 ata1]
100025                   I                           [irq11: vr1]
100024                   I                           [irq10: vr0]
100021                   I                           [swi5: +]
100016                   I                           [swi2: cambio]
100015                   I                           [swi6: task queue]
100014                   I                           [swi6: Giant taskq]
100006                   I                           [swi1: netisr 0]
100005                   I                           [swi4: clock]
100004                   I                           [swi3: vm]
   11     0     0     0  RL                          [idle]
    1     0     1     0  SLs     wait     0xc2433d48 [init]
   10     0     0     0  SL      audit_wo 0xc0b246a0 [audit]
    0     0     0     0  RLs     (threaded)          kernel
100040                   D       -        0xc2527880 [dummynet]
100023                   D       -        0xc24d6580 [glxsb0 taskq]
100022                   Run     CPU 0               [thread taskq]
100020                   D       -        0xc24afd40 [acpi_task_2]
100019                   D       -        0xc24afd40 [acpi_task_1]
100018                   D       -        0xc24afd40 [acpi_task_0]
100017                   D       -        0xc24afd80 [kqueue taskq]
100010                   D       -        0xc24512c0 [firmware taskq]
100000                   D       sched    0xc0b0bd20 [swapper]
db>  tr 100024
Tracing pid 12 tid 100024 td 0xc24d45a0
sched_switch(c24d45a0,0,109,b67db121,673,...) at 0xc07270bf = sched_switch+0x1df
mi_switch(109,0,c0a6b180,52d,c24306f0,...) at 0xc071086a = mi_switch+0x11a
ithread_loop(c24de8c0,cc894d28,0,0,0,...) at 0xc06e06df = ithread_loop+0x15f
fork_exit(c06e0580,c24de8c0,cc894d28) at 0xc06dc830 = fork_exit+0x90
fork_trampoline() at 0xc09cf964 = fork_trampoline+0x8
--- trap 0, eip = 0, esp = 0xcc894d60, ebp = 0 ---
db> tr 100022
Tracing pid 0 tid 100022 td 0xc24d4b40
kdb_enter(c0a48fbe,c0a5d00a,0,c2522280,0,...) at 0xc0733b3a = kdb_enter+0x3a
uart_intr(c2522200,c24d4b40,c0afdc90,c2430980,4,...) at 0xc0632206 = 
uart_intr+0x126
intr_event_handle(c2430980,c23f5c20,0,c25db2bc,4,...) at 0xc06dff2c = 
intr_event_handle+0x5c
intr_execute_handlers(c0afdc90,c23f5c20,c25db2bc,0,c23f5c60,...) at 0xc09d4929 
= intr_execute_handlers+0x49
atpic_handle_intr(4,c23f5c20) at 0xc09ee047 = atpic_handle_intr+0x67
Xatpic_intr4() at 0xc09cfa52 = Xatpic_intr4+0x22
--- interrupt, eip = 0xc073cf01, esp = 0xc23f5c60, ebp = 0xc23f5c60 ---
sleepq_release(c25db2bc,0,0,0,c24afc80,...) at 0xc073cf01 = sleepq_release+0x31
wakeup(c25db2bc,0,0,0,0,...) at 0xc07109f1 = wakeup+0x41
taskqueue_run_locked(c24afc80,c24afc9c,0,c0a66f59,0,...) at 0xc073e8e2 = 
taskqueue_run_locked+0x142
taskqueue_thread_loop(c0b134ec,c23f5d28,0,0,0,...) at 0xc073ef27 = 
taskqueue_thread_loop+0x87
fork_exit(c073eea0,c0b134ec,c23f5d28) at 0xc06dc830 = fork_exit+0x90
fork_trampoline() at 0xc09cf964 = fork_trampoline+0x8
--- trap 0, eip = 0, esp = 0xc23f5d60, ebp = 0 ---

... cont dance again ...

db> tr 100022
Tracing pid 0 tid 100022 td 0xc24d4b40
kdb_enter(c0a48fbe,c0a5d00a,0,c2522280,0,...) at 0xc0733b3a = kdb_enter+0x3a
uart_intr(c2522200,c24d4b40,c0afdc90,c2430980,4,...) at 0xc0632206 = 
uart_intr+0x126
intr_event_handle(c2430980,c23f5c48,c23f5c14,c24afc80,4,...) at 0xc06dff2c = 
intr_event_handle+0x5c
intr_execute_handlers(c0afdc90,c23f5c48,c24afc80,c25db2bc,c23f5cbc,...) at 
0xc09d4929 = intr_execute_handlers+0x49
atpic_handle_intr(4,c23f5c48) at 0xc09ee047 = atpic_handle_intr+0x67
Xatpic_intr4() at 0xc09cfa52 = Xatpic_intr4+0x22
--- interrupt, eip = 0xc073e7e3, esp = 0xc23f5c88, ebp = 0xc23f5cbc ---
taskqueue_run_locked(c24afc80,c24afc9c,0,c0a66f59,0,...) at 0xc073e7e3 = 
taskqueue_run_locked+0x43
taskqueue_thread_loop(c0b134ec,c23f5d28,0,0,0,...) at 0xc073ef27 = 
taskqueue_thread_loop+0x87
fork_exit(c073eea0,c0b134ec,c23f5d28) at 0xc06dc830 = fork_exit+0x90
fork_trampoline() at 0xc09cf964 = fork_trampoline+0x8
--- trap 0, eip = 0, esp = 0xc23f5d60, ebp = 0 ---

db> tr 100024
Tracing pid 12 tid 100024 td 0xc24d45a0
sched_switch(c24d45a0,0,109,4b19c082,684,...) at 0xc07270bf = sched_switch+0x1df
mi_switch(109,0,c0a6b180,52d,c24306f0,...) at 0xc071086a = mi_switch+0x11a
ithread_loop(c24de8c0,cc894d28,0,0,0,...) at 0xc06e06df = ithread_loop+0x15f
fork_exit(c06e0580,c24de8c0,cc894d28) at 0xc06dc830 = fork_exit+0x90
fork_trampoline() at 0xc09cf964 = fork_trampoline+0x8
--- trap 0, eip = 0, esp = 0xcc894d60, ebp = 0 ---

db> tr 100025
Tracing pid 12 tid 100025 td 0xc24d42d0
kdb_enter(c0a48fbe,c0a5d00a,58,c2522280,0,...) at 0xc0733b3a = kdb_enter+0x3a
uart_intr(c2522200,c24d42d0,c0afdc90,c2430980,4,...) at 0xc0632206 = 
uart_intr+0x126
intr_event_handle(c2430980,cc918ac8,0,0,4,...) at 0xc06dff2c = 
intr_event_handle+0x5c
intr_execute_handlers(c0afdc90,cc918ac8,0,c2efa300,cc918b68,...) at 0xc09d4929 
= intr_execute_handlers+0x49
atpic_handle_intr(4,cc918ac8) at 0xc09ee047 = atpic_handle_intr+0x67
Xatpic_intr4() at 0xc09cfa52 = Xatpic_intr4+0x22
--- interrupt, eip = 0xc083e822, esp = 0xcc918b08, ebp = 0xcc918b68 ---
ip_forward(c2efa300,0,12,0,0,...) at 0xc083e822 = ip_forward+0x3b2
ip_input(c2efa300,246,c24a1580,cc918bd0,c0622c21,...) at 0xc083ff8d = 
ip_input+0x6cd
netisr_dispatch_src(1,0,c2efa300,cc918c08,c07bb331,...) at 0xc07c9109 = 
netisr_dispatch_src+0x89
netisr_dispatch(1,c2efa300,c24d8c00,c24d8c00,c2f1a808,...) at 0xc07c93a0 = 
netisr_dispatch+0x20
ether_demux(c24d8c00,c2efa300,3,0,3,...) at 0xc07bb331 = ether_demux+0x161
ether_input(c24d8c00,c2efa300,c09d6ad0,c0b11780,c24d42d0,...) at 0xc07bb8af = 
ether_input+0x33f
vr_intr(c2499000,c24d42d0,0,109,48d02b08,...) at 0xc0690034 = vr_intr+0x4d4
intr_event_execute_handlers(c24337f8,c2430600,c0a6b180,52d,c2430670,...) at 
0xc06def8b = intr_event_execute_handlers+0x13b
ithread_loop(c24de360,cc918d28,0,0,0,...) at 0xc06e05ea = ithread_loop+0x6a
fork_exit(c06e0580,c24de360,cc918d28) at 0xc06dc830 = fork_exit+0x90
fork_trampoline() at 0xc09cf964 = fork_trampoline+0x8
--- trap 0, eip = 0, esp = 0xcc918d60, ebp = 0 ---


_______________________________________________
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"

Reply via email to