This look familiar to anyone? (bug in 4.11 maybe)

Julian Elischer Tue, 24 Jul 2001 15:34:43 -0700


I know this is not a -current problem, but if someone has already fixed it,
they are more likely to be reading here than on -stable.


We have a hybrid (4.11+patches) kernel that sometimes crashes.
The crash always has the same symptoms, and I'm hoping that
they look familiar to someone...

The message is below, followed by analysis.

Fatal trap 12: page fault while in kernel mode
fault virtual address   = 0xe6b95cc8
fault code              = supervisor read, page not present
instruction pointer     = 0x8:0xc01846d9
stack pointer           = 0x10:0xc954de64
frame pointer           = 0x10:0xc954de84
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, def32 1, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 10326 (qftListener)
interrupt mask          = none
trap number             = 12


In a VFS operation, %ecx gets corrupted (maybe by an interrupt?)
between the instruction where it's loaded with a constant
and the instruction where it's used...  It's always the same instruction,
though often in DIFFERENT VFS operations (fsync, bwrite so far).

The trap frame usually looks like:

#4  0xc0251813 in trap (frame={tf_fs = 0x10, tf_es = 0x10, tf_ds = 0x10,
tf_edi = 0x0, tf_esi = 0x1, tf_ebp = 0xc954de84, 
      tf_isp = 0xc954de50, tf_ebx = 0xc27d6d80, tf_edx = 0xc1344600,
tf_ecx = 0xc96145b2, tf_eax = 0xc954de78, tf_trapno = 0xc, 
      tf_err = 0x0, tf_eip = 0xc01846d9, tf_cs = 0x8, tf_eflags = 0x10286,
tf_esp = 0xc954de78, tf_ss = 0xc27d6d80})
    at /usr/src/sys/i386/i386/trap.c:443
#5  0xc01846d9 in bwrite (bp=0xc27d6d80) at vnode_if.h:923
#6  0xc0189be2 in vop_stdbwrite (ap=0xc954deb4) at
/usr/src/sys/kern/vfs_default.c:319


the code there looks like:

(kgdb) up 5
#5  0xc01846d9 in bwrite (bp=0xc27d6d80) at vnode_if.h:923
923             rc = VCALL(vp, VOFFSET(vop_strategy), &a);
(kgdb) list
918             struct vop_strategy_args a;
919             int rc;
920             a.a_desc = VDESC(vop_strategy);
921             a.a_vp = vp;
922             a.a_bp = bp;
923             rc = VCALL(vp, VOFFSET(vop_strategy), &a); <-------here
924             return (rc);
925     }
926     struct vop_print_args {
927             struct vnodeop_desc *a_desc;

In Assembler:

0xc01846cc <bwrite+460>:        mov    0xc029dcc0,%ecx
0xc01846d2 <bwrite+466>:        mov    0x18(%eax),%edx
0xc01846d5 <bwrite+469>:        lea    0xfffffff4(%ebp),%eax
0xc01846d8 <bwrite+472>:        push   %eax
0xc01846d9 <bwrite+473>:        mov    (%edx,%ecx,4),%eax <<<<< **POW**
0xc01846dc <bwrite+476>:        call   *%eax
0xc01846de <bwrite+478>:        add    $0x4,%esp
0xc01846e1 <bwrite+481>:        mov    0xfffffff0(%ebp),%eax

Looking at the registers in the trap frame,
%edx = 0xc1344600,
%ecx = 0xc96145b2,
and
0xc1344600 + (4 * 0xc96145b2) = 0x3e6b95cc8,
the lower 32 bits of which (0xe6b95cc8) match the fault address.

But in the code above we see that %ecx was just loaded from
location 0xc029dcc0, which contains:
(kgdb) x/x 0xc029dcc0     
0xc029dcc0 <vop_strategy_desc>: 0x12

0x12 is the correct offset for a strategy call.

So %ecx got corrupted between the instruction at 0xc01846cc
and the one at 0xc01846d9.

Note that the contents of %ecx (0xc96145b2) are an address
somewhat higher than the kernel stack at the time in question.
A dump of RAM in that area shows:
(kgdb) x/64xw 0xc96145a0
0xc96145a0:     0xc954e900      0xc9709c00      0x00000000      0xc96145a8
0xc96145b0:    [0xc9580660]     0xc95c7370      0xc04d7504      0xc04d47d4
0xc96145c0:     0x0000aa26      0x00000020      0x00000000      0x00000000
0xc96145d0:     0xfc812c38      0x00000002      0x00040010      0x00000020
0xc96145e0:     0x00000000      0x00000000      0x00000000      0x00000000
0xc96145f0:     0x00000000      0xc9636a40      0x0001fc93      0x00000000
0xc9614600:     0xc02ed7c0      0xc95b4120      0x00000000      0xc9614608
0xc9614610:     0x00000000      0xc9555548      0x00000000      0xc9614618
0xc9614620:     0x00003f5b      0x00000003      0x00000000      0x00000000
0xc9614630:     0xfe37c115      0x21880000      0x0000000e      0x00000000
0xc9614640:     0x00000000      0x00000000      0x00000000      0x00000000
0xc9614650:     0x00000000      0x00000000      0x00000000      0x00000000
0xc9614660:     0xc9722ae0      0xc961c600      0x00000000      0xc9614668
0xc9614670:     0xc9690660      0xc97091f0      0x00000000      0xc9614678
0xc9614680:     0x0000cabf      0x00000012      0x00000000      0x00000000
0xc9614690:     0xfc8189f2      0x00000002      0x0000001d      0x00000000

This is obviously SOMETHING, but what? And why does %ecx point HALF WAY
THROUGH an obvious 32-bit pointer?

Thoughts of hardware problems do come to mind... but..

My present line of attack is to change the page-fault handler
to leave a 500-byte window untouched on the stack (except for the
frame) so that I can try to see whether an interrupt occurred
recently, and if so, what it was....


