I know this is not a -current problem, but if it was fixed by someone, they
are likely to be reading here and not in -stable.
We have a hybrid (4.11+patches) kernel that sometimes crashes.
The crash always has the same symptoms, and I'm hoping that
they look familiar to someone...
The message is below, followed by analysis.
Fatal trap 12: page fault while in kernel mode
fault virtual address = 0xe6b95cc8
fault code = supervisor read, page not present
instruction pointer = 0x8:0xc01846d9
stack pointer = 0x10:0xc954de64
frame pointer = 0x10:0xc954de84
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, def32 1, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 10326 (qftListener)
interrupt mask = none
trap number = 12
In a VFS operation, %ecx gets corrupted (maybe by an interrupt?)
between the instruction where it's loaded with a constant
and the instruction where it's used. It's always the same instruction,
though often in DIFFERENT VFS operations (fsync, bwrite so far).
The trap frame usually looks like:
#4 0xc0251813 in trap (frame={tf_fs = 0x10, tf_es = 0x10, tf_ds = 0x10,
tf_edi = 0x0, tf_esi = 0x1, tf_ebp = 0xc954de84,
tf_isp = 0xc954de50, tf_ebx = 0xc27d6d80, tf_edx = 0xc1344600,
tf_ecx = 0xc96145b2, tf_eax = 0xc954de78, tf_trapno = 0xc,
tf_err = 0x0, tf_eip = 0xc01846d9, tf_cs = 0x8, tf_eflags = 0x10286,
tf_esp = 0xc954de78, tf_ss = 0xc27d6d80})
at /usr/src/sys/i386/i386/trap.c:443
#5 0xc01846d9 in bwrite (bp=0xc27d6d80) at vnode_if.h:923
#6 0xc0189be2 in vop_stdbwrite (ap=0xc954deb4) at
/usr/src/sys/kern/vfs_default.c:319
the code there looks like:
(kgdb) up 5
#5 0xc01846d9 in bwrite (bp=0xc27d6d80) at vnode_if.h:923
923 rc = VCALL(vp, VOFFSET(vop_strategy), &a);
(kgdb) list
918 struct vop_strategy_args a;
919 int rc;
920 a.a_desc = VDESC(vop_strategy);
921 a.a_vp = vp;
922 a.a_bp = bp;
923 rc = VCALL(vp, VOFFSET(vop_strategy), &a); <-------here
924 return (rc);
925 }
926 struct vop_print_args {
927 struct vnodeop_desc *a_desc;
In Assembler:
0xc01846cc <bwrite+460>: mov 0xc029dcc0,%ecx
0xc01846d2 <bwrite+466>: mov 0x18(%eax),%edx
0xc01846d5 <bwrite+469>: lea 0xfffffff4(%ebp),%eax
0xc01846d8 <bwrite+472>: push %eax
0xc01846d9 <bwrite+473>: mov (%edx,%ecx,4),%eax <<<<< **POW**
0xc01846dc <bwrite+476>: call *%eax
0xc01846de <bwrite+478>: add $0x4,%esp
0xc01846e1 <bwrite+481>: mov 0xfffffff0(%ebp),%eax
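For readers who don't have the macros in their head, here is a minimal C model of what the faulting pair of instructions does: fetch a function pointer from a table at a given index, then call it. The names model_vnode, model_strategy and model_vcall are made up for illustration and are not the kernel's definitions; the mapping of %edx to the table and %ecx to the index is read off the disassembly above.

```c
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the kernel's definitions. */
typedef int (*vop_fn)(void *args);

struct model_vnode {
    vop_fn *v_op;               /* table of operation handlers (%edx above) */
};

/* Dummy handler so the model has something to dispatch to. */
static int model_strategy(void *args)
{
    (void)args;
    return 7;
}

/*
 * Indexed dispatch: load table[offset], then call through it.
 * In the disassembly the load is "mov (%edx,%ecx,4),%eax" and the
 * call is "call *%eax". If offset (%ecx) is garbage, the load itself
 * touches an unmapped address and faults before any call happens.
 */
static int model_vcall(struct model_vnode *vp, size_t offset, void *args)
{
    vop_fn fn = vp->v_op[offset];
    return fn(args);
}
```

With a table whose slot 0x12 holds model_strategy, model_vcall(&vn, 0x12, NULL) simply returns the handler's result; the crash corresponds to the same load being done with a wild index.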
Looking at the registers,
%edx = 0xc1344600,
%ecx = 0xc96145b2,
and
0xC1344600 + (4 * 0xC96145B2) = 0x3E6B95CC8,
the lower 32 bits of which (0xE6B95CC8) match the fault address.
But in the code above we see that %ecx was just loaded from
location 0xc029dcc0, which contains:
(kgdb) x/x 0xc029dcc0
0xc029dcc0 <vop_strategy_desc>: 0x12
0x12 is the correct offset for a strategy call,
so %ecx got corrupted between the instruction at 0xc01846cc
and that at 0xc01846d9.
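The address arithmetic can be double-checked with a few lines of C. This is only a sketch of the base + index*scale calculation with 32-bit truncation; the function names are mine, and the register values are the ones from the trap frame.

```c
#include <stdint.h>

/* Full (untruncated) base + index * scale, kept in 64 bits. */
static uint64_t full_sum(uint32_t base, uint32_t index, uint32_t scale)
{
    return (uint64_t)base + (uint64_t)index * scale;
}

/* The same sum truncated to 32 bits, as an i386 address would be. */
static uint32_t effective_address(uint32_t base, uint32_t index, uint32_t scale)
{
    return (uint32_t)full_sum(base, index, scale);
}
```

Calling effective_address(0xc1344600, 0xc96145b2, 4) gives 0xe6b95cc8, the fault address, while the expected index 0x12 would have given 0xc1344648, a slot inside the table that %edx points at.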
Note that the contents of %ecx (0xc96145b2) are an address
somewhat higher than the kernel stack at the time in question.
A dump of RAM in that area shows:
(kgdb) x/64xw 0xc96145a0
0xc96145a0: 0xc954e900 0xc9709c00 0x00000000 0xc96145a8
0xc96145b0: [0xc9580660] 0xc95c7370 0xc04d7504 0xc04d47d4
0xc96145c0: 0x0000aa26 0x00000020 0x00000000 0x00000000
0xc96145d0: 0xfc812c38 0x00000002 0x00040010 0x00000020
0xc96145e0: 0x00000000 0x00000000 0x00000000 0x00000000
0xc96145f0: 0x00000000 0xc9636a40 0x0001fc93 0x00000000
0xc9614600: 0xc02ed7c0 0xc95b4120 0x00000000 0xc9614608
0xc9614610: 0x00000000 0xc9555548 0x00000000 0xc9614618
0xc9614620: 0x00003f5b 0x00000003 0x00000000 0x00000000
0xc9614630: 0xfe37c115 0x21880000 0x0000000e 0x00000000
0xc9614640: 0x00000000 0x00000000 0x00000000 0x00000000
0xc9614650: 0x00000000 0x00000000 0x00000000 0x00000000
0xc9614660: 0xc9722ae0 0xc961c600 0x00000000 0xc9614668
0xc9614670: 0xc9690660 0xc97091f0 0x00000000 0xc9614678
0xc9614680: 0x0000cabf 0x00000012 0x00000000 0x00000000
0xc9614690: 0xfc8189f2 0x00000002 0x0000001d 0x00000000
This is obviously SOMETHING, but what? And why does %ecx point HALF WAY
THROUGH an obvious 32-bit pointer?
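To make the "half way through" point concrete: 0xc96145b2 is 2 bytes past the word at 0xc96145b0, so a 32-bit little-endian load from it would straddle the two dump words 0xc9580660 and 0xc95c7370. The sketch below only illustrates that byte layout; the helper names are mine, and it says nothing about how the value ended up in %ecx.

```c
#include <stdint.h>

/* Store a 32-bit value into a byte buffer in little-endian order. */
static void write_le32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v & 0xff);
    p[1] = (uint8_t)((v >> 8) & 0xff);
    p[2] = (uint8_t)((v >> 16) & 0xff);
    p[3] = (uint8_t)((v >> 24) & 0xff);
}

/* Read a 32-bit little-endian value from any byte offset. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Value seen by a 4-byte load that starts 'offset' bytes (0..4)
 * into two adjacent 32-bit words w0 and w1.
 */
static uint32_t straddling_read(uint32_t w0, uint32_t w1, unsigned offset)
{
    uint8_t buf[8];

    write_le32(buf, w0);
    write_le32(buf + 4, w1);
    return read_le32(buf + offset);
}
```

Here (0xc96145b2 & 3) is 2, and straddling_read(0xc9580660, 0xc95c7370, 2) yields 0x7370c958, i.e. the top half of the first pointer followed by the bottom half of the second.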
Thoughts of hardware problems do come to mind... but..
My present line of attack is to change the page-fault handler
to leave a 500-byte window untouched on the stack (except for the
frame) so that I can try to see whether an interrupt occurred
recently, and if so, what it was.