Lars Eggert wrote:
> Fatal trap 12: page fault while in kernel mode
> cpuid = 0; lapic.id = 00000000
> fault virtual address   = 0x34
                            ****
> fault code              = supervisor read, page not present
> instruction pointer     = 0x8:0xc01b28a6

[ ... ]

> kernel: type 12 trap, code=0
> Stopped at      _mtx_lock_flags+0x26:   cmpl    $0xc03884a0,0(%esi)

[ ... ]

> trap_fatal(e91a5780,34,c0372ee0,2e4,c658e780) at trap_fatal+0x250
> trap_pfault(e91a5780,0,34,c03e0758,34) at trap_pfault+0x17a
> trap(c21a0018,10,c0360010,9e,34) at trap+0x3e5
> calltrap() at calltrap+0x5
> --- trap 0xc, eip = 0xc01b28a6, esp = 0xe91a57c0, ebp = 0xe91a57e0 ---
> _mtx_lock_flags(34,0,c035cf5f,9e,c658e780) at _mtx_lock_flags+0x26
                  **

Attempt to dereference the value "0x34" as if it were a pointer.

> namei(e91a5a44,c0207d5a,c749458c,0,c658e780) at namei+0x134

Called from here.

Debug:

        1)      Make sure that the kernel that has the fault was
                created with "config -g", so that there is a debug
                version of it lying around in the build directory.

        2)      Make sure that the kernel you installed is the
                stripped version of the debug kernel (there are two
                kernels created as a result of "config -g"; one is
                "kernel.debug" (the debug version) and the other is
                "kernel" (the stripped version)).

        3)      If #1 and #2 are not true, then make them true, and
                repeat the problem.

        4)      Boot a kernel that doesn't crash instead, so that you
                can run the debugger.

        5)      Go to the build directory, and look at the faulting
                code to see where it gets the value "0x34" to pass in
                to _mtx_lock_flags(); this is the bogus value.  For
                example, if you had a debug kernel for the kernel that
                has the problem, and it was config'ed from i386 GENERIC,
                you would use the following sequence of commands:

                        cd /sys/i386/compile/GENERIC
                        gdb -k kernel.debug
                        list *(namei+0x134)

        6)      Change the code so the bogus value is no longer being
                passed.

        7)      Live happily ever after.


Note that, to me, this looks like a problem with a dereference of a
"current" process which is not really current, as a result of a
wakeup occurring in an interrupt handler for an outstanding request
which was satisfied by the interrupt handler.

Note:   Under no circumstances should a page 0 address be passed
        around to anyone, since page zero is typically unmapped in
        order to trigger NULL pointer dereference faults and/or
        structure member reference faults for structure elements
        (at least in the initial 4K: range 0x00000000-0x00000fff)
        when a structure pointer itself is NULL.

        IMO, the most likely cause is that you have a null structure
        pointer, and the element at offset 0x34 into the structure is
        being referenced out of it, without checking that the pointer
        is not NULL, and the most likely culprit is a proc/kse/thread
        type structure that's not guaranteed to be valid in interrupt
        context.

        Probably, the scheduler is switching directly from interrupt
        of a process context "Q" to a wakeup of the same process "Q",
        without restoring a register value that should normally be
        restored following an interrupt.  I have no idea which of the
        schedulers you are using, so I have no idea if this should be
        an expected omission; my best guess is you are using the new
        one, though, because this is an unlikely problem with the old
        one, if it's really a scheduler wakeup problem.

> namei(e91a5a44,c0207d5a,c749458c,0,c658e780) at namei+0x134
                                   ^
                                   |
> vn_open_cred(e91a5a44,e91a5a0c,0,c2195e80,0) at vn_open_cred+0x53c
                                 ^          ^
                                 |          |
  ...all three of these are also incredibly suspicious, at first sight...


Until you are willing to list out the code where the bogus value is
being passed to the function call, there's no way any of us are
going to be able to correlate your stack traceback to our own source
trees, in order to be able to help you, unless you are running a
tagged version (e.g. 5.0-RELEASE) with no modifications.

Just saying "the most recent current" or "I CVS'up'ed on xxx date" is
really useless to us, because CVS mirrors don't contain well known
information relative to a CVS'up date.  In many cases, we will need
you to check out (at least!) a fresh /sys source tree from the CVS
repository, using a date tag, if you are not running a -RELEASE
version.  Yes, this is a long-standing problem with the FreeBSD
project itself.

If you can do this, and repeat the problem, then we can check out with
the same date tag, and determine what the code is supposed to be doing,
and what code you actually have, so we can narrow it down to setup, and
maybe fix it without having to rebuild an entire copy of the Internet,
from your machine's point of view.  8-).

Also, if your kernel configuration is different than the default, you
need to provide *DIFFS* -- DO NOT SEND THE WHOLE CONFIG FILE TO THE
LIST -- OR TO ME -- UNLESS YOU WANT TO BE IGNORED!

For a modified GENERIC config file from a checked out copy of the local
source tree, here is how you perform a context diff:

        cd /sys/i386/conf
        cvs diff -c GENERIC

-- Terry
