Re: 64bit startup

Sergey Bugaev Tue, 06 Jun 2023 13:23:19 -0700

On Mon, Jun 5, 2023 at 3:03 PM Sergey Bugaev <buga...@gmail.com> wrote:
> That is going to be much easier to debug than debootstrap, thank you!


Unfortunately I'm facing some troubles :|

For one thing you seem to have rebuilt/updated the packages, but not
the rootfs image, so now the debuginfo I download doesn't match the
binaries in the image. Please update the image too!

Also, GDB suddenly seems to have some sort of trouble with
interpreting the debuginfo files, specifically I'm trying ones from
the 'hurd-dbgsym' package (to be even more specific:
/usr/lib/debug/.build-id/bf/dd0c0525d0ca383bd842796063345a2dd0c001.debug
from that package, which corresponds to ext2fs.static -- but I've done
a quick check and other files seem to behave the same way too). GDB
loads regular symbols from them, but not the debuginfo, i.e. I can see
what function is at which address, but not map addresses to source
lines or access local variables or use types (tcbhead_t is the one I
currently need most). I don't know enough about GDB and DWARF to
diagnose exactly what's going on; readelf --debug-dump=info seems to
dump the debuginfo just fine.

Please try to reproduce this with your GDB (no Hurd system required),
and if you have changed something recently about how debug files are
generated, maybe that's what has broken it.

That being said, here's one crash I am (and have been) seeing a
lot: the crash at any sort of TCB access when fs_base suddenly turns
out to be equal to the address of _kret_popl_ds. This makes no sense
-- surely userspace would never set that, so it must be a gnumach bug.

I've got a little theory of how something like that could happen:

It is my understanding that "the PCB stack" (whatever that is), where
locore.S pushes the user's registers, and thread->pcb->iss are really
the exact same place, so pushing registers onto that stack is exactly
writing to the thread's i386_saved_state structure. The first four
members of struct i386_saved_state are unsigned long fsbase, gsbase,
gs, fs -- and being the first members of the struct means they have
the lowest addresses, i.e. are located at the top of the PCB stack.

locore.S actually skips pushing or popping these four members:

#define PUSH_FSGS               \
        subq    $32,%rsp

#define POP_FSGS                \
        addq    $32,%rsp

This is because we don't care about fs and gs, and the fsbase/gsbase
of a thread can only be changed by explicit thread_set_state calls,
not by the thread itself, so there is no need to rdmsr and push them:
the values are already saved in the PCB slots.
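
To make the layout argument concrete, here is a minimal C sketch. It
is not gnumach's real definition: struct saved_state_sketch, the
rest[] placeholder and FSGS_SKIP are hypothetical names, and I use
uint64_t on the assumption that unsigned long is 8 bytes on this
target. It only checks that four leading 8-byte members span the 32
bytes that PUSH_FSGS/POP_FSGS step over.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-in for struct i386_saved_state.
   Only the first four members are taken from the description above;
   rest[] is a placeholder for every other saved register. */
struct saved_state_sketch {
    uint64_t fsbase;   /* offset 0: lowest address, top of the PCB stack */
    uint64_t gsbase;
    uint64_t gs;
    uint64_t fs;
    uint64_t rest[4];  /* placeholder, not the real member list */
};

/* Number of bytes that "subq $32,%rsp" / "addq $32,%rsp" skip. */
#define FSGS_SKIP 32

/* Size of the region occupied by fsbase, gsbase, gs and fs. */
static size_t skipped_bytes(void)
{
    return offsetof(struct saved_state_sketch, rest);
}
```

If skipped_bytes() equals FSGS_SKIP, then nothing in the normal
entry/exit path ever writes the fsbase slot, which is why a stray
value there has to come from somewhere else.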

However, *something* goes wrong and the fsbase slot gets overwritten
with an unrelated value (_kret_popl_ds). The real %fs_base MSR keeps
the proper value -- until we context-switch away from the thread and
then back to it, at which point the bogus value gets loaded into
%fs_base and then the userland tries to use it and faults.
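
The delayed failure can be modelled in a few lines of C. This is a toy
model under my own assumptions, not kernel code: cpu_sketch,
thread_sketch and the sketch_* functions are hypothetical, and the
"context switch" does nothing except reload the modelled MSR from the
saved slot.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: one CPU register and one saved slot per thread. */
struct cpu_sketch    { uint64_t fs_base_msr; };
struct thread_sketch { uint64_t saved_fsbase; };

/* Stand-in for thread_set_state: updates the saved slot only. */
static void sketch_set_fsbase(struct thread_sketch *t, uint64_t value)
{
    t->saved_fsbase = value;
}

/* Stand-in for a context switch: the MSR is reloaded from the slot. */
static void sketch_switch_to(struct cpu_sketch *cpu,
                             const struct thread_sketch *t)
{
    cpu->fs_base_msr = t->saved_fsbase;
}

/* Replays the suspected sequence. Stores the MSR value seen right
   after the slot is clobbered in *before_switch and returns the MSR
   value seen after switching away and back. */
static uint64_t sketch_replay(uint64_t good, uint64_t bogus,
                              uint64_t *before_switch)
{
    struct cpu_sketch cpu = { 0 };
    struct thread_sketch a = { 0 }, b = { 0 };

    sketch_set_fsbase(&a, good);
    sketch_switch_to(&cpu, &a);        /* A runs with the proper value */

    a.saved_fsbase = bogus;            /* slot clobbered behind A's back */
    *before_switch = cpu.fs_base_msr;  /* MSR itself is still intact */

    sketch_switch_to(&cpu, &b);        /* switch away ... */
    sketch_switch_to(&cpu, &a);        /* ... and back: bogus is loaded */
    return cpu.fs_base_msr;
}
```

In this model the thread keeps working after the slot is overwritten
and only breaks after the round trip through another thread, which
matches the crash showing up at some later TCB access.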

I don't know nearly enough about x86 interrupts/traps to say, but
could it be that we get another interrupt/trap, while presumably
executing _kret_popl_ds, and that causes the faulting %rip to be
pushed onto the stack, but since we're at the PCB stack at that point
it clobbers the stored fsbase? That doesn't cause issues for all the
other registers because we have already popped their values off and
won't be accessing them anymore; we'll push the new values the next
time the thread enters the kernel -- though I guess it could show up
in thread_get_state if you do that without stopping the thread on an
SMP kernel.
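
Here is a rough C model of that clobbering idea, with several
assumptions that I have not checked against gnumach: the stack is an
array of 8-byte slots with fsbase at index 0, the CPU pushes a
five-word frame (ss, rsp, rflags, cs, rip, no error code) on the
current stack, and the stack pointer sits at slot index 5 when the
interrupt arrives. It ignores the 16-byte stack alignment the CPU
applies in 64-bit mode and any stack switching. All names are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SLOT_FSBASE  0   /* lowest address: first struct member */
#define SKETCH_SLOTS 16  /* arbitrary size of the modelled stack */

/* Push one word onto a downward-growing stack modelled as an array.
   sp is the index of the current top of stack; a push moves it down. */
static size_t sketch_push(uint64_t *stack, size_t sp, uint64_t value)
{
    assert(sp > 0);
    stack[sp - 1] = value;
    return sp - 1;
}

/* Model of the CPU pushing a frame without error code onto the
   current stack. rip is pushed last, so it ends up at the lowest
   address. Returns the new stack index. */
static size_t sketch_interrupt(uint64_t *stack, size_t sp, uint64_t rip)
{
    uint64_t old_sp = sp;

    sp = sketch_push(stack, sp, 0);       /* ss (value irrelevant here) */
    sp = sketch_push(stack, sp, old_sp);  /* rsp */
    sp = sketch_push(stack, sp, 0);       /* rflags */
    sp = sketch_push(stack, sp, 0);       /* cs */
    sp = sketch_push(stack, sp, rip);     /* rip of the interrupted code */
    return sp;
}
```

With the stack index at 5 the pushed rip lands in slot 0, i.e. on top
of fsbase; with a higher starting index the frame stays clear of it.
Whether the real stack pointer and frame size line up this way is
exactly the part I can't confirm.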

cc'ing Luca -- does what I'm saying make sense? could this happen? can
you reproduce %fs_base getting set to _kret_popl_ds?

Sergey
