On Mon, Jun 5, 2023 at 3:03 PM Sergey Bugaev <buga...@gmail.com> wrote:
> That is going to be much easier to debug than debootstrap, thank you!

Unfortunately I'm facing some trouble :|

For one thing, you seem to have rebuilt/updated the packages, but not the rootfs image, so now the debuginfo I download doesn't match the binaries in the image. Please update the image too!

Also, gdb suddenly seems to have some sort of trouble interpreting the debuginfo files. Specifically, I'm trying the ones from the 'hurd-dbsym' package (to be even more specific: /usr/lib/debug/.build-id/bf/dd0c0525d0ca383bd842796063345a2dd0c001.debug from that package, which corresponds to ext2fs.static -- but I've done a quick check and other files seem to behave the same way too). GDB loads regular symbols from them, but not the debuginfo, i.e. I can see what function is at which address, but I cannot map addresses to source lines, access local variables, or use types (tcbhead_t is the one I currently need most). I don't know enough about GDB and DWARF to diagnose exactly what's going on; readelf --debug-dump=info seems to dump the debuginfo just fine. Please try to reproduce this with your GDB (no Hurd system required), and if you have changed something recently about how debug files are generated, maybe that's what has broken it.

So, that all being said, here's one crash I am (and have been) seeing a lot: a crash at any sort of TCB access, when fs_base suddenly turns out to be equal to the address of _kret_popl_ds. This makes no sense -- surely userspace would never set that, so it must be a gnumach bug.

I've got a little theory of how something like that could happen:

It is my understanding that "the PCB stack" (whatever that is), where locore.S pushes the user's registers, and thread->pcb->iss are really the exact same place; pushing registers onto that stack is exactly writing to the thread's i386_saved_state structure. The first four members of struct i386_saved_state are unsigned long fsbase, gsbase, gs, fs -- and being the first members of the struct means they have the lowest addresses, i.e. they are located at the top of the PCB stack.
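
To make that layout concrete, here is a minimal sketch in C. It is a simplified model and not the real gnumach definition: only the four leading member names come from the description above, while the `rest` placeholder, the `uint64_t` type (standing in for an 8-byte `unsigned long`), and both helper functions are assumptions made up for illustration. It shows that the four leading slots occupy the lowest 32 bytes, and what a single 8-byte push does when the stack pointer is still inside that area. A real interrupt pushes a multi-word frame, which this does not attempt to model.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified model: only the four leading members come from the text;
   `rest` is a hypothetical placeholder for the remaining saved registers. */
struct saved_state_model {
    uint64_t fsbase;
    uint64_t gsbase;
    uint64_t gs;
    uint64_t fs;
    uint64_t rest[4];
};

/* Size of the four leading slots, i.e. the area at the lowest addresses. */
static size_t leading_slots_size(void)
{
    return offsetof(struct saved_state_model, rest);
}

/* Model one 8-byte push with the stack pointer at `sp_offset` bytes from
   the start of the structure: move down by 8, then store the value.
   Returns 1 if the stored fsbase value changed, 0 otherwise. */
static int push_changes_fsbase(size_t sp_offset, uint64_t pushed)
{
    struct saved_state_model st;
    const uint64_t original = 0x1000; /* arbitrary stand-in value */

    /* The push must land inside the modelled structure. */
    assert(sp_offset >= sizeof(uint64_t) && sp_offset <= sizeof st);

    memset(&st, 0, sizeof st);
    st.fsbase = original;

    unsigned char *sp = (unsigned char *)&st + sp_offset;
    sp -= sizeof pushed;                 /* stack grows towards lower addresses */
    memcpy(sp, &pushed, sizeof pushed);  /* the store performed by the push */

    return st.fsbase != original;
}
```

In this model, `push_changes_fsbase(8, value)` reports an overwrite of the fsbase slot, while the same push at offset 32 lands on the fs slot and leaves fsbase alone.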

locore.S actually skips pushing or popping these four members:

    #define PUSH_FSGS \
        subq $32,%rsp

    #define POP_FSGS \
        addq $32,%rsp

This is because fs and gs we don't care about, and fsbase/gsbase of a thread state can only be changed by explicit thread_set_state calls and not by the thread itself, so, no need to rdmsr and push it, since the value is already saved in the PCB slot.

However, *something* goes wrong and the fsbase slot gets overwritten with an unrelated value (_kret_popl_ds). The real %fs_base MSR keeps the proper value -- until we context-switch away from the thread and then back to it, at which point the bogus value gets loaded into %fs_base and then the userland tries to use it and faults.

I don't know nearly enough about x86 interrupts/traps to say, but could it be that we get another interrupt/trap, while presumably executing _kret_popl_ds, and that causes the faulting %rip to be pushed onto the stack, but since we're at the PCB stack at that point it clobbers the stored fsbase? That doesn't cause issues for all the other registers because we have already popped their values off and won't be accessing them anymore; we'll push the new values the next time the thread enters the kernel -- though I guess it could show up in thread_get_state if you do that without stopping the thread on an SMP kernel.

cc'ing Luca -- does what I'm saying make sense? Could this happen? Can you reproduce %fs_base getting set to _kret_popl_ds?

Sergey