On Mon, Aug 8, 2016 at 9:32 PM, Justus Winter <jus...@gnupg.org> wrote:

> Hello,
>
> "Brent W. Baccala" <cos...@freesoft.org> writes:
>
> > I don't have to swapoff to have "symptoms".  The kernel debugger normally
> > shows symbolic names, i.e:
> >
> > Stopped at  machine_idle+0xe:   leave
> > machine_idle(0,81a2c630,3806f64,0,9b448b38)+0xe
> > idle_thread_continue(9fcbdde0,81028b50,9c0c7fe4,0,9c3d5548)+0x2a
> >
> > Once I've got enough swap in use, though, it stops doing this.  Now I see:
> >
> > Stopped at  0x810000be: leave
> > 0x810000be(0,0,9fcc5990,0,9fb90b30)
> > 0x810293fa(9fcbdde0,81028b50,99526fe4,0,9c3d5548)
>
> Uh :( that is not good.  That sounds like a swap-related corruption in
> the kernel.
>
> > When I see a kernel page fault, it's always in strcmp()
>
> strcmp is used in the elf symbol lookup code, so that might explain the
> fault.

GDB on the kernel shows a seemingly corrupted ELF symbol table when
elf_db_search_symbol() is called.  Here's what the symbol table looks like
when the system boots:

(gdb) print self->start
$3 = (Elf32_Sym *) 0x804fb5ec
(gdb) print self->start[0]
$4 = {st_name = 0, st_value = 0, st_size = 0, st_info = 0 '\000', st_other = 0 '\000', st_shndx = 0}
(gdb) print self->start[1]
$5 = {st_name = 0, st_value = 2164260864, st_size = 0, st_info = 3 '\003', st_other = 0 '\000', st_shndx = 1}
(gdb) print self->start[2]
$6 = {st_name = 0, st_value = 2165125376, st_size = 0, st_info = 3 '\003', st_other = 0 '\000', st_shndx = 2}
(gdb) print self->start[3]
$7 = {st_name = 0, st_value = 2165262992, st_size = 0, st_info = 3 '\003', st_other = 0 '\000', st_shndx = 3}
(gdb) print self->start[4]
$8 = {st_name = 0, st_value = 2165395456, st_size = 0, st_info = 3 '\003', st_other = 0 '\000', st_shndx = 4}
(gdb) print self->start[5]
$9 = {st_name = 0, st_value = 2165452800, st_size = 0, st_info = 3 '\003', st_other = 0 '\000', st_shndx = 5}
(gdb) print self->start[6]
$10 = {st_name = 0, st_value = 0, st_size = 0, st_info = 3 '\003', st_other = 0 '\000', st_shndx = 6}

After I run a certain compile (just make, g++, ld), here's what it looks
like:

(gdb) print self->start
$15 = (Elf32_Sym *) 0x804fb5ec
(gdb) print self->start[0]
$16 = {st_name = 22, st_value = 0, st_size = 0, st_info = 13 '\r', st_other = 26 '\032', st_shndx = 0}
(gdb) print self->start[1]
$17 = {st_name = 0, st_value = 562210328, st_size = 562101944, st_info = 0 '\000', st_other = 0 '\000', st_shndx = 0}
(gdb) print self->start[2]
$18 = {st_name = 0, st_value = 0, st_size = 0, st_info = 0 '\000', st_other = 0 '\000', st_shndx = 0}
(gdb) print self->start[3]
$19 = {st_name = 0, st_value = 0, st_size = 0, st_info = 3 '\003', st_other = 0 '\000', st_shndx = 0}
(gdb) print self->start[4]
$20 = {st_name = 23, st_value = 0, st_size = 0, st_info = 13 '\r', st_other = 26 '\032', st_shndx = 0}
(gdb) print self->start[5]
$22 = {st_name = 0, st_value = 562210352, st_size = 562210400, st_info = 0 '\000', st_other = 0 '\000', st_shndx = 0}
(gdb) print self->start[6]
$23 = {st_name = 0, st_value = 0, st_size = 0, st_info = 0 '\000', st_other = 0 '\000', st_shndx = 0}
(gdb) print self->start[7]
$24 = {st_name = 0, st_value = 0, st_size = 0, st_info = 3 '\003', st_other = 0 '\000', st_shndx = 0}

Both GDB traces are with the kernel halted near the beginning of
elf_db_search_symbol(), called from the kernel debugger:

(gdb) where
#0  elf_db_search_symbol (stab=0x81127b00 <db_symtabs>, off=2164261054, strategy=2,
    diffp=0x81124ea0 <int_stack+3744>) at ../ddb/db_elf.c:159
#1  0x810132e7 in db_search_in_task_symbol (val=2164261054, strategy=2,
    offp=0x81124f10 <int_stack+3856>, task=0x0) at ../ddb/db_sym.c:354
#2  0x8101342a in db_search_task_symbol (val=2164261054, strategy=2,
    offp=0x81124f10 <int_stack+3856>, task=0x0) at ../ddb/db_sym.c:315
#3  0x810135dd in db_task_printsym (off=2164261054, strategy=2, task=0x0)
    at ../ddb/db_sym.c:458
#4  0x8100f377 in db_print_loc_and_inst (loc=2164261054, task=0x0)
    at ../ddb/db_examine.c:328
#5  0x8104fe9d in db_task_trap (type=-1, code=0, user_space=0)
    at ../ddb/db_trap.c:92
#6  0x81045d61 in kdb_kentry (int_regs=0x81124fe8 <int_stack+4072>)
    at ../i386/i386/db_interface.c:392
#7  0x810082ac in kdb_from_iret () at ../i386/i386/locore.S:864
#8  0x942dff6c in ?? ()
#9  0x81146610 in default_pset ()
#10 0x00000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Any chance the symbol table could have been swapped out?  Any idea how to
debug it?

> > I'm just learning Hurd.  Any ideas?
>
> Keep at it, the Hurd is an interesting system to learn from.  But you
> might want to start with a simpler problem.

I wouldn't mind a simpler problem, but I want to get my system cleanly
booting and shutting down!  I hate this kind of "recursion", but hopefully
the result will be a better system.

    agape
    brent