On Sat, Jun 08, 2024 at 10:10:58PM -0400, Mouse wrote:
> First thing I'd look at is the userland instruction(s) around the crash
> point, maybe look at instructions starting at 0xbb610480 or something
> and then disassemble forwards looking for 0xbb610579. In particular,
> I'd be interested in whether it's a store instruction that failed or
> whether this happened during a syscall trap.
   0xbb610570 <__gettimeofday50>:      mov    $0x1a2,%eax
   0xbb610575 <__gettimeofday50+5>:    int    $0x80
   0xbb610577 <__gettimeofday50+7>:    jb     0xbb61057a <__gettimeofday50+10>
=> 0xbb610579 <__gettimeofday50+9>:    ret

> Are all the failures in __gettimeofday50? All in trap-to-the-kernel
> calls?

I have seen many crashes on system call returns. Another one on
__gettimeofday50:

   0xbb610570 <__gettimeofday50>:      mov    $0x1a2,%eax
   0xbb610575 <__gettimeofday50+5>:    int    $0x80
   0xbb610577 <__gettimeofday50+7>:    jb     0xbb61057a <__gettimeofday50+10>
   0xbb610579 <__gettimeofday50+9>:    ret
=> 0xbb61057a <__gettimeofday50+10>:   push   %ebx

Another one:

   0xbb610570 <__gettimeofday50>:      mov    $0x1a2,%eax
   0xbb610575 <__gettimeofday50+5>:    int    $0x80
=> 0xbb610577 <__gettimeofday50+7>:    jb     0xbb61057a <__gettimeofday50+10>
   0xbb610579 <__gettimeofday50+9>:    ret

At first I thought about a stack problem, but I think the last one
proves this is not the case: the instruction at the crash point
involves no memory access.

> You say "multiple machines"; are those multiple domUs on a single dom0,
> or are they spread across multiple underlying hardware machines?

It happens on multiple hardware machines, and it starts when the domU
is upgraded. I even tested moving a domU from one machine to another,
and the bug followed. Other netbsd-9 domUs on the same dom0 have no
problem, or at least it is rare enough that I did not notice for years.

> If the latter, how similar are those underlying machines?

Same model:
vcpu3: Intel(R) Xeon(R) CPU E3-1220 v6 @ 3.00GHz, id 0x906e9

--
Emmanuel Dreyfus
m...@netbsd.org