On Aug 16, 2023, at 4:14 PM, Kurt Miller <k...@intricatesoftware.com> wrote:
>
>> On Aug 14, 2023, at 5:42 PM, Theo Buehler <t...@theobuehler.org> wrote:
>>
>> On Mon, Aug 14, 2023 at 08:47:22PM +0000, Miod Vallat wrote:
>>> For what it's worth, I couldn't get your test to fail on a dual-cpu
>>> sun4u. Either it's a sun4v-specific issue or it needs many more cpus to
>>> trigger.
>>
>> I can reproduce the segfault, but seemingly not the killed process on
>> 16-cpu LDOM on a T4-2:
>>
>> cpu0 at mainbus0: SPARC-T4 (rev 0.0) @ 2847.862 MHz
>>
>> Segmentation fault (core dumped)
>> 93
>> Segmentation fault (core dumped)
>> 1616
>> Segmentation fault (core dumped)
>> 4185
>>
>> etc.
>>
>> I don't seem to be able to reproduce on a 4-cpu M3000
>>
>> cpu0 at core0: FJSV,SPARC64-VII (rev 10.1) @ 2750 MHz
>> cpu0: physical 64K instruction (64 b/l), 64K data (64 b/l), 5120K external
>> (256 b/l)
>
> While chatting with deraadt@ about this he pointed out my
> statement about the stack being clobbered didn’t make much
> sense. Looking closer at the core file data it appears that
> the registers of the main thread are not correct when the
> process segfaults.
>
> In the test program each thread has its own mutex and
> cond_var. The main thread should be utilizing one of the
> per-thread mutexes and cond_vars. The core files are
> consistently crashing in the main thread with a back trace
> that looks like this:
>
> Thread 1 (process 557006):
> #0 0x0000005e81739078 in _rthread_mutex_timedlock (mutexp=0x5f39af5d98,
> trywait=0, abs=0x0, timed=0) at rthread_mutex.c:163
> #1 0x0000005e8176efdc in _rthread_cond_timedwait (cond=<optimized out>,
> mutexp=0x5f39af5d98, abs=0xc) at rthread_cond.c:121
>
> However, the mutexp address is not one of the per-thread
> mutexes. The address is not within the threads array at all:
>
> (gdb) p &threads
> $1 = (thread_t (*)[40]) 0x5c61c02058 <threads>
> (gdb) p &threads[40]
> $2 = (thread_t *) 0x5c61c02698
>
> mutexp is in the i0 register. That it does not contain a correct
> value suggests the registers are not always correct after
> transitioning back to user land. Perhaps there is some sort of
> coherency issue?

Just a reminder that this issue persists with the October 19th sparc64 snapshot on my T4-1.