On Aug 16, 2023, at 4:14 PM, Kurt Miller <k...@intricatesoftware.com> wrote:
> 
>> On Aug 14, 2023, at 5:42 PM, Theo Buehler <t...@theobuehler.org> wrote:
>> 
>> On Mon, Aug 14, 2023 at 08:47:22PM +0000, Miod Vallat wrote:
>>> For what it's worth, I couldn't get your test to fail on a dual-cpu
>>> sun4u. Either it's a sun4v-specific issue or it needs many more cpus to
>>> trigger.
>> 
>> I can reproduce the segfault, but seemingly not the killed process, on a
>> 16-cpu LDOM on a T4-2:
>> 
>> cpu0 at mainbus0: SPARC-T4 (rev 0.0) @ 2847.862 MHz
>> 
>> Segmentation fault (core dumped)
>> 93
>> Segmentation fault (core dumped)
>> 1616
>> Segmentation fault (core dumped)
>> 4185
>> 
>> etc.
>> 
>> I don't seem to be able to reproduce on a 4-cpu M3000
>> 
>> cpu0 at core0: FJSV,SPARC64-VII (rev 10.1) @ 2750 MHz
>> cpu0: physical 64K instruction (64 b/l), 64K data (64 b/l), 5120K external (256 b/l)
> 
> While chatting with deraadt@ about this he pointed out my
> statement about the stack being clobbered didn’t make much
> sense. Looking closer at the core file data, the registers
> of the main thread don’t appear to be correct when the
> process segfaults.
> 
> In the test program each thread has its own mutex and 
> cond_var. The main thread should be utilizing one of the
> per-thread mutexes and cond_vars. The core files are 
> consistently crashing in the main thread with a back trace
> that looks like this:
> 
> Thread 1 (process 557006):
> #0  0x0000005e81739078 in _rthread_mutex_timedlock (mutexp=0x5f39af5d98, trywait=0, abs=0x0, timed=0) at rthread_mutex.c:163
> #1  0x0000005e8176efdc in _rthread_cond_timedwait (cond=<optimized out>, mutexp=0x5f39af5d98, abs=0xc) at rthread_cond.c:121
> 
> However, the mutexp address is not one of the per-thread
> mutexes. The address is not within the threads array at all:
> 
> (gdb) p &threads
> $1 = (thread_t (*)[40]) 0x5c61c02058 <threads>
> (gdb) p &threads[40]
> $2 = (thread_t *) 0x5c61c02698
> 
> mutexp is in the i0 register. That it does not contain a correct value
> suggests the registers are not always correct after transitioning
> back to userland. Perhaps there is some sort of coherency issue?

Just a reminder that this issue persists with the October 19th
sparc64 snapshot on my T4-1. 
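
For anyone following along, the address check from the quoted gdb
session can be sketched in C roughly as below. This is only an
illustration: the thread_t layout is an assumption (the real struct in
the test program may differ), and addr_in_range() and
is_thread_mutex() are hypothetical helper names, not part of the test
or of librthread. Only the array length of 40 and the addresses come
from the gdb output.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

#define NTHREADS 40	/* from "thread_t (*)[40]" in the gdb output */

/* Assumed layout; the real thread_t in the test program may differ. */
typedef struct {
	pthread_mutex_t	mutex;
	pthread_cond_t	cond;
} thread_t;

static thread_t threads[NTHREADS];

/* Return true if addr lies in the half-open range [base, base + len). */
static bool
addr_in_range(uint64_t addr, uint64_t base, uint64_t len)
{
	assert(len > 0);
	return addr >= base && addr - base < len;
}

/* Return true if p points somewhere inside the threads array. */
static bool
is_thread_mutex(const void *p)
{
	return addr_in_range((uint64_t)(uintptr_t)p,
	    (uint64_t)(uintptr_t)threads, sizeof(threads));
}
```

With the values from the core file, addr_in_range(0x5f39af5d98,
0x5c61c02058, 0x5c61c02698 - 0x5c61c02058) is false, which is the same
conclusion as comparing mutexp against &threads and &threads[40] by
hand.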
