On Mon, 22 Aug 2022 13:36:43 GMT, Aleksey Shipilev <sh...@openjdk.org> wrote:

> > In general, I appreciate making these tests more resilient. However, I 
> > wonder about such large numbers for retry attempts. Smells like more than 
> > just sporadic failures. Are we sure there is no bug which causes more 
> > failures? Does it really make sense to have a weak implementation on 
> > platforms with such high failure rates?
> 
> I suspect we are dealing with the accidental cache line sharing, or context 
> switching, or cache capacity limits that "break" LL/SC weak implementations. 
> Backoff and more retries seem to help pull us out of this mess.
> 
> As the alternative, we can provide the whitelist of platforms where weak CAS 
> is guaranteed to succeed. (We need to dig through if, for example, AArch64 
> LSE atomics provide more resilient progress behavior.) That would, 
> unfortunately, stop verifying that some LL/SC implementations _ever_ succeed.

My concern is that, with so many retries, we may no longer notice 
implementation problems. Accidental cache line sharing should be fixed in the 
tests themselves, if possible. Context switching or cache capacity limits may 
cause 1 failure, not 100. What do you think?
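
To make the concern concrete, here is a minimal sketch of a bounded weak-CAS retry loop that reports how many attempts it needed, so a test could flag an unexpectedly high count instead of silently absorbing it. This is not the code from the PR: the class name `WeakCasRetry`, the method `weakCasAttempts`, and the attempt bound are hypothetical, and `AtomicInteger` merely stands in for whatever the tests actually exercise.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper, for illustration only.
public class WeakCasRetry {

    /**
     * Tries a weak CAS up to maxAttempts times.
     * Returns the number of attempts used on success, or -1 if the value
     * did not match 'expected' or the attempt budget ran out.
     */
    static int weakCasAttempts(AtomicInteger target, int expected, int update,
                               int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (target.weakCompareAndSetPlain(expected, update)) {
                return attempt; // caller can assert this stays small
            }
            if (target.get() != expected) {
                return -1; // genuine mismatch, not a spurious failure
            }
            Thread.onSpinWait(); // simple backoff hint between attempts
        }
        return -1; // budget exhausted: worth reporting, not hiding
    }

    public static void main(String[] args) {
        AtomicInteger value = new AtomicInteger(1);
        int attempts = weakCasAttempts(value, 1, 2, 100);
        System.out.println("attempts=" + attempts + " value=" + value.get());
    }
}
```

With the attempt count exposed like this, a test could keep a generous upper bound for robustness while still logging or failing when the count is far above 1, which is the case I would like us not to miss.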

-------------

PR: https://git.openjdk.org/jdk/pull/9889
