On Mon, 22 Aug 2022 16:49:34 GMT, Martin Doerr <mdo...@openjdk.org> wrote:
> My concern is that we may not notice implementation problems any more when
> retrying so often. Accidental cache line sharing should better get fixed in
> the tests if possible. Context switching or cache capacity limits may cause 1
> failure, not 100. What do you think?

I would not dare to say that 100 retries is going to be enough for everything; I don't even dare to say it is "too many" -- my test experiments ran with tens of thousands of retries before.

Here is a thing, though: on weak LL/SC hardware, we would enter the same kind of retry loop (but an unbounded one!) inside the strong CAS implementation. If the implementation is laggy, we would spin there a lot. In fact, I don't think anyone has actually measured how many retries we need for a strong CAS to succeed on such platforms. But as long as such a CAS eventually succeeds, it looks like a performance question, not a functional one. The tests affected by this PR test the functional part: a weak CAS _eventually_ succeeds after aggressive retries/backoffs.

The failing platforms I have (RISC-V) are remarkably slow, which makes it hard to dissect what might be going on there. We can dive in, for sure, but I reckon that would take weeks to resolve. It can wait, if you feel strongly about it.

-------------

PR: https://git.openjdk.org/jdk/pull/9889
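[Editorial sketch, not part of the original message.] The distinction drawn above -- a weak CAS may fail spuriously, while a strong CAS retries internally until it succeeds -- is what the bounded retry loops in the affected tests are about. A minimal sketch of such a loop, using the standard `AtomicInteger.weakCompareAndSetPlain` API (the class and helper names here are hypothetical, not taken from the PR):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class WeakCasRetry {
    /**
     * Retry a weak CAS a bounded number of times. A weak CAS is allowed
     * to fail spuriously (notably on LL/SC hardware), so a functional
     * test must retry aggressively before declaring failure. A strong
     * CAS would loop like this internally, but without a bound.
     */
    static boolean weakCasWithRetries(AtomicInteger cell,
                                      int expected, int update,
                                      int maxRetries) {
        for (int i = 0; i < maxRetries; i++) {
            if (cell.weakCompareAndSetPlain(expected, update)) {
                return true;  // eventually succeeded
            }
            Thread.onSpinWait();  // spurious failure or contention: back off
        }
        return false;  // gave up; on laggy hardware the bound may be too low
    }

    public static void main(String[] args) {
        AtomicInteger cell = new AtomicInteger(1);
        // On strong-CAS hardware (e.g. x86) the first attempt almost
        // always succeeds; on weak LL/SC hardware several may be needed.
        boolean ok = weakCasWithRetries(cell, 1, 2, 100);
        System.out.println(ok + " " + cell.get());
    }
}
```

Whether 100 iterations is enough is exactly the open question: it bounds the test's tolerance for spurious failures, whereas the strong CAS hides an unbounded version of the same loop.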