On Thu, 14 Sep 2023 08:00:56 GMT, Aleksey Shipilev <sh...@openjdk.org> wrote:
>>> and consume the usual amount of memory.
>>
>> And how much is that? And at what concurrency level will we not be able to run these tests in parallel without potentially impacting the way they run, i.e. running out of memory sooner than expected?
>>
>> I'm concerned that this set of PRs to remove exclusive testing is going to cause a headache for those of us who have to monitor and triage CI testing. If I see one of these tests fail after this change goes in, there is nothing to give me any hint as to what has changed - no git log for the test file will show me something was modified!
>
>> > and consume the usual amount of memory.
>>
>> And how much is that? And at what concurrency level will we not be able to run these tests in parallel without potentially impacting the way they run, i.e. running out of memory sooner than expected?
>
> They run at the standard heap sizes for the tests, driven by the `MaxRAMPercentage` set up by the build system. On my 18-core test servers, most of them run with ~700 MB RSS, sometimes peaking at ~1.1 GB. AFAICS, this is a common RSS for VM/GC tests. These tests eat Java heap / class memory and exit as soon as they catch an OOME or load all the classes. The extended parallelism might delay that a bit, but I don't see it manifesting in practice.
>
>> I'm concerned that this set of PRs to remove exclusive testing is going to cause a headache for those of us who have to monitor and triage CI testing. If I see one of these tests fail after this change goes in, there is nothing to give me any hint as to what has changed - no git log for the test file will show me something was modified!
>
> True. That's one of the reasons to avoid external test configs, whether it is `TEST.properties` near the tests or the settings in the global suite `TEST.ROOT`.
>
> There are two bonus points from a maintenance perspective:
>
> 1. (technical) Note that the current `exclusiveDirs` setting limits the _in-group_ parallelism.
> This means that there is a random chance that something else is running concurrently with these tests, if that test is outside of this test group. So it is not like we are deciding whether these tests should run in complete resource isolation from everything else or not -- they already are not isolated. Which means that if tests experience resource starvation, it would manifest pretty randomly, depending on what had been running in parallel. Unblocking the _in-group_ parallelism allows us to make these conditions manifest more reliably. Which, I argue, benefits test maintainability: if tests can fail due to resource starvation, they would do so more often than once in a blue moon. We verify that this is unlikely to happen by stress-testing multiple iterations of these tests.
>
> 2. (organizational) Due to these parallelism blockages, `tier4` is remarkably slow. It is >10x slower than `tier3`, for example, and it gets worse the more untapped parallelism there is on the machine. Which is why I see both ad-hoc developer and vendor testing pipelines run `tier4` less frequently than they run `tier{1,2,3}`. Making `tier4` more parallel, and thus faster to run, me...

> @shipilev thanks for the broader context, but what platforms and configurations are you actually testing on?

Mostly 16..32-core x86_64 and AArch64 EC2 instances, similar to where the bulk of our testing runs. Testing with fastdebug binaries, sometimes juggling the GC and JIT selections.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/15689#issuecomment-1720723033
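For context, the `exclusiveDirs` mechanism discussed in the thread corresponds to jtreg's `exclusiveAccess.dirs` setting in a suite's `TEST.ROOT`: tests under the listed directories are not run concurrently with other tests from the same test suite. A minimal sketch of such an entry (the directory names here are illustrative, not the actual JDK list):

```
# TEST.ROOT fragment (illustrative). Tests under these directories run
# with exclusive access within this suite; dropping a directory from
# the list re-enables in-group parallelism for its tests.
exclusiveAccess.dirs=gc/stress \
                     runtime/memory
```

Because this lives in a suite-level config file rather than in the tests themselves, removing an entry leaves no trace in the git log of the affected test files, which is the triage concern raised above.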