On Thu, 14 Sep 2023 08:00:56 GMT, Aleksey Shipilev <sh...@openjdk.org> wrote:

>> > and consume the usual amount of memory.
>> 
>> And how much is that? And at what concurrency level will we not be able to 
>> run these tests in parallel without potentially impacting the way they run 
>> i.e. running out of memory sooner than expected?
> 
> They run at the standard heap sizes for the tests, driven by 
> `MaxRAMPercentage` set up by the build system. On my 18-core test servers, most of 
> them run with ~700 MB RSS, sometimes peaking at ~1.1G. AFAICS, this is a 
> common RSS for VM/GC tests. These tests eat Java heap / class memory and exit 
> as soon as they catch OOME or load all the classes. The extended parallelism 
> might delay that a bit, but I don't see this manifesting in practice. 
> 
>> I'm concerned that this set of PRs to remove exclusive testing is going to 
>> cause a headache for those of us who have to monitor and triage CI testing. 
>> If I see one of these tests fail after this change goes in, there is nothing 
>> to give me any hint as to what has changed - no git log for the test file 
>> will show me something was modified!
> 
> True. That's one of the reasons to avoid external test configs, whether it is 
> `TEST.properties` near the tests, or the settings in global suite `TEST.ROOT`.
> 
> There are two bonus points from a maintenance perspective:
> 
>  1. (technical) Note that the current `exclusiveDirs` limit the _in-group_ 
> parallelism. This means there is a random chance that something else is 
> running concurrently with these tests, if that test is outside of this 
> test group. So it is not like we are deciding whether these tests should run in 
> complete resource isolation from everything else or not -- they already are 
> not isolated. Which means that if tests experience resource starvation, it would 
> manifest pretty randomly, depending on what had been running in parallel. 
> Unblocking the _in-group_ parallelism allows us to make these conditions 
> manifest more reliably. Which, I argue, benefits test maintainability: if 
> tests can fail due to resource starvation, they would do so more often than 
> once in a blue moon. We verify that this is unlikely to happen by stress-testing 
> multiple iterations of these tests.
>  
>  2. (organizational) Due to these parallelism blockages, `tier4` is 
> remarkably slow. It is >10x slower than `tier3`, for example, and it gets 
> worse the more untapped parallelism there is on the machine. Which is why I 
> see that both ad-hoc developer and vendor testing pipelines do not run `tier4` as 
> frequently as they run `tier{1,2,3}`. Making `tier4` more parallel, and thus 
> faster to run, me...

> @shipilev thanks for the broader context, but what platforms and 
> configurations are you actually testing on?

Mostly 16..32-core x86_64 and AArch64 EC2 instances, similar to where the bulk 
of our testing runs. Testing with fastdebug binaries, sometimes juggling the GC 
and JIT selections.
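
As a rough illustration of the memory question, here is a back-of-the-envelope sketch using the RSS figures quoted above (~700 MB typical, ~1.1G peak). The class name `RssEstimate`, the job count of 18 (one job per core, which is not necessarily what the test harness picks), and the rounding of 1.1G to 1100 MB are all assumptions for illustration, not measurements:

```java
public class RssEstimate {
    // Upper bound on aggregate RSS (in MB) if `jobs` tests run at once
    // and each one sits at `perTestMb` at the same moment.
    static long aggregateMb(int jobs, long perTestMb) {
        return (long) jobs * perTestMb;
    }

    public static void main(String[] args) {
        int jobs = 18;          // assumption: one concurrent test per core
        long typicalMb = 700;   // "~700 MB RSS" from the message above
        long peakMb = 1100;     // "~1.1G" peak, rounded; an approximation

        System.out.println("typical: " + aggregateMb(jobs, typicalMb) + " MB");
        System.out.println("peak:    " + aggregateMb(jobs, peakMb) + " MB");
    }
}
```

The peak line is a pessimistic bound, since it assumes every test hits its peak simultaneously.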

-------------

PR Comment: https://git.openjdk.org/jdk/pull/15689#issuecomment-1720723033
