Re: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v30]

Roberto Castañeda Lozano Wed, 09 Apr 2025 05:04:27 -0700

On Fri, 4 Apr 2025 08:10:34 GMT, Thomas Schatzl <[email protected]> wrote:


>> Hi all,
>> 
>>   please review this change that implements (currently Draft) JEP: G1: 
>> Improve Application Throughput with a More Efficient Write-Barrier.
>> 
>> The reason for posting this early is that this is a large change, and the 
>> JEP process is already taking very long with no end in sight but we would 
>> like to have this ready by JDK 25.
>> 
>> ### Current situation
>> 
>> With this change, G1 will reduce the post write barrier to much more 
>> resemble Parallel GC's as described in the JEP. The reason is that G1 lacks 
>> in throughput compared to Parallel/Serial GC due to larger barrier.
>> 
>> The main reason for the current barrier is how g1 implements concurrent 
>> refinement:
>> * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of 
>> buffers (dirty card queues - dcq) containing the location of dirtied cards. 
>> Refinement threads pick up their contents to re-refine. The barrier needs to 
>> enqueue card locations.
>> * For correctness dirty card updates requires fine-grained synchronization 
>> between mutator and refinement threads,
>> * Finally there is generic code to avoid dirtying cards altogether 
>> (filters), to avoid executing the synchronization and the enqueuing as much 
>> as possible.
>> 
>> These tasks require the current barrier to look as follows for an assignment 
>> `x.a = y` in pseudo code:
>> 
>> 
>> // Filtering
>> if (region(@x.a) == region(y)) goto done; // same region check
>> if (y == null) goto done;     // null value check
>> if (card(@x.a) == young_card) goto done;  // write to young gen check
>> StoreLoad;                // synchronize
>> if (card(@x.a) == dirty_card) goto done;
>> 
>> *card(@x.a) = dirty
>> 
>> // Card tracking
>> enqueue(card-address(@x.a)) into thread-local-dcq;
>> if (thread-local-dcq is not full) goto done;
>> 
>> call runtime to move thread-local-dcq into dcqs
>> 
>> done:
>> 
>> 
>> Overall this post-write barrier alone is in the range of 40-50 total 
>> instructions, compared to three or four(!) for parallel and serial gc.
>> 
>> The large size of the inlined barrier not only has a large code footprint, 
>> but also prevents some compiler optimizations like loop unrolling or 
>> inlining.
>> 
>> There are several papers showing that this barrier alone can decrease 
>> throughput by 10-20% 
>> ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is 
>> corroborated by some benchmarks (see links).
>> 
>> The main idea for this change is to not use fine-grained synchronization 
>> between refinement and mutator threads, but coarse grained based on 
>> atomically switching c...
>
> Thomas Schatzl has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains 39 commits:
> 
>  - * missing file from merge
>  - Merge branch 'master' into 8342382-card-table-instead-of-dcq
>  - Merge branch 'master' into 8342382-card-table-instead-of-dcq
>  - Merge branch 'master' into 8342382-card-table-instead-of-dcq
>  - Merge branch 'master' into submit/8342382-card-table-instead-of-dcq
>  - * make young gen length revising independent of refinement thread
>      * use a service task
>      * both refinement control thread and young gen length revising use the 
> same infrastructure to get the number of available bytes and determine the 
> time to the next update
>  - * fix IR code generation tests that change due to barrier cost changes
>  - * factor out card table and refinement table merging into a single
>      method
>  - Merge branch 'master' into 8342382-card-table-instead-of-dcq3
>  - * obsolete G1UpdateBufferSize
>    
>    G1UpdateBufferSize has previously been used to size the refinement
>    buffers and impose a minimum limit on the number of cards per thread
>    that need to be pending before refinement starts.
>    
>    The former function is now obsolete with the removal of the dirty
>    card queues, the latter functionality has been taken over by the new
>    diagnostic option `G1PerThreadPendingCardThreshold`.
>    
>    I prefer to make this a diagnostic option is better than a product option
>    because it is something that is only necessary for some test cases to
>    produce some otherwise unwanted behavior (continuous refinement).
>    
>    CSR is pending.
>  - ... and 29 more: https://git.openjdk.org/jdk/compare/41d4a0d7...1c5a669f

Hi Thomas, great simplification and encouraging results! I reviewed the 
compiler-related parts of the changeset, including x64 and aarch64 changes.

src/hotspot/cpu/aarch64/gc/g1/g1BarrierSetAssembler_aarch64.cpp line 246:

> 244:     __ cbz(new_val, done);
> 245:   }
> 246:   // Storing region crossing non-null, is card young?

Suggestion:

  // Storing region crossing non-null.

src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp line 101:

> 99: }
> 100: 
> 101: void 
> G1BarrierSetAssembler::gen_write_ref_array_post_barrier(MacroAssembler* masm, 
> DecoratorSet decorators,

Have you measured the performance impact of inlining this assembly code instead 
of resorting to a runtime call as done before? Is it worth the maintenance cost 
(for every platform), risk of introducing bugs, etc.?

src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp line 145:

> 143: 
> 144:   __ bind(is_clean_card);
> 145:   // Card was clean. Dirty card and go to next..

This code seems unreachable if `!UseCondCardMark`, meaning we only dirty cards 
here if `UseCondCardMark` is enabled. Is that intentional?

src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp line 319:

> 317:                                             const Register thread,
> 318:                                             const Register tmp1,
> 319:                                             const Register tmp2,

Since `tmp2` is not needed in the x64 post-barrier, I suggest not passing it 
around for this platform, for simplicity and also to make optimization 
opportunities more visible in the future. Here is my suggestion: 
https://github.com/robcasloz/jdk/commit/855ec8df4a641f8c491c5c09acea3ee434b7e230,
 feel free to merge if you agree.

src/hotspot/share/gc/g1/c1/g1BarrierSetC1.cpp line 38:

> 36: #include "c1/c1_LIRAssembler.hpp"
> 37: #include "c1/c1_MacroAssembler.hpp"
> 38: #endif // COMPILER1

I suggest removing the conditional compilation directives and grouping these 
includes together with the above `c1` ones.

src/hotspot/share/gc/g1/c1/g1BarrierSetC1.cpp line 147:

> 145:     state->do_input(_thread);
> 146: 
> 147:     // Use temp registers to ensure these they use different registers.

Suggestion:

    // Use temps to enforce different registers.

src/hotspot/share/gc/g1/c2/g1BarrierSetC2.cpp line 307:

> 305:              + 6  // same region check: Uncompress (new_val) oop, xor, 
> shr, (cmp), jmp
> 306:              + 4  // new_val is null check
> 307:              + 4; // card not clean check.

It probably does not affect the unrolling heuristics too much, but you may want 
to make the last cost component conditional on `UseCondCardMark`.

src/hotspot/share/gc/g1/c2/g1BarrierSetC2.cpp line 396:

> 394:   bool needs_liveness_data(const MachNode* mach) const {
> 395:     return G1BarrierStubC2::needs_pre_barrier(mach) ||
> 396:            G1BarrierStubC2::needs_post_barrier(mach);

Suggestion:

    // Liveness data is only required to compute registers that must be
    // preserved across the runtime call in the pre-barrier stub.
    return G1BarrierStubC2::needs_pre_barrier(mach);

src/hotspot/share/gc/g1/g1BarrierSet.hpp line 56:

> 54: //
> 55: // The refinement threads mark cards in the current collection set 
> specially on the
> 56: // card table - this is fine wrt to synchronization with the mutator, 
> because at

Suggestion:

// card table - this is fine wrt synchronization with the mutator, because at

test/hotspot/jtreg/compiler/gcbarriers/TestG1BarrierGeneration.java line 521:

> 519:         phase = CompilePhase.FINAL_CODE)
> 520:     @IR(counts = {IRNode.COUNTED_LOOP, "2"},
> 521:         phase = CompilePhase.FINAL_CODE)

I suggest to remove this extra IR check to avoid over-specifying the expected 
loop shape. For example, running this test with loop unrolling disabled 
(`-XX:LoopUnrollLimit=0`) would now fail because only one counted loop would be 
found.

-------------

Changes requested by rcastanedalo (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/23739#pullrequestreview-2753154117
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2035174209
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2035175921
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2035177738
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2035183250
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2035186980
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2035192666
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2035210464
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2035196251
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2035198219
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2035201056

Re: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v30]

Reply via email to