On Fri, 4 Apr 2025 08:10:34 GMT, Thomas Schatzl <tscha...@openjdk.org> wrote:
>> Hi all, >> >> please review this change that implements (currently Draft) JEP: G1: >> Improve Application Throughput with a More Efficient Write-Barrier. >> >> The reason for posting this early is that this is a large change, and the >> JEP process is already taking very long with no end in sight but we would >> like to have this ready by JDK 25. >> >> ### Current situation >> >> With this change, G1 will reduce the post write barrier to much more >> resemble Parallel GC's as described in the JEP. The reason is that G1 lacks >> in throughput compared to Parallel/Serial GC due to larger barrier. >> >> The main reason for the current barrier is how g1 implements concurrent >> refinement: >> * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of >> buffers (dirty card queues - dcq) containing the location of dirtied cards. >> Refinement threads pick up their contents to re-refine. The barrier needs to >> enqueue card locations. >> * For correctness dirty card updates requires fine-grained synchronization >> between mutator and refinement threads, >> * Finally there is generic code to avoid dirtying cards altogether >> (filters), to avoid executing the synchronization and the enqueuing as much >> as possible. >> >> These tasks require the current barrier to look as follows for an assignment >> `x.a = y` in pseudo code: >> >> >> // Filtering >> if (region(@x.a) == region(y)) goto done; // same region check >> if (y == null) goto done; // null value check >> if (card(@x.a) == young_card) goto done; // write to young gen check >> StoreLoad; // synchronize >> if (card(@x.a) == dirty_card) goto done; >> >> *card(@x.a) = dirty >> >> // Card tracking >> enqueue(card-address(@x.a)) into thread-local-dcq; >> if (thread-local-dcq is not full) goto done; >> >> call runtime to move thread-local-dcq into dcqs >> >> done: >> >> >> Overall this post-write barrier alone is in the range of 40-50 total >> instructions, compared to three or four(!) for parallel and serial gc. >> >> The large size of the inlined barrier not only has a large code footprint, >> but also prevents some compiler optimizations like loop unrolling or >> inlining. >> >> There are several papers showing that this barrier alone can decrease >> throughput by 10-20% >> ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is >> corroborated by some benchmarks (see links). >> >> The main idea for this change is to not use fine-grained synchronization >> between refinement and mutator threads, but coarse grained based on >> atomically switching c... > > Thomas Schatzl has updated the pull request with a new target base due to a > merge or a rebase. The pull request now contains 39 commits: > > - * missing file from merge > - Merge branch 'master' into 8342382-card-table-instead-of-dcq > - Merge branch 'master' into 8342382-card-table-instead-of-dcq > - Merge branch 'master' into 8342382-card-table-instead-of-dcq > - Merge branch 'master' into submit/8342382-card-table-instead-of-dcq > - * make young gen length revising independent of refinement thread > * use a service task > * both refinement control thread and young gen length revising use the > same infrastructure to get the number of available bytes and determine the > time to the next update > - * fix IR code generation tests that change due to barrier cost changes > - * factor out card table and refinement table merging into a single > method > - Merge branch 'master' into 8342382-card-table-instead-of-dcq3 > - * obsolete G1UpdateBufferSize > > G1UpdateBufferSize has previously been used to size the refinement > buffers and impose a minimum limit on the number of cards per thread > that need to be pending before refinement starts. > > The former function is now obsolete with the removal of the dirty > card queues, the latter functionality has been taken over by the new > diagnostic option `G1PerThreadPendingCardThreshold`. > > I prefer to make this a diagnostic option is better than a product option > because it is something that is only necessary for some test cases to > produce some otherwise unwanted behavior (continuous refinement). > > CSR is pending. > - ... and 29 more: https://git.openjdk.org/jdk/compare/41d4a0d7...1c5a669f Hi Thomas, great simplification and encouraging results! I reviewed the compiler-related parts of the changeset, including x64 and aarch64 changes. src/hotspot/cpu/aarch64/gc/g1/g1BarrierSetAssembler_aarch64.cpp line 246: > 244: __ cbz(new_val, done); > 245: } > 246: // Storing region crossing non-null, is card young? Suggestion: // Storing region crossing non-null. src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp line 101: > 99: } > 100: > 101: void > G1BarrierSetAssembler::gen_write_ref_array_post_barrier(MacroAssembler* masm, > DecoratorSet decorators, Have you measured the performance impact of inlining this assembly code instead of resorting to a runtime call as done before? Is it worth the maintenance cost (for every platform), risk of introducing bugs, etc.? src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp line 145: > 143: > 144: __ bind(is_clean_card); > 145: // Card was clean. Dirty card and go to next.. This code seems unreachable if `!UseCondCardMark`, meaning we only dirty cards here if `UseCondCardMark` is enabled. Is that intentional? src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp line 319: > 317: const Register thread, > 318: const Register tmp1, > 319: const Register tmp2, Since `tmp2` is not needed in the x64 post-barrier, I suggest not passing it around for this platform, for simplicity and also to make optimization opportunities more visible in the future. Here is my suggestion: https://github.com/robcasloz/jdk/commit/855ec8df4a641f8c491c5c09acea3ee434b7e230, feel free to merge if you agree. src/hotspot/share/gc/g1/c1/g1BarrierSetC1.cpp line 38: > 36: #include "c1/c1_LIRAssembler.hpp" > 37: #include "c1/c1_MacroAssembler.hpp" > 38: #endif // COMPILER1 I suggest removing the conditional compilation directives and grouping these includes together with the above `c1` ones. src/hotspot/share/gc/g1/c1/g1BarrierSetC1.cpp line 147: > 145: state->do_input(_thread); > 146: > 147: // Use temp registers to ensure these they use different registers. Suggestion: // Use temps to enforce different registers. src/hotspot/share/gc/g1/c2/g1BarrierSetC2.cpp line 307: > 305: + 6 // same region check: Uncompress (new_val) oop, xor, > shr, (cmp), jmp > 306: + 4 // new_val is null check > 307: + 4; // card not clean check. It probably does not affect the unrolling heuristics too much, but you may want to make the last cost component conditional on `UseCondCardMark`. src/hotspot/share/gc/g1/c2/g1BarrierSetC2.cpp line 396: > 394: bool needs_liveness_data(const MachNode* mach) const { > 395: return G1BarrierStubC2::needs_pre_barrier(mach) || > 396: G1BarrierStubC2::needs_post_barrier(mach); Suggestion: // Liveness data is only required to compute registers that must be // preserved across the runtime call in the pre-barrier stub. return G1BarrierStubC2::needs_pre_barrier(mach); src/hotspot/share/gc/g1/g1BarrierSet.hpp line 56: > 54: // > 55: // The refinement threads mark cards in the current collection set > specially on the > 56: // card table - this is fine wrt to synchronization with the mutator, > because at Suggestion: // card table - this is fine wrt synchronization with the mutator, because at test/hotspot/jtreg/compiler/gcbarriers/TestG1BarrierGeneration.java line 521: > 519: phase = CompilePhase.FINAL_CODE) > 520: @IR(counts = {IRNode.COUNTED_LOOP, "2"}, > 521: phase = CompilePhase.FINAL_CODE) I suggest to remove this extra IR check to avoid over-specifying the expected loop shape. For example, running this test with loop unrolling disabled (`-XX:LoopUnrollLimit=0`) would now fail because only one counted loop would be found. ------------- Changes requested by rcastanedalo (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23739#pullrequestreview-2753154117 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2035174209 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2035175921 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2035177738 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2035183250 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2035186980 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2035192666 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2035210464 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2035196251 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2035198219 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2035201056