On Wed, 12 Mar 2025 11:58:45 GMT, Thomas Schatzl <tscha...@openjdk.org> wrote:
>> Hi all,
>>
>>   please review this change that implements the (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier.
>>
>> The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight, but we would like to have this ready by JDK 25.
>>
>> ### Current situation
>>
>> With this change, G1 reduces the post-write barrier so that it much more closely resembles Parallel GC's, as described in the JEP. The reason is that G1 lags in throughput compared to Parallel/Serial GC due to its larger barrier.
>>
>> The main reason for the current barrier is how G1 implements concurrent refinement:
>> * G1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queue - dcq) containing the locations of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations.
>> * For correctness, dirty card updates require fine-grained synchronization between mutator and refinement threads.
>> * Finally, there is generic code (filters) to avoid dirtying cards altogether, so that the synchronization and the enqueuing are executed as rarely as possible.
>>
>> These tasks require the current barrier to look as follows for an assignment `x.a = y`, in pseudo code:
>>
>>     // Filtering
>>     if (region(@x.a) == region(y)) goto done;   // same region check
>>     if (y == null) goto done;                   // null value check
>>     if (card(@x.a) == young_card) goto done;    // write to young gen check
>>     StoreLoad;                                  // synchronize
>>     if (card(@x.a) == dirty_card) goto done;
>>
>>     *card(@x.a) = dirty
>>
>>     // Card tracking
>>     enqueue(card-address(@x.a)) into thread-local-dcq;
>>     if (thread-local-dcq is not full) goto done;
>>
>>     call runtime to move thread-local-dcq into dcqs
>>
>>     done:
>>
>> Overall, this post-write barrier alone is in the range of 40-50 instructions, compared to three or four(!) for Parallel and Serial GC.
>>
>> The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining.
>>
>> There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links).
>>
>> The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse-grained synchronization based on atomically switching c...
>
> Thomas Schatzl has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 24 additional commits since the last revision:
>
>  - Merge branch 'master' into 8342382-card-table-instead-of-dcq
>  - * optimized RISCV gen_write_ref_array_post_barrier() implementation, contributed by @RealFYang
>  - * fix card table verification crashes: in the first refinement phase, when switching the global card tables, we need to re-check whether we are still in the same sweep epoch or not. It might have changed due to a GC interrupting the acquisition of the Heap_lock. Otherwise new threads will scribble on the refinement table.
>    The cause is last-minute changes made before marking the PR ready for review.
>
>    Testing: without the patch, the crash occurs fairly frequently (1 in 20) when continuously starting refinement. It does not occur afterwards.
>  - * ayang review 3
>      * comments
>      * minor refactorings
>  - * iwalulya review
>      * renaming
>      * fix some includes, forward declaration
>  - * fix whitespace
>      * additional whitespace between log tags
>      * rename G1ConcurrentRefineWorkTask -> ...SweepTask to conform to the other similar rename
>  - ayang review
>      * renamings
>      * refactorings
>  - iwalulya review
>      * comments for variables tracking to-collection-set and just-dirtied cards after GC/refinement
>      * predicate for determining whether refinement has been disabled
>      * some other typo/comment improvements
>      * renamed _has_xxx_ref to _has_ref_to_xxx to be more consistent with naming
>  - * ayang review - fix comment
>  - * iwalulya review 2
>      * G1ConcurrentRefineWorkState -> G1ConcurrentRefineSweepState
>      * some additional documentation
>  - ... and 14 more: https://git.openjdk.org/jdk/compare/53a66058...aec95051

src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 217:

> 215:
> 216:   {
> 217:     SuspendibleThreadSetLeaver sts_leave;

Can you add a comment on why leaving the set is required? It's not obvious to me why. I'd expect a handshake to work out of the box...

src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 263:

> 261:
> 262:   SuspendibleThreadSetLeaver sts_leave;
> 263:   VMThread::execute(&op);

Can you elaborate on what synchronization this VM op is trying to achieve?

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1991489399
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1991382024
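For comparison with the quoted barrier pseudocode above: the "three or four(!)" instructions for Parallel and Serial GC mentioned in the description correspond, as far as I understand, to an unconditional card mark. A minimal sketch in the same pseudocode style (not taken from the JEP text; card-table-base and card-shift are illustrative names):

    // Parallel/Serial GC post-write barrier for x.a = y (sketch)
    *card(@x.a) = dirty;   // i.e. card-table-base[address-of(x.a) >> card-shift] = dirty

Per the quoted description, the intent of this change is for G1's inline post-write barrier to shrink toward this shape, with synchronization against refinement handled coarsely outside the barrier (the quoted commit messages refer to switching the global card tables) rather than per card in the barrier itself.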