> Hi all,
>
> please review this change that implements (currently Draft) JEP: G1:
> Improve Application Throughput with a More Efficient Write-Barrier.
>
> The reason for posting this early is that this is a large change, and the JEP
> process is already taking very long with no end in sight, but we would like to
> have this ready by JDK 25.
>
> ### Current situation
>
> With this change, G1 will reduce the post-write barrier to more closely
> resemble Parallel GC's, as described in the JEP. The reason is that G1 lags
> in throughput compared to Parallel/Serial GC due to its larger barrier.
>
> The main reason for the current barrier is how G1 implements concurrent
> refinement:
> * G1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers
>   (dirty card queues - dcq) containing the location of dirtied cards.
>   Refinement threads pick up their contents to re-refine. The barrier needs to
>   enqueue card locations.
> * For correctness, dirty card updates require fine-grained synchronization
>   between mutator and refinement threads.
> * Finally, there is generic code to avoid dirtying cards altogether (filters),
>   to avoid executing the synchronization and the enqueuing as much as possible.
>
> These tasks require the current barrier to look as follows for an assignment
> `x.a = y` in pseudo code:
>
>     // Filtering
>     if (region(@x.a) == region(y)) goto done; // same region check
>     if (y == null) goto done;                 // null value check
>     if (card(@x.a) == young_card) goto done;  // write to young gen check
>     StoreLoad;                                // synchronize
>     if (card(@x.a) == dirty_card) goto done;
>
>     *card(@x.a) = dirty
>
>     // Card tracking
>     enqueue(card-address(@x.a)) into thread-local-dcq;
>     if (thread-local-dcq is not full) goto done;
>
>     call runtime to move thread-local-dcq into dcqs
>
>     done:
>
> Overall this post-write barrier alone is in the range of 40-50 total
> instructions, compared to three or four(!) for Parallel and Serial GC.
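For readers who want to step through the quoted barrier, here is a minimal single-threaded Python model of the pseudo code above. It is only a sketch: the card and region sizes, the queue capacity, the use of address 0 for `null`, and all names (`BarrierModel`, `g1_post_barrier`, `simple_post_barrier`) are assumptions made for illustration and are not taken from the patch. The `StoreLoad` fence has no equivalent in a single-threaded model and appears only as a comment.

```python
# Illustrative model only; constants and names are assumptions, not HotSpot code.
CLEAN, DIRTY, YOUNG = 0, 1, 2
NULL = 0            # model a null reference as address 0
CARD_SHIFT = 9      # assumed 512-byte cards
REGION_SHIFT = 20   # assumed 1 MiB regions
DCQ_CAPACITY = 4    # tiny thread-local queue so the flush path is easy to hit


def card(addr):
    return addr >> CARD_SHIFT


def region(addr):
    return addr >> REGION_SHIFT


class BarrierModel:
    def __init__(self):
        self.cards = {}        # card index -> color, missing means CLEAN
        self.thread_dcq = []   # thread-local dirty card queue
        self.dcqs = []         # global set of completed queues

    def g1_post_barrier(self, field_addr, value_addr):
        """Follow the quoted barrier for x.a = y; return the step that ended it."""
        # Filtering
        if region(field_addr) == region(value_addr):
            return "same-region"
        if value_addr == NULL:
            return "null"
        c = card(field_addr)
        if self.cards.get(c, CLEAN) == YOUNG:
            return "young"
        # StoreLoad fence would go here in the real barrier.
        if self.cards.get(c, CLEAN) == DIRTY:
            return "already-dirty"
        self.cards[c] = DIRTY
        # Card tracking
        self.thread_dcq.append(c)
        if len(self.thread_dcq) < DCQ_CAPACITY:
            return "enqueued"
        # "call runtime": hand the full queue over to the global set
        self.dcqs.append(self.thread_dcq)
        self.thread_dcq = []
        return "flushed"

    def simple_post_barrier(self, field_addr):
        """Unconditional card mark, in the style of Parallel/Serial GC."""
        self.cards[card(field_addr)] = DIRTY
```

The contrast between the two methods is the point of the comparison in the quoted text: the second one has no filters, no fence and no queue.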
>
> The inlined barrier not only has a large code footprint, but also prevents
> some compiler optimizations like loop unrolling or inlining.
>
> There are several papers showing that this barrier alone can decrease
> throughput by 10-20%
> ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is
> corroborated by some benchmarks (see links).
>
> The main idea for this change is to not use fine-grained synchronization
> between refinement and mutator threads, but coarse-grained synchronization
> based on atomically switching card tables. Mutators only work on the
> "primary" card table, refinement threads on a se...
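The quoted description is cut off above, so the following Python sketch of the table-switching idea is an interpretation rather than the patch's design. It assumes that refinement works on a second table (the commit notes further down mention a "refinement table" and "switching the global card tables") and that a switch simply exchanges the two roles. The class and method names are hypothetical, and the model is single-threaded: it does not show how a real VM makes the switch atomic with respect to running mutators.

```python
# Illustrative model only; names and the exact role of the second table are assumptions.
CLEAN, DIRTY = 0, 1


class TwoCardTables:
    def __init__(self, num_cards):
        self.primary = bytearray(num_cards)     # the only table mutators write to
        self.refinement = bytearray(num_cards)  # the table refinement scans

    def mutator_mark(self, card_index):
        # Mutator side: an unconditional mark, no queueing and no fence.
        self.primary[card_index] = DIRTY

    def switch(self):
        # Coarse-grained synchronization point: exchange the roles of the tables.
        # In a real VM this step must be atomic as seen by mutator threads.
        self.primary, self.refinement = self.refinement, self.primary

    def refine(self):
        # Refinement side: collect and clear dirty cards in the table that
        # mutators no longer write to.
        dirty = [i for i, c in enumerate(self.refinement) if c == DIRTY]
        for i in dirty:
            self.refinement[i] = CLEAN
        return dirty
```

In this model, marks made after `switch()` land in the fresh primary table, so `refine()` never races with them. That separation is what would let the mutator-side barrier drop its per-write synchronization.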
Thomas Schatzl has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 28 commits:

 - Merge branch 'master' into 8342381-card-table-instead-of-dcq
 - * ayang review
   * re-add STS leaver for java thread handshake
 - * when aborting refinement during full collection, the global card table and the per-thread card table might not be in sync. Roll forward during abort of the refinement in these situations.
   * additional verification
   * added some missing ResourceMarks in asserts
   * added variant of ArrayJuggle2 that crashes fairly quickly without these changes
 - * ayang review
   * remove unnecessary STSleaver
   * some more documentation around to_collection_card card color
 - Merge branch 'master' into 8342382-card-table-instead-of-dcq
 - * optimized RISCV gen_write_ref_array_post_barrier() implementation contributed by @RealFYang
 - * fix card table verification crashes: in the first refinement phase, when switching the global card tables, we need to re-check whether we are still in the same sweep epoch or not. It might have changed due to a GC interrupting acquiring the Heap_lock. Otherwise new threads will scribble on the refinement table.
     Cause are last-minute changes before making the PR ready to review.

     Testing: without the patch, occurs fairly frequently when continuously (1 in 20) starting refinement. Does not afterward.
 - * ayang review 3
   * comments
   * minor refactorings
 - * iwalulya review
   * renaming
   * fix some includes, forward declaration
 - * fix whitespace
   * additional whitespace between log tags
   * rename G1ConcurrentRefineWorkTask -> ...SweepTask to conform to the other similar rename
 - ... and 18 more: https://git.openjdk.org/jdk/compare/7f428041...b0730176

-------------

Changes: https://git.openjdk.org/jdk/pull/23739/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=20
  Stats: 6761 lines in 99 files changed: 2368 ins; 3464 del; 929 mod
  Patch: https://git.openjdk.org/jdk/pull/23739.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739

PR: https://git.openjdk.org/jdk/pull/23739