> Hi all, > > please review this change that implements (currently Draft) JEP: G1: > Improve Application Throughput with a More Efficient Write-Barrier. > > The reason for posting this early is that this is a large change, and the JEP > process is already taking very long with no end in sight but we would like to > have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble > Parallel GC's as described in the JEP. The reason is that G1 lacks in > throughput compared to Parallel/Serial GC due to larger barrier. > > The main reason for the current barrier is how g1 implements concurrent > refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers > (dirty card queues - dcq) containing the location of dirtied cards. > Refinement threads pick up their contents to re-refine. The barrier needs to > enqueue card locations. > * For correctness dirty card updates requires fine-grained synchronization > between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), > to avoid executing the synchronization and the enqueuing as much as possible. > > These tasks require the current barrier to look as follows for an assignment > `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total > instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, > but also prevents some compiler optimizations like loop unrolling or inlining. > > There are several papers showing that this barrier alone can decrease > throughput by 10-20% > ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is > corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization > between refinement and mutator threads, but coarse grained based on > atomically switching card tables. Mutators only work on the "primary" card > table, refinement threads on a se...
Thomas Schatzl has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 24 additional commits since the last revision: - Merge branch 'master' into 8342382-card-table-instead-of-dcq - * optimized RISCV gen_write_ref_array_post_barrier() implementation contributed by @RealFYang - * fix card table verification crashes: in the first refinement phase, when switching the global card tables, we need to re-check whether we are still in the same sweep epoch or not. It might have changed due to a GC interrupting acquiring the Heap_lock. Otherwise new threads will scribble on the refinement table. Cause are last-minute changes before making the PR ready to review. Testing: without the patch, occurs fairly frequently when continuously (1 in 20) starting refinement. Does not afterward. - * ayang review 3 * comments * minor refactorings - * iwalulya review * renaming * fix some includes, forward declaration - * fix whitespace * additional whitespace between log tags * rename G1ConcurrentRefineWorkTask -> ...SweepTask to conform to the other similar rename - ayang review * renamings * refactorings - iwalulya review * comments for variables tracking to-collection-set and just dirtied cards after GC/refinement * predicate for determining whether the refinement has been disabled * some other typos/comment improvements * renamed _has_xxx_ref to _has_ref_to_xxx to be more consistent with naming - * ayang review - fix comment - * iwalulya review 2 * G1ConcurrentRefineWorkState -> G1ConcurrentRefineSweepState * some additional documentation - ... and 14 more: https://git.openjdk.org/jdk/compare/f77fa17b...aec95051 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23739/files - new: https://git.openjdk.org/jdk/pull/23739/files/758fac01..aec95051 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=16 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=15-16 Stats: 78123 lines in 1539 files changed: 36243 ins; 29177 del; 12703 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739