Re: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v17]

Thomas Schatzl Wed, 12 Mar 2025 04:59:30 -0700

> Hi all,
> 
>   please review this change that implements (currently Draft) JEP: G1: 
> Improve Application Throughput with a More Efficient Write-Barrier.
> 
> The reason for posting this early is that this is a large change, and the JEP 
> process is already taking very long with no end in sight but we would like to 
> have this ready by JDK 25.
> 
> ### Current situation
> 
> With this change, G1 reduces its post-write barrier so that it much more 
> closely resembles Parallel GC's, as described in the JEP. The reason is that 
> G1 lags behind Parallel/Serial GC in throughput because of its larger barrier.
> 
> The main reason for the current barrier is how G1 implements concurrent 
> refinement:
> * G1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers 
> (dirty card queues - dcq) containing the location of dirtied cards. 
> Refinement threads pick up their contents to re-refine. The barrier needs to 
> enqueue card locations.
> * For correctness, dirty card updates require fine-grained synchronization 
> between mutator and refinement threads.
> * Finally, there is generic code to avoid dirtying cards altogether (filters), 
> to avoid executing the synchronization and the enqueuing as much as possible.
> 
> These tasks require the current barrier to look as follows for an assignment 
> `x.a = y` in pseudo code:
> 
> 
> // Filtering
> if (region(@x.a) == region(y)) goto done; // same region check
> if (y == null) goto done;     // null value check
> if (card(@x.a) == young_card) goto done;  // write to young gen check
> StoreLoad;                // synchronize
> if (card(@x.a) == dirty_card) goto done;
> 
> *card(@x.a) = dirty_card;
> 
> // Card tracking
> enqueue(card-address(@x.a)) into thread-local-dcq;
> if (thread-local-dcq is not full) goto done;
> 
> call runtime to move thread-local-dcq into dcqs
> 
> done:
> 
> 
> Overall, this post-write barrier alone is in the range of 40-50 total 
> instructions, compared to three or four(!) for Parallel and Serial GC.
> 
> The large inlined barrier not only increases the code footprint, but also 
> prevents some compiler optimizations like loop unrolling or inlining.
> 
> There are several papers showing that this barrier alone can decrease 
> throughput by 10-20% 
> ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is 
> corroborated by some benchmarks (see links).
> 
> The main idea for this change is to replace fine-grained synchronization 
> between refinement and mutator threads with coarse-grained synchronization 
> based on atomically switching card tables. Mutators only work on the 
> "primary" card 
> table, refinement threads on a se...
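
For readers who want to step through the quoted pseudo code, here is a minimal, single-threaded C++ sketch of the current-style barrier next to a Parallel-style unconditional card mark. It is only an illustration under stated assumptions: `ToyHeap`, the card and region sizes, the queue capacity, and all function names are hypothetical and are not HotSpot identifiers. Addresses are modeled as plain offsets, with 0 standing in for null.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Every name and constant here is a hypothetical stand-in, not HotSpot code.
constexpr std::size_t kCardShift     = 9;   // assumed 512-byte cards
constexpr std::size_t kRegionShift   = 20;  // assumed 1 MiB regions
constexpr std::size_t kQueueCapacity = 4;   // tiny, so the slow path is easy to reach

enum Card : std::uint8_t { kClean = 0, kDirty = 1, kYoung = 2 };

struct ToyHeap {
  std::vector<std::uint8_t> cards;                   // one byte per card
  std::vector<std::size_t> local_queue;              // stands in for the thread-local dcq
  std::vector<std::vector<std::size_t>> global_set;  // stands in for the dcqs

  explicit ToyHeap(std::size_t heap_bytes)
      : cards(heap_bytes >> kCardShift, kClean) {}
};

// Models the quoted barrier for `x.a = y`.
// Returns true if the card was newly dirtied and enqueued.
bool current_style_barrier(ToyHeap& h, std::size_t field_addr, std::size_t value_addr) {
  const std::size_t card = field_addr >> kCardShift;
  assert(card < h.cards.size());

  // Filtering
  if ((field_addr >> kRegionShift) == (value_addr >> kRegionShift)) return false;  // same region
  if (value_addr == 0) return false;                                               // null value
  if (h.cards[card] == kYoung) return false;                                       // write to young gen
  std::atomic_thread_fence(std::memory_order_seq_cst);                             // StoreLoad
  if (h.cards[card] == kDirty) return false;                                       // already dirty

  h.cards[card] = kDirty;

  // Card tracking
  h.local_queue.push_back(card);
  if (h.local_queue.size() < kQueueCapacity) return true;

  // "Call runtime": hand the full local queue over to the global set.
  h.global_set.push_back(std::move(h.local_queue));
  h.local_queue.clear();
  return true;
}

// Parallel-style barrier for comparison: shift the address and store, no filters.
void card_marking_barrier(ToyHeap& h, std::size_t field_addr) {
  const std::size_t card = field_addr >> kCardShift;
  assert(card < h.cards.size());
  h.cards[card] = kDirty;
}
```

The sketch only shows the control flow that makes the first barrier large; it does not model real concurrency between mutator and refinement threads, and it says nothing about the exact shape of the new G1 barrier in the PR.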


Thomas Schatzl has updated the pull request with a new target base due to a 
merge or a rebase. The incremental webrev excludes the unrelated changes 
brought in by the merge/rebase. The pull request contains 24 additional commits 
since the last revision:

 - Merge branch 'master' into 8342382-card-table-instead-of-dcq
 - * optimized RISCV gen_write_ref_array_post_barrier() implementation 
contributed by @RealFYang
 - * fix card table verification crashes: in the first refinement phase, when 
switching the global card tables, we need to re-check whether we are still in 
the same sweep epoch or not. It might have changed due to a GC interrupting 
acquiring the Heap_lock. Otherwise new threads will scribble on the refinement 
table.
   The cause was last-minute changes made before marking the PR ready for review.
   
     Testing: without the patch, the crash occurs fairly frequently (about 1 in 20)
   when continuously starting refinement. With the patch it no longer occurs.
 - * ayang review 3
     * comments
     * minor refactorings
 - * iwalulya review
     * renaming
     * fix some includes, forward declaration
 - * fix whitespace
   * additional whitespace between log tags
   * rename G1ConcurrentRefineWorkTask -> ...SweepTask to conform to the other 
similar rename
 - ayang review
     * renamings
     * refactorings
 - iwalulya review
     * comments for variables tracking to-collection-set and just dirtied cards 
after GC/refinement
     * predicate for determining whether the refinement has been disabled
     * some other typos/comment improvements
     * renamed _has_xxx_ref to _has_ref_to_xxx to be more consistent with naming
 - * ayang review - fix comment
 - * iwalulya review 2
     * G1ConcurrentRefineWorkState -> G1ConcurrentRefineSweepState
     * some additional documentation
 - ... and 14 more: https://git.openjdk.org/jdk/compare/f77fa17b...aec95051
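
The card table verification fix in the commit list above describes re-checking the sweep epoch after acquiring the Heap_lock, because a GC may have run while the thread was waiting for the lock. The general pattern can be sketched as follows. This is a toy model, not the code in the PR: `ToyRefinementState`, `simulate_gc`, `try_switch_tables`, and the idea that a GC simply increments a counter under the lock are all assumptions made for illustration.

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>

// Hypothetical toy model; names do not correspond to HotSpot identifiers.
struct ToyRefinementState {
  std::mutex heap_lock;                 // stands in for Heap_lock
  std::atomic<std::uint64_t> epoch{0};  // assumed to change whenever a GC runs
  int active_table = 0;                 // index of the table mutators write to

  // Models a GC that runs while another thread waits for the lock.
  void simulate_gc() {
    std::lock_guard<std::mutex> guard(heap_lock);
    epoch.fetch_add(1);
  }

  // Switches the tables only if no GC happened since `observed_epoch` was read.
  // Returns false when the epoch changed, in which case nothing is switched.
  bool try_switch_tables(std::uint64_t observed_epoch) {
    std::lock_guard<std::mutex> guard(heap_lock);
    if (epoch.load() != observed_epoch) {
      return false;  // stale view: a GC intervened, so do not switch
    }
    active_table = 1 - active_table;
    return true;
  }
};
```

A caller would read `epoch` before deciding to start refinement, pass that value to `try_switch_tables`, and restart its decision if the call returns false.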

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/23739/files
  - new: https://git.openjdk.org/jdk/pull/23739/files/758fac01..aec95051

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=16
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=15-16

  Stats: 78123 lines in 1539 files changed: 36243 ins; 29177 del; 12703 mod
  Patch: https://git.openjdk.org/jdk/pull/23739.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739

PR: https://git.openjdk.org/jdk/pull/23739
