On Wed, 12 Mar 2025 11:58:45 GMT, Thomas Schatzl <tscha...@openjdk.org> wrote:
>> Hi all,
>>
>>   please review this change that implements the (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier.
>>
>> The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight, but we would like to have this ready by JDK 25.
>>
>> ### Current situation
>>
>> With this change, G1 reduces the post-write barrier so that it much more closely resembles Parallel GC's, as described in the JEP. The reason is that G1 lags in throughput compared to Parallel/Serial GC due to its larger barrier.
>>
>> The main reason for the current barrier is how G1 implements concurrent refinement:
>> * G1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queue - dcq) containing the locations of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations.
>> * For correctness, dirty card updates require fine-grained synchronization between mutator and refinement threads.
>> * Finally, there is generic code (filters) to avoid dirtying cards altogether, so that the synchronization and the enqueuing are executed as rarely as possible.
>>
>> These tasks require the current barrier to look as follows for an assignment `x.a = y`, in pseudo code:
>>
>>     // Filtering
>>     if (region(@x.a) == region(y)) goto done;   // same region check
>>     if (y == null) goto done;                   // null value check
>>     if (card(@x.a) == young_card) goto done;    // write to young gen check
>>     StoreLoad;                                  // synchronize
>>     if (card(@x.a) == dirty_card) goto done;
>>
>>     *card(@x.a) = dirty
>>
>>     // Card tracking
>>     enqueue(card-address(@x.a)) into thread-local-dcq;
>>     if (thread-local-dcq is not full) goto done;
>>
>>     call runtime to move thread-local-dcq into dcqs
>>
>>     done:
>>
>> Overall, this post-write barrier alone is in the range of 40-50 instructions, compared to three or four(!) for Parallel and Serial GC.
>>
>> The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining.
>>
>> There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links).
>>
>> The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse-grained synchronization based on atomically switching c...
>
> Thomas Schatzl has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 24 additional commits since the last revision:
>
>  - Merge branch 'master' into 8342382-card-table-instead-of-dcq
>  - * optimized RISCV gen_write_ref_array_post_barrier() implementation, contributed by @RealFYang
>  - * fix card table verification crashes: in the first refinement phase, when switching the global card tables, we need to re-check whether we are still in the same sweep epoch or not. It might have changed due to a GC interrupting the acquisition of the Heap_lock. Otherwise new threads will scribble on the refinement table.
>    The cause is last-minute changes made before marking the PR ready for review.
>
>    Testing: without the patch, the crash occurs fairly frequently (1 in 20) when continuously starting refinement. It does not occur afterwards.
>  - * ayang review 3
>      * comments
>      * minor refactorings
>  - * iwalulya review
>      * renaming
>      * fix some includes, forward declaration
>  - * fix whitespace
>      * additional whitespace between log tags
>      * rename G1ConcurrentRefineWorkTask -> ...SweepTask to conform to the other similar rename
>  - ayang review
>      * renamings
>      * refactorings
>  - iwalulya review
>      * comments for variables tracking to-collection-set and just-dirtied cards after GC/refinement
>      * predicate for determining whether refinement has been disabled
>      * some other typo/comment improvements
>      * renamed _has_xxx_ref to _has_ref_to_xxx to be more consistent with naming
>  - * ayang review - fix comment
>  - * iwalulya review 2
>      * G1ConcurrentRefineWorkState -> G1ConcurrentRefineSweepState
>      * some additional documentation
>  - ... and 14 more: https://git.openjdk.org/jdk/compare/53a66058...aec95051

src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 217:

> 215:
> 216:   {
> 217:     SuspendibleThreadSetLeaver sts_leave;

Can you add a comment on why leaving the set is required? It's not obvious to me why. I'd expect a handshake to work out of the box...

src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 263:

> 261:
> 262:   SuspendibleThreadSetLeaver sts_leave;
> 263:   VMThread::execute(&op);

Can you elaborate on what synchronization this VM op is trying to achieve?

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1991489399
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1991382024
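For comparison with the quoted barrier pseudocode above: the "three or four(!)" instructions for Parallel and Serial GC mentioned in the description correspond, as far as I understand, to an unconditional card mark. A minimal sketch in the same pseudocode style (not taken from the JEP text; card-table-base and card-shift are illustrative names):

    // Parallel/Serial GC post-write barrier for x.a = y (sketch)
    *card(@x.a) = dirty;   // i.e. card-table-base[address-of(x.a) >> card-shift] = dirty

Per the quoted description, the intent of this change is for G1's inline post-write barrier to shrink toward this shape, with synchronization against refinement handled coarsely outside the barrier (the quoted commit messages refer to switching the global card tables) rather than per card in the barrier itself.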