On Fri, 4 Apr 2025 08:10:34 GMT, Thomas Schatzl <tscha...@openjdk.org> wrote:

>> Hi all,
>> 
>>   please review this change that implements (currently Draft) JEP: G1: 
>> Improve Application Throughput with a More Efficient Write-Barrier.
>> 
>> The reason for posting this early is that this is a large change, and the 
>> JEP process is already taking very long with no end in sight but we would 
>> like to have this ready by JDK 25.
>> 
>> ### Current situation
>> 
>> With this change, G1 will reduce the post write barrier to much more 
>> resemble Parallel GC's as described in the JEP. The reason is that G1 lacks 
>> in throughput compared to Parallel/Serial GC due to larger barrier.
>> 
>> The main reason for the current barrier is how g1 implements concurrent 
>> refinement:
>> * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of 
>> buffers (dirty card queues - dcq) containing the location of dirtied cards. 
>> Refinement threads pick up their contents to re-refine. The barrier needs to 
>> enqueue card locations.
>> * For correctness dirty card updates requires fine-grained synchronization 
>> between mutator and refinement threads,
>> * Finally there is generic code to avoid dirtying cards altogether 
>> (filters), to avoid executing the synchronization and the enqueuing as much 
>> as possible.
>> 
>> These tasks require the current barrier to look as follows for an assignment 
>> `x.a = y` in pseudo code:
>> 
>> 
>> // Filtering
>> if (region(@x.a) == region(y)) goto done; // same region check
>> if (y == null) goto done;     // null value check
>> if (card(@x.a) == young_card) goto done;  // write to young gen check
>> StoreLoad;                // synchronize
>> if (card(@x.a) == dirty_card) goto done;
>> 
>> *card(@x.a) = dirty
>> 
>> // Card tracking
>> enqueue(card-address(@x.a)) into thread-local-dcq;
>> if (thread-local-dcq is not full) goto done;
>> 
>> call runtime to move thread-local-dcq into dcqs
>> 
>> done:
>> 
>> 
>> Overall this post-write barrier alone is in the range of 40-50 total 
>> instructions, compared to three or four(!) for parallel and serial gc.
>> 
>> The large size of the inlined barrier not only has a large code footprint, 
>> but also prevents some compiler optimizations like loop unrolling or 
>> inlining.
>> 
>> There are several papers showing that this barrier alone can decrease 
>> throughput by 10-20% 
>> ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is 
>> corroborated by some benchmarks (see links).
>> 
>> The main idea for this change is to not use fine-grained synchronization 
>> between refinement and mutator threads, but coarse grained based on 
>> atomically switching c...
>
> Thomas Schatzl has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains 39 commits:
> 
>  - * missing file from merge
>  - Merge branch 'master' into 8342382-card-table-instead-of-dcq
>  - Merge branch 'master' into 8342382-card-table-instead-of-dcq
>  - Merge branch 'master' into 8342382-card-table-instead-of-dcq
>  - Merge branch 'master' into submit/8342382-card-table-instead-of-dcq
>  - * make young gen length revising independent of refinement thread
>      * use a service task
>      * both refinement control thread and young gen length revising use the 
> same infrastructure to get the number of available bytes and determine the 
> time to the next update
>  - * fix IR code generation tests that change due to barrier cost changes
>  - * factor out card table and refinement table merging into a single
>      method
>  - Merge branch 'master' into 8342382-card-table-instead-of-dcq3
>  - * obsolete G1UpdateBufferSize
>    
>    G1UpdateBufferSize has previously been used to size the refinement
>    buffers and impose a minimum limit on the number of cards per thread
>    that need to be pending before refinement starts.
>    
>    The former function is now obsolete with the removal of the dirty
>    card queues, the latter functionality has been taken over by the new
>    diagnostic option `G1PerThreadPendingCardThreshold`.
>    
>    I prefer to make this a diagnostic option is better than a product option
>    because it is something that is only necessary for some test cases to
>    produce some otherwise unwanted behavior (continuous refinement).
>    
>    CSR is pending.
>  - ... and 29 more: https://git.openjdk.org/jdk/compare/41d4a0d7...1c5a669f

src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 170:

> 168:   }
> 169:   return result;
> 170: }

I see in `G1ConcurrentRefineThread::do_refinement`:

      // The yielding may have completed the task, check.
      if (!state.is_in_progress()) {

I wonder if it's simpler to use `is_in_progress` consistently to detect whether 
we should restart sweep, instead of `_sweep_start_epoch`.

src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 349:

> 347:   }
> 348: 
> 349:   bool has_sweep_rt_work = is_in_progress() && _state == State::SweepRT;

Why `is_in_progress()`?

src/hotspot/share/gc/g1/g1ConcurrentRefineStats.hpp line 79:

> 77: 
> 78:   void inc_cards_scanned(size_t increment = 1) { _cards_scanned += 
> increment; }
> 79:   void inc_cards_clean(size_t increment = 1) { _cards_clean += increment; 
> }

The sole caller always passes in arg, so no need for default-arg-value.

src/hotspot/share/gc/g1/g1ConcurrentRefineStats.hpp line 87:

> 85:   void add_atomic(G1ConcurrentRefineStats* other);
> 86: 
> 87:   G1ConcurrentRefineStats& operator+=(const G1ConcurrentRefineStats& 
> other);

Seems that these operators are not used after this PR.

src/hotspot/share/gc/g1/g1ConcurrentRefineSweepTask.cpp line 83:

> 81:         break;
> 82:       }
> 83:       case G1RemSet::HasRefToOld : break; // Nothing special to do.

Why doesn't call `inc_cards_clean_again` in this case? The card is cleared 
also. (In fact, I don't get why this needs to a separate case from 
`NoInteresting`.)

src/hotspot/share/gc/g1/g1ConcurrentRefineSweepTask.cpp line 156:

> 154: 
> 155:       _refine_stats.inc_cards_scanned(claim.size());
> 156:       _refine_stats.inc_cards_clean(claim.size() - scanned);

I feel these two "scanned" mean sth diff; the local var should probably be sth 
like `num_dirty_cards`.

src/hotspot/share/gc/g1/g1ConcurrentRefineThread.cpp line 207:

> 205: 
> 206:   if (!interrupted_by_gc) {
> 207:     
> state.add_yield_duration(G1CollectedHeap::heap()->safepoint_duration() - 
> synchronize_duration_at_sweep_start);

I think this is recorded to later calculate actual refine-time, i.e. sweep-time 
- yield-time. However, why can't yield-duration be recorded in this 
refine-control-thread directly --  accumulation of `jlong yield_duration = 
os::elapsed_counter() - yield_start`. I feel that is easier to reason than 
going through g1heap.

src/hotspot/share/gc/g1/g1ReviseYoungListTargetLengthTask.cpp line 75:

> 73:   {
> 74:     MutexLocker x(G1ReviseYoungLength_lock, 
> Mutex::_no_safepoint_check_flag);
> 75:     G1Policy* p = g1h->policy();

Can probably use the existing `policy`.

src/hotspot/share/gc/g1/g1ReviseYoungListTargetLengthTask.cpp line 88:

> 86: }
> 87: 
> 88: 
> G1ReviseYoungLengthTargetLengthTask::G1ReviseYoungLengthTargetLengthTask(const
>  char* name) :

I wonder if the class name can be shortened a bit, sth like 
`G1ReviseYoungLengthTask`.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2033251162
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2033222407
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2033929489
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2033975054
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2033934399
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2033910496
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2032008908
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2029855278
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2029855435

Reply via email to