On Fri, 4 Apr 2025 08:10:34 GMT, Thomas Schatzl <tscha...@openjdk.org> wrote:
>> Hi all, >> >> please review this change that implements (currently Draft) JEP: G1: >> Improve Application Throughput with a More Efficient Write-Barrier. >> >> The reason for posting this early is that this is a large change, and the >> JEP process is already taking very long with no end in sight but we would >> like to have this ready by JDK 25. >> >> ### Current situation >> >> With this change, G1 will reduce the post write barrier to much more >> resemble Parallel GC's as described in the JEP. The reason is that G1 lacks >> in throughput compared to Parallel/Serial GC due to larger barrier. >> >> The main reason for the current barrier is how g1 implements concurrent >> refinement: >> * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of >> buffers (dirty card queues - dcq) containing the location of dirtied cards. >> Refinement threads pick up their contents to re-refine. The barrier needs to >> enqueue card locations. >> * For correctness dirty card updates requires fine-grained synchronization >> between mutator and refinement threads, >> * Finally there is generic code to avoid dirtying cards altogether >> (filters), to avoid executing the synchronization and the enqueuing as much >> as possible. >> >> These tasks require the current barrier to look as follows for an assignment >> `x.a = y` in pseudo code: >> >> >> // Filtering >> if (region(@x.a) == region(y)) goto done; // same region check >> if (y == null) goto done; // null value check >> if (card(@x.a) == young_card) goto done; // write to young gen check >> StoreLoad; // synchronize >> if (card(@x.a) == dirty_card) goto done; >> >> *card(@x.a) = dirty >> >> // Card tracking >> enqueue(card-address(@x.a)) into thread-local-dcq; >> if (thread-local-dcq is not full) goto done; >> >> call runtime to move thread-local-dcq into dcqs >> >> done: >> >> >> Overall this post-write barrier alone is in the range of 40-50 total >> instructions, compared to three or four(!) for parallel and serial gc. >> >> The large size of the inlined barrier not only has a large code footprint, >> but also prevents some compiler optimizations like loop unrolling or >> inlining. >> >> There are several papers showing that this barrier alone can decrease >> throughput by 10-20% >> ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is >> corroborated by some benchmarks (see links). >> >> The main idea for this change is to not use fine-grained synchronization >> between refinement and mutator threads, but coarse grained based on >> atomically switching c... > > Thomas Schatzl has updated the pull request with a new target base due to a > merge or a rebase. The pull request now contains 39 commits: > > - * missing file from merge > - Merge branch 'master' into 8342382-card-table-instead-of-dcq > - Merge branch 'master' into 8342382-card-table-instead-of-dcq > - Merge branch 'master' into 8342382-card-table-instead-of-dcq > - Merge branch 'master' into submit/8342382-card-table-instead-of-dcq > - * make young gen length revising independent of refinement thread > * use a service task > * both refinement control thread and young gen length revising use the > same infrastructure to get the number of available bytes and determine the > time to the next update > - * fix IR code generation tests that change due to barrier cost changes > - * factor out card table and refinement table merging into a single > method > - Merge branch 'master' into 8342382-card-table-instead-of-dcq3 > - * obsolete G1UpdateBufferSize > > G1UpdateBufferSize has previously been used to size the refinement > buffers and impose a minimum limit on the number of cards per thread > that need to be pending before refinement starts. > > The former function is now obsolete with the removal of the dirty > card queues, the latter functionality has been taken over by the new > diagnostic option `G1PerThreadPendingCardThreshold`. > > I prefer to make this a diagnostic option is better than a product option > because it is something that is only necessary for some test cases to > produce some otherwise unwanted behavior (continuous refinement). > > CSR is pending. > - ... and 29 more: https://git.openjdk.org/jdk/compare/41d4a0d7...1c5a669f src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 170: > 168: } > 169: return result; > 170: } I see in `G1ConcurrentRefineThread::do_refinement`: // The yielding may have completed the task, check. if (!state.is_in_progress()) { I wonder if it's simpler to use `is_in_progress` consistently to detect whether we should restart sweep, instead of `_sweep_start_epoch`. src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 349: > 347: } > 348: > 349: bool has_sweep_rt_work = is_in_progress() && _state == State::SweepRT; Why `is_in_progress()`? src/hotspot/share/gc/g1/g1ConcurrentRefineStats.hpp line 79: > 77: > 78: void inc_cards_scanned(size_t increment = 1) { _cards_scanned += > increment; } > 79: void inc_cards_clean(size_t increment = 1) { _cards_clean += increment; > } The sole caller always passes in arg, so no need for default-arg-value. src/hotspot/share/gc/g1/g1ConcurrentRefineStats.hpp line 87: > 85: void add_atomic(G1ConcurrentRefineStats* other); > 86: > 87: G1ConcurrentRefineStats& operator+=(const G1ConcurrentRefineStats& > other); Seems that these operators are not used after this PR. src/hotspot/share/gc/g1/g1ConcurrentRefineSweepTask.cpp line 83: > 81: break; > 82: } > 83: case G1RemSet::HasRefToOld : break; // Nothing special to do. Why doesn't call `inc_cards_clean_again` in this case? The card is cleared also. (In fact, I don't get why this needs to a separate case from `NoInteresting`.) src/hotspot/share/gc/g1/g1ConcurrentRefineSweepTask.cpp line 156: > 154: > 155: _refine_stats.inc_cards_scanned(claim.size()); > 156: _refine_stats.inc_cards_clean(claim.size() - scanned); I feel these two "scanned" mean sth diff; the local var should probably be sth like `num_dirty_cards`. src/hotspot/share/gc/g1/g1ConcurrentRefineThread.cpp line 207: > 205: > 206: if (!interrupted_by_gc) { > 207: > state.add_yield_duration(G1CollectedHeap::heap()->safepoint_duration() - > synchronize_duration_at_sweep_start); I think this is recorded to later calculate actual refine-time, i.e. sweep-time - yield-time. However, why can't yield-duration be recorded in this refine-control-thread directly -- accumulation of `jlong yield_duration = os::elapsed_counter() - yield_start`. I feel that is easier to reason than going through g1heap. src/hotspot/share/gc/g1/g1ReviseYoungListTargetLengthTask.cpp line 75: > 73: { > 74: MutexLocker x(G1ReviseYoungLength_lock, > Mutex::_no_safepoint_check_flag); > 75: G1Policy* p = g1h->policy(); Can probably use the existing `policy`. src/hotspot/share/gc/g1/g1ReviseYoungListTargetLengthTask.cpp line 88: > 86: } > 87: > 88: > G1ReviseYoungLengthTargetLengthTask::G1ReviseYoungLengthTargetLengthTask(const > char* name) : I wonder if the class name can be shortened a bit, sth like `G1ReviseYoungLengthTask`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2033251162 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2033222407 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2033929489 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2033975054 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2033934399 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2033910496 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2032008908 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2029855278 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r2029855435