On Sun, 23 Feb 2025 18:53:33 GMT, Thomas Schatzl <tscha...@openjdk.org> wrote:
> Hi all,
>
> please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier.
>
> The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight, but we would like to have this ready by JDK 25.
>
> ### Current situation
>
> With this change, G1 will reduce the post-write barrier to much more closely resemble Parallel GC's, as described in the JEP. The reason is that G1 lags behind Parallel/Serial GC in throughput due to its larger barrier.
>
> The main reason for the current barrier is how G1 implements concurrent refinement:
> * G1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the locations of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations.
> * For correctness, dirty card updates require fine-grained synchronization between mutator and refinement threads.
> * Finally, there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible.
>
> These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code:
>
>     // Filtering
>     if (region(@x.a) == region(y)) goto done; // same region check
>     if (y == null) goto done;                 // null value check
>     if (card(@x.a) == young_card) goto done;  // write to young gen check
>     StoreLoad;                                // synchronize
>     if (card(@x.a) == dirty_card) goto done;
>
>     *card(@x.a) = dirty
>
>     // Card tracking
>     enqueue(card-address(@x.a)) into thread-local-dcq;
>     if (thread-local-dcq is not full) goto done;
>
>     call runtime to move thread-local-dcq into dcqs
>
>     done:
>
> Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for Parallel and Serial GC.
>
> The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining.
>
> There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links).
>
> The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse-grained synchronization based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se...

The PPC64 code looks great! Thanks for doing this! Only a few code comments are no longer correct.
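For anyone skimming the thread: my rough reading of the reduced barrier described above, written as a small stand-alone C++ toy model rather than the actual HotSpot code. The filters stay, while the StoreLoad fence and the dirty card queue enqueueing go away and the card is simply dirtied. Everything in it (the `toy` namespace, `post_write_barrier`, `kRegionShift`, `kCardShift`, `g_card_table`, the card values and sizes) is made up for illustration.

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    // Toy model of the reduced G1 post-write barrier sketched in the quoted JEP
    // text: cross-region and null filters kept, no StoreLoad, no dirty card
    // queue. Names, region/card sizes and card values are illustrative only.
    namespace toy {

    constexpr int           kRegionShift   = 22;            // pretend 4 MB heap regions
    constexpr int           kCardShift     = 9;             // pretend 512-byte cards
    constexpr std::size_t   kCardTableSize = std::size_t(1) << 20;
    constexpr unsigned char kDirtyCard     = 0;
    constexpr unsigned char kCleanCard     = 0xff;

    unsigned char g_card_table[kCardTableSize];              // stands in for the "primary" card table

    inline void post_write_barrier(void** field, void* new_val) {
      const std::uintptr_t f = reinterpret_cast<std::uintptr_t>(field);
      const std::uintptr_t v = reinterpret_cast<std::uintptr_t>(new_val);

      // Filtering, kept from the current barrier:
      if (((f ^ v) >> kRegionShift) == 0) return;            // same region check
      if (new_val == nullptr) return;                        // null value check

      // No synchronization, no enqueueing -- just dirty the card, with an
      // optional conditional-card-mark style check to limit redundant stores.
      unsigned char* card = &g_card_table[(f >> kCardShift) % kCardTableSize];
      if (*card != kCleanCard) return;
      *card = kDirtyCard;
    }

    } // namespace toy

    int main() {
      std::memset(toy::g_card_table, toy::kCleanCard, sizeof(toy::g_card_table));
      // Synthetic addresses in two different toy regions; the barrier only does
      // address arithmetic and never dereferences them.
      void** field   = reinterpret_cast<void**>(std::uintptr_t(1) << toy::kRegionShift);
      void*  new_val = reinterpret_cast<void*>(std::uintptr_t(5) << toy::kRegionShift);
      toy::post_write_barrier(field, new_val);

      const std::size_t idx = ((std::uintptr_t(1) << toy::kRegionShift) >> toy::kCardShift) % toy::kCardTableSize;
      std::printf("card for the stored-to field is now %s\n",
                  toy::g_card_table[idx] == toy::kDirtyCard ? "dirty" : "clean");
      return 0;
    }

The cost the mutator used to pay for enqueueing is then shifted to the refinement side, which (as far as I read the truncated description) finds dirty cards by atomically switching card tables instead of consuming queues.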
src/hotspot/cpu/ppc/gc/g1/g1BarrierSetAssembler_ppc.cpp line 244:

> 242:
> 243:   __ xorr(R0, store_addr, new_val);                  // tmp1 := store address ^ new value
> 244:   __ srdi_(R0, R0, G1HeapRegion::LogOfHRGrainBytes); // tmp1 := ((store address ^ new value) >> LogOfHRGrainBytes)

Comment: R0 is used here instead of tmp1.

src/hotspot/cpu/ppc/gc/g1/g1BarrierSetAssembler_ppc.cpp line 259:

> 257:
> 258:   __ ld(tmp1, G1ThreadLocalData::card_table_base_offset(), thread);
> 259:   __ srdi(tmp2, store_addr, CardTable::card_shift()); // tmp1 := card address relative to card table base

Comment: tmp2 is used here, not tmp1.

src/hotspot/cpu/ppc/gc/g1/g1BarrierSetAssembler_ppc.cpp line 261:

> 259:   __ srdi(tmp2, store_addr, CardTable::card_shift()); // tmp1 := card address relative to card table base
> 260:   if (UseCondCardMark) {
> 261:     __ lbzx(R0, tmp1, tmp2);                          // tmp1 := card address

Can you remove the comment, please? It's wrong.

-------------

PR Review: https://git.openjdk.org/jdk/pull/23739#pullrequestreview-2637143540
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1967669777
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1967670850
PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1967671593
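Spelling out the comment fixes above, purely as a suggestion: the instruction lines are copied verbatim from the quoted hunks, only the `//` comments differ, and for line 261 the replacement comment is just one possibility in case a comment should stay at all.

    // Lines 243/244: the scratch register is R0, not tmp1.
    __ xorr(R0, store_addr, new_val);                  // R0 := store address ^ new value
    __ srdi_(R0, R0, G1HeapRegion::LogOfHRGrainBytes); // R0 := (store address ^ new value) >> LogOfHRGrainBytes

    // Line 259: the shifted value goes into tmp2 (tmp1 holds the card table base).
    __ ld(tmp1, G1ThreadLocalData::card_table_base_offset(), thread);
    __ srdi(tmp2, store_addr, CardTable::card_shift()); // tmp2 := card address relative to card table base

    // Line 261: the load actually reads the card byte into R0.
    __ lbzx(R0, tmp1, tmp2);                           // R0 := card value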