> Hi all, > > please review this change that implements (currently Draft) JEP: G1: > Improve Application Throughput with a More Efficient Write-Barrier. > > The reason for posting this early is that this is a large change, and the JEP > process is already taking very long with no end in sight but we would like to > have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble > Parallel GC's as described in the JEP. The reason is that G1 lacks in > throughput compared to Parallel/Serial GC due to larger barrier. > > The main reason for the current barrier is how g1 implements concurrent > refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers > (dirty card queues - dcq) containing the location of dirtied cards. > Refinement threads pick up their contents to re-refine. The barrier needs to > enqueue card locations. > * For correctness dirty card updates requires fine-grained synchronization > between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), > to avoid executing the synchronization and the enqueuing as much as possible. > > These tasks require the current barrier to look as follows for an assignment > `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total > instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, > but also prevents some compiler optimizations like loop unrolling or inlining. > > There are several papers showing that this barrier alone can decrease > throughput by 10-20% > ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is > corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization > between refinement and mutator threads, but coarse grained based on > atomically switching card tables. Mutators only work on the "primary" card > table, refinement threads on a se...
Thomas Schatzl has updated the pull request incrementally with two additional commits since the last revision: - * indentation fix - * remove support for 32 bit x86 in the barrier generation code, following latest changes from @shade ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23739/files - new: https://git.openjdk.org/jdk/pull/23739/files/fcf96a2a..068d2a37 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=32 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=31-32 Stats: 5 lines in 1 file changed: 0 ins; 2 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739