Hi all,

  please review this change that implements the (currently draft) JEP: G1: Improve 
Application Throughput with a More Efficient Write-Barrier.

The reason for posting this early is that this is a large change, and the JEP 
process is already taking a very long time with no end in sight, but we would 
like to have this ready by JDK 25.

### Current situation

With this change, G1's post-write barrier is reduced to much more closely 
resemble Parallel GC's, as described in the JEP. The motivation is that G1 lags 
behind Parallel/Serial GC in throughput due to its larger barrier.

The main reason for the current barrier is how G1 implements concurrent 
refinement:
* G1 tracks dirtied cards using sets (dirty card queue set - DCQS) of buffers 
(dirty card queues - DCQ) containing the locations of dirtied cards. Refinement 
threads pick up their contents to refine them. The barrier needs to enqueue card 
locations.
* For correctness, dirty card updates require fine-grained synchronization 
between mutator and refinement threads.
* Finally, there is generic code to avoid dirtying cards altogether (filters), 
to avoid executing the synchronization and the enqueuing as much as possible.

These tasks require the current barrier to look as follows for an assignment 
`x.a = y` in pseudo code:


```
// Filtering
if (region(@x.a) == region(y)) goto done; // same-region check
if (y == null) goto done;                 // null value check
if (card(@x.a) == young_card) goto done;  // write-to-young-gen check
StoreLoad;                                // synchronize
if (card(@x.a) == dirty_card) goto done;  // already-dirty check

*card(@x.a) = dirty;

// Card tracking
enqueue(card-address(@x.a)) into thread-local-dcq;
if (thread-local-dcq is not full) goto done;

call runtime to move thread-local-dcq into dcqs

done:
```


Overall, this post-write barrier alone is in the range of 40-50 instructions in 
total, compared to three or four(!) for Parallel and Serial GC.

The large inlined barrier not only has a large code footprint, but also 
prevents some compiler optimizations such as loop unrolling or inlining.

There are several papers showing that this barrier alone can decrease 
throughput by 10-20% 
([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is 
corroborated by some benchmarks (see links).

The main idea of this change is to replace the fine-grained synchronization 
between refinement and mutator threads with coarse-grained synchronization 
based on atomically switching card tables. Mutators only ever work on the 
"primary" card table, while refinement threads work on a second card table 
(the "refinement table"). The refinement table also replaces the dirty card 
queues.

In that scheme the fine-grained synchronization is unnecessary because mutator 
and refinement threads always write to different memory areas, so no concurrent 
writes where an update could get lost can occur. This removes the need for 
synchronization on every reference write. Also, no card enqueuing is required 
any more.

Only the filters and the card mark remain.
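The coarse-grained scheme can be sketched as follows. This is a toy model, not HotSpot code: the struct, names, the tiny card count, and the 0/1 card encoding are all assumptions made for illustration; in the actual change the new pointer is propagated to each thread's TLS via handshakes, not a bare atomic store.

```cpp
#include <atomic>
#include <cstdint>
#include <cstddef>

// Toy model: two card tables whose roles are swapped atomically, so the
// mutator and the refinement threads never write to the same bytes.
struct CardTables {
    static constexpr size_t kNumCards = 1024; // assumption: tiny "heap"
    uint8_t table_a[kNumCards] = {};          // 0 = clean, 1 = dirty (toy encoding)
    uint8_t table_b[kNumCards] = {};
    std::atomic<uint8_t*> mutator_table{table_a}; // "primary" card table
    uint8_t* refinement_table = table_b;          // scanned by refinement threads

    // Swap the roles of the two tables. In the real implementation this
    // switch is made visible to all mutator threads via a handshake.
    void swap() {
        uint8_t* previous = mutator_table.exchange(refinement_table);
        refinement_table = previous;
    }
};
```

After a swap, cards dirtied by the mutator before the swap are visible only through `refinement_table`, while new mutator writes land on the previously clean table.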

### How this works

In the beginning, both the card table and the refinement table are completely 
unmarked (contain only "clean" cards). The mutator dirties the card table until 
G1's heuristics determine that a significant enough number of cards has been 
dirtied, based on the budget allotted for scanning them during garbage 
collection.

At that point, the card table and the refinement table are exchanged 
"atomically" using handshakes. The mutator keeps dirtying the (previous, clean 
refinement table, which is now the) card table, while the refinement threads 
look for and refine dirty cards on the refinement table as before.

Refinement of cards is very similar to before: if an interesting reference in a 
dirty card has been found, G1 records it in appropriate remembered sets. In 
this implementation there is an exception for references to the current 
collection set (typically young gen) - the refinement threads redirty that card 
on the card table with a special `to-collection-set` value.

This is valid because races with the mutator on that write do not matter: the 
entire card will eventually be rescanned anyway, regardless of whether it ends 
up as dirty or to-collection-set. The advantage of marking to-collection-set 
cards specially is that the next time the card tables are swapped, the 
refinement threads will not re-refine them, on the assumption that the 
reference to the collection set will not change. This decreases refinement work 
substantially.

If refinement gets interrupted by GC, the refinement table will be merged with 
the card table before card scanning, which works as before.

New barrier pseudo-code for an assignment `x.a = y`:

```
// Filtering
if (region(@x.a) == region(y)) goto done; // same-region check
if (y == null) goto done;                 // null value check
if (card(@x.a) != clean_card) goto done;  // skip already non-clean cards

*card(@x.a) = dirty;

done:
```

This is basically the Serial/Parallel GC barrier with additional filters to 
keep the number of dirty cards as small as possible.
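To make the shape of the new barrier concrete, here is a minimal executable model of it. This is an illustration, not HotSpot code: the region size, the clean/dirty byte values, and the flat address arithmetic are assumptions (the 512-byte card size matches HotSpot's default card table granularity).

```cpp
#include <cstdint>
#include <cstddef>

// Illustrative model of the new post-write barrier for `x.a = y`.
constexpr uint8_t kCleanCard  = 0xff; // assumed encoding for a clean card
constexpr uint8_t kDirtyCard  = 0x00; // assumed encoding for a dirty card
constexpr int     kCardShift  = 9;    // 512 bytes per card (HotSpot default)
constexpr int     kRegionShift = 20;  // assumed 1 MB regions for illustration

// card_base is the thread-local card table base; 'field' is the address of
// @x.a, 'value' the new reference (0 encodes null).
inline void post_write_barrier(uint8_t* card_base,
                               uintptr_t field, uintptr_t value) {
    if ((field >> kRegionShift) == (value >> kRegionShift)) return; // same region
    if (value == 0) return;                                         // null value
    uint8_t* card = card_base + (field >> kCardShift);
    if (*card != kCleanCard) return; // already dirty or to-collection-set
    *card = kDirtyCard;              // the only write the barrier ever does
}
```

Note that the conditional load before the store is exactly the `UseCondCardMark` filter mentioned below: it both limits the number of dirty cards and protects to-collection-set cards from being overwritten.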

A few more comments about the barrier:
* the barrier now loads the card table base from a thread-local instead of 
inlining the constant. This is necessary for this mechanism to work, as the 
card table to dirty changes over time; it may even be faster on some 
architectures due to smaller code size, and some platforms already do this.
* all existing pre-filters were kept. Benchmarks without them showed some 
significant regressions with respect to pause times and even throughput 
compared to G1 in master. Using the Parallel GC barrier (just the dirty card 
write) would be possible, and further investigation into stripping parts of 
the barrier will be made as a follow-up.
* the final check tests for non-clean cards to avoid overwriting existing 
card values, in particular the "to-collection-set" cards described above.

Current G1 marks the cards corresponding to young-gen regions as "young" so 
that the original barrier can potentially avoid the `StoreLoad`. This 
implementation removes this facility (it might be re-introduced later), but 
measurements showed that pre-dirtying the young generation regions' cards as 
"dirty" (G1 does not need an extra "young" value any more) did not yield any 
measurable performance difference.

### Refinement process

The goal of the refinement (threads) is to make sure that the number of cards 
to scan in the garbage collection is below a particular threshold.

The prototype changes the refinement threads into a single control thread and a 
set of (refinement) worker threads. Unlike in the previous implementation, the 
control thread does not do any refinement; it only executes the heuristics to 
start a calculated number of worker threads and tracks refinement progress.

The refinement trigger is based on the currently known number of pending (i.e. 
dirty) cards on the card table and a pending-card generation rate, fairly 
similar to the previous algorithm. After the refinement control thread 
determines that it is time to do refinement, it starts the following sequence:

1) **Swap the card table**. This consists of several steps:
    1) **Swap the global card table** - the global card table pointer is 
swapped; newly created threads and runtime calls will eventually use the new 
values, at the latest after the next two steps.
    2) **Update the pointers in all JavaThreads'** TLS storage to the new card 
table pointer using a handshake operation.
    3) **Update the pointers in the GC threads'** TLS storage to the new card 
table pointer using the SuspendibleThreadSet mechanism.
2) **Snapshot the heap** - determine the extent of work needed for all regions 
where the refinement threads need to do some work on the refinement table (the 
previous card table). The snapshot stores the work progress for each region so 
that work can be interrupted and continued at any time.
    This work either consists of refinement of the particular card (old 
generation regions) or clearing the cards (next collection set/young generation 
regions).
3) **Sweep the refinement table** by activating the refinement worker threads. 
The threads refine dirty cards using the heap snapshot where worker threads 
claim parts of regions to process.
      * Cards with references to the young generation are not added to the 
young generation's card-based remembered set. Instead these cards are marked as 
to-collection-set in the card table and any remaining refinement of that card 
is skipped.
      * If refinement encounters a card that is already marked as 
to-collection-set, it is not refined but re-marked as to-collection-set on the 
card table.
      * During refinement, the refinement table is also cleared (in bulk for 
collection set regions, as they do not need any refinement; other regions are 
cleared as their non-clean cards are refined).
      * Dirty cards within unparsable heap areas are forwarded to/redirtied on 
the card table as is.
4) **Completion work**, mostly statistics.

If the work is interrupted by a non-garbage collection synchronization point, 
work is suspended temporarily and resumed later using the heap snapshot.

After the refinement process the refinement table is all clean again and ready 
for the next swap.
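The per-card work of the sweep (step 3 above) can be sketched as follows. This is a toy model under stated assumptions: the card value encodings, the slice claiming, and the `refine_card` callback (standing in for the real remembered-set update) are all hypothetical names for illustration.

```cpp
#include <cstdint>
#include <cstddef>

// Toy model of one refinement worker sweeping a claimed slice of the
// refinement table. Encodings are assumptions, not HotSpot's actual values.
constexpr uint8_t kClean  = 0xff;
constexpr uint8_t kDirty  = 0x00;
constexpr uint8_t kToCSet = 0x01; // hypothetical "to-collection-set" value

// refine_card stands in for the real remembered-set update; it returns true
// if the card contains a reference into the current collection set.
void sweep_slice(uint8_t* card_table, uint8_t* refinement_table,
                 size_t begin, size_t end, bool (*refine_card)(size_t)) {
    for (size_t i = begin; i < end; i++) {
        uint8_t v = refinement_table[i];
        refinement_table[i] = kClean;  // post-condition: table ends up all clean
        if (v == kToCSet) {
            card_table[i] = kToCSet;   // re-mark on the card table, no re-refine
        } else if (v == kDirty && refine_card(i)) {
            card_table[i] = kToCSet;   // reference into the collection set
        }
        // otherwise: clean card, or a dirty card whose references were
        // recorded in ordinary remembered sets inside refine_card
    }
}
```

In the real implementation the slices come from the heap snapshot, so a worker can be interrupted and another can resume at the recorded position.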

### Garbage collection pause changes

Since a garbage collection (young or full gc) pause may occur at any point 
during the refinement process, the garbage collection needs some compensating 
work for the not yet swept parts of the refinement table.

Note that this situation is very rare, and the heuristics try to avoid that, so 
in most cases nothing needs to be done as the refinement table is all clean.

If this happens, young collections add a new phase called `Merge Refinement 
Table` in the garbage collection pause right before the `Merge Heap Roots` 
phase. This compensating phase does the following:

  0) (Optional) Snapshot the heap if not done yet (if the process has been 
interrupted between steps 1 and 3 of the refinement process)
  1) Merge the refinement table into the card table - in this step the dirty 
cards of interesting regions are forwarded to the card table
  2) Completion work (statistics)
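The merge step itself amounts to a linear pass over the card bytes; a minimal sketch (toy encodings, flat card indices, not HotSpot code) might look like this:

```cpp
#include <cstdint>
#include <cstddef>

// Toy model of the "Merge Refinement Table" phase: every non-clean card on
// the refinement table is forwarded to the card table, restoring the
// all-clean post-condition. Encodings are assumptions for illustration.
constexpr uint8_t kCleanCard = 0xff;
constexpr uint8_t kDirtyCard = 0x00;

void merge_refinement_table(uint8_t* card_table, uint8_t* refinement_table,
                            size_t num_cards) {
    for (size_t i = 0; i < num_cards; i++) {
        if (refinement_table[i] != kCleanCard) {
            card_table[i] = kDirtyCard;        // will be scanned during the GC
            refinement_table[i] = kCleanCard;  // leave the table all clean
        }
    }
}
```

Since each card is processed independently, this pass parallelizes trivially across workers.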

If a full collection interrupts concurrent refinement, the refinement table is 
simply cleared and all dirty cards thrown away.

A garbage collection generates new cards (e.g. references from promoted objects 
into the young generation) on the refinement table. This replaces the extra 
DCQS that the previous implementation used to record these interesting 
references/cards and redirty the card table with. G1 swaps the card tables at 
the end of the collection to keep the post-condition that the refinement table 
is all clean (and any to-be-refined cards are on the card table) at the end of 
garbage collection.

### Performance metrics

Following is an overview of the changes in behavior. Some numbers are provided 
in the CR in the first comment.

#### Native memory usage

The refinement table takes an additional 0.2% of the Java heap size of native 
memory compared to JDK 21 and above (in JDK 21 we removed one card table sized 
data structure, so this is a non-issue when updating from before).

Some of that additional memory usage is offset by removing the dirty card 
queues. More memory is reclaimed by managing the cards containing 
to-collection-set references on the card table: the explicit remembered sets 
for the young generation are dropped completely, as are any remembered set 
entries that would otherwise be duplicated into other regions' remembered sets.

In some applications/benchmarks these gains completely offset the additional 
card table; most of the time this is not the case though, particularly for 
throughput-oriented applications.
It would be possible to allocate the refinement table lazily: since such 
applications often do not need any concurrent refinement at all, there would 
be no overhead but actually a net reduction of native memory usage. This is 
not implemented in this prototype.

#### Latency ("Pause times")

Not affected or slightly better. Pause times decrease due to a shorter "Merge 
remembered sets" phase, since no work is required for the young generation's 
remembered sets - they are always already on the card table!

Merging the refinement table into the card table adds a new phase, but it is 
extremely fast and was always faster than merging remembered sets for the young 
gen in my measurements. Since this work linearly scans some memory, it is 
embarrassingly parallel too.

The cards created during garbage collection do not need to be redirtied, so 
that phase has also been removed.

The card table swap is based on predictions of the mutator card dirtying rate 
and the refinement rate as before, and the policy is actually fairly similar 
to before. It is still rather aggressive, but in most cases takes fewer CPU 
resources than before, mostly because refining takes less CPU time. Many 
applications do not do any refinement at all, as before. More investigation 
could be done to improve this in the future.

#### Throughput

This change always increases throughput in my measurements, although depending 
on the benchmark/application it may not actually show up in scores.

Due to the pre-barrier and the additional filters in the barrier, G1 is still 
slower than Parallel GC on raw throughput benchmarks, but typically ends up 
somewhere halfway to Parallel GC or closer.

### Platform support

Since the post-write barrier changed, additional work on some platforms is 
required to allow this change to proceed. At this time the work for all 
platforms is done, but needs testing:

- GraalVM (contributed by the GraalVM team)
- S390 (contributed by A. Kumar from IBM)
- PPC (contributed by M. Doerr, from SAP)
- ARM (should work, HelloWorld compiles and runs)
- RISCV (should work, HelloWorld compiles and runs)
- x86 (should work, build/HelloWorld compiles and runs)

None of the above-mentioned platforms implement the barrier method that writes 
cards for a reference array (only aarch64 and x64 are fully implemented); they 
call into the runtime as before. I believe this is now fairly easy to do with 
the simplified barrier, for some extra performance, but it is not necessary.

### Alternatives

The JEP text extensively discusses alternatives.

### Reviewing

The change can be roughly divided into these fairly isolated parts:
* platform specific changes to the barrier
* refinement and refinement control thread changes; this is best reviewed 
starting from the `G1ConcurrentRefineThread::run_service` method
* changes to garbage collection: `merge_refinement_table()` in `g1RemSet.cpp`
* policy modifications are typically related to code around the calls to 
`G1Policy::record_dirtying_stats`.

Further information is available in the [JEP 
draft](https://bugs.openjdk.org/browse/JDK-8340827); there is also a somewhat 
more extensive discussion of the change on my 
[blog](https://tschatzl.github.io/2025/02/21/new-write-barriers.html).

Some additional comments:
* the pre-marking of young generation cards has been removed. Benchmarks did 
not show any significant difference either way. To me this makes sense because 
the entire young gen will quickly get marked anyway, i.e. one only saves a 
single additional card table write per card. With the old barrier the cost of 
a card table mark was much higher.
* G1 sets `UseCondCardMark` to true by default. The conditional card mark 
corresponds to the third filter in the write barrier now, and since I decided 
to keep all filters for this change, it makes sense to use this mechanism 
directly.

If there are any questions, feel free to ask.

Testing: tier1-7 (multiple tier1-7, tier1-8 with slightly older versions)

Thanks,
  Thomas

-------------

Commit messages:
 - * only provide byte map base for JavaThreads
 - * mdoerr review: fix comments in ppc code
 - * fix crash when writing dirty cards for memory regions during card table 
switching
 - * remove mention of "enqueue" or "enqueuing" for actions related to post 
barrier
 - * remove some commented out debug code
 - Card table as DCQ

Changes: https://git.openjdk.org/jdk/pull/23739/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8342382
  Stats: 6543 lines in 103 files changed: 2162 ins; 3461 del; 920 mod
  Patch: https://git.openjdk.org/jdk/pull/23739.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739

PR: https://git.openjdk.org/jdk/pull/23739
