The proposed change here is the following:
 1. Move the `vmdependencies` list head from the `Context` to the `CallSite` 
object
 2. Remove the Context object and its corresponding cleaner

(1.) fixes the crash. (2.) is because without `vmdependencies` the Context and 
its cleaner serves no purpose.

First what goes wrong without this patch.

Currently when the GC unlinks a nmethod it clean/flush out all the dependency 
contexts. These are either attached to an InstanceKlass, or a Context object 
reachable through a CallSite oop. A nmethod is unlinked when either one of its 
oops have died or it is heuristically determined to be cold and should be 
unloaded. So when the GC clean's the context through a CallSite oop, it may be 
that that CallSite object is dead. 

For ZGC, which is a concurrent generational GC (the different generations are 
collected concurrently, but in a coordinated manner), it is important that the 
unlinking is coordinated with the reclamation of this dead object. In 
generational ZGC all nmethod oops are considered as strong roots if they reside 
in the young generation and thusly can only become unreachable / dead after 
promotion to the old generation. This means that the CallSite object at the 
time of unlinking is either reachable / live, or unreachable / dead and is 
reclaimed by the old generation collection (the same generation that does the 
unlinking). So we can make reading from this object safe by not reclaiming the 
object before unlinking is finished. 

The issue is that we do not have this guarantee for the Context object. As this 
is a distinct object it may be that it has not been promoted and resides in the 
young generation at the time of its CallSite object becoming unreachable and 
collected by the old generation collection.

If this is the case and a young generation collection runs after old marking 
has finished, we have two bad scenarios. If it the young generation collection 
starts after reference processing and the cleaner has run, the Context object 
would be unreachable and the young generation collection  would reclaim the 
memory. If it started before the reference processing it would still be 
reachable, but may be relocated. 

For reachable old CallSite objects the Context oop field would have been 
tracked and remapped if a young collection relocates the Context object, 
however this is not true then the CallSite is unreachable. The Context object 
may have moved or been reclaimed, and the load barrier on the field will 
produce a bogus oop because the field will not have been tracked (and remapped) 
correctly. This will cause the crash.

The solution in this patch is to remove the Context object and store the 
`vmdependencies` list head directly in the CallSite object. This avoids any 
cross generation pointers. 

An alternative would have been to keep the cleaner and change the GC code to 
only look at CallSite object which are reachable. However this would require a 
per-GC adaption and would be a larger change. (Even if it is only really ZGC 
which would need to do anything, maybe that generational Shenandoah is also 
affected, but not something I have looked into)

I've tried to dig and find why a cleaner was used in the first place as it 
seems like the GC has cleaned these / removed unloading nmethods for a while. 
Maybe someone else have more insight.

The lifetime of the native memory is now solely tied to the lifetime of the 
nmthod it tracks a dependency for. This means that it is important that after 
`add_dependent_nmethod` it will eventually be seen by and can be unlinked by 
the GC. As far as I can tell this seem to be the case, `add_dependent_nmethod` 
is called after a nmethod has been created in the code cache. 

Tested:
 * Github Actions
 * Oracle supported platforms tier1 through 8

-------------

Commit messages:
 - 8347564: ZGC: Crash in DependencyContext::clean_unloading_dependents

Changes: https://git.openjdk.org/jdk/pull/23194/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23194&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8347564
  Stats: 168 lines in 13 files changed: 8 ins; 143 del; 17 mod
  Patch: https://git.openjdk.org/jdk/pull/23194.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/23194/head:pull/23194

PR: https://git.openjdk.org/jdk/pull/23194

Reply via email to