Hi, a heap dump pauses the application's execution (STW), which is a 
well-known pain point. JDK-8252842 added parallel support to heap dump in an 
attempt to alleviate this issue. However, all concurrent threads compete to 
write heap data to the same file, and more memory is required to maintain the 
concurrent buffer queue. In our experiments, we did not see a significant 
performance improvement from it.

The minor-pause solution presented in this PR is a two-stage segmented heap 
dump:

1. Stage One (STW): Concurrent threads write data directly to multiple heap 
files.
2. Stage Two (Non-STW): Merge the multiple heap files into one complete heap 
dump file.

Concurrent worker threads no longer need to maintain a buffer queue, which 
would add memory overhead, nor do they need to compete for locks. This reduces 
application pause time by 73~80%.
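
To make the two stages concrete, here is a minimal Java sketch of the idea. It is only an illustration, not the HotSpot code in this patch: the class name, the method names, and the `heap.seg<N>` file naming are hypothetical, and byte arrays stand in for the heap data each worker would dump.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SegmentedDumpSketch {

    // Stage one (STW in the real VM): every worker writes its own slice to
    // its own segment file, so there is no shared buffer queue and no file lock.
    public static List<Path> writeSegments(byte[][] slices, Path dir) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(slices.length);
        try {
            List<Future<Path>> pending = new ArrayList<>();
            for (int i = 0; i < slices.length; i++) {
                final int id = i;
                // Hypothetical segment naming, chosen only for this sketch.
                Callable<Path> task = () -> Files.write(dir.resolve("heap.seg" + id), slices[id]);
                pending.add(pool.submit(task));
            }
            List<Path> segments = new ArrayList<>();
            for (Future<Path> f : pending) {
                segments.add(f.get()); // collected in submission order
            }
            return segments;
        } finally {
            pool.shutdown();
        }
    }

    // Stage two (non-STW): append the segments in order into one file,
    // deleting each segment once it has been copied.
    public static void mergeSegments(List<Path> segments, Path target) throws IOException {
        try (OutputStream out = Files.newOutputStream(target)) {
            for (Path segment : segments) {
                Files.copy(segment, out); // plain read + write copy
                Files.delete(segment);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("segdump");
        byte[][] slices = {
            "aaa".getBytes(StandardCharsets.US_ASCII),
            "bb".getBytes(StandardCharsets.US_ASCII),
            "cccc".getBytes(StandardCharsets.US_ASCII)
        };
        Path merged = dir.resolve("heap.hprof");
        mergeSegments(writeSegments(slices, dir), merged);
        System.out.println(new String(Files.readAllBytes(merged), StandardCharsets.US_ASCII));
        // prints "aaabbcccc"
    }
}
```

In this sketch only `writeSegments` corresponds to the paused phase; `mergeSegments` could run after the application has resumed.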

| Memory | Threads | STW | Total |
| --- | --- | --- | --- |
| 8g | 1 thread | 15.612 secs | 15.612 secs |
| 8g | 32 thread |  2.5617250 secs | 14.498 secs |
| 8g | 96 thread | 2.6790452 secs | 14.012 secs | 
| 16g | 1 thread | 26.278 secs | 26.278 secs |
| 16g | 32 thread |  5.2313740 secs | 26.417 secs |
| 16g | 96 thread | 6.2445556 secs | 27.141 secs |
| 32g | 1 thread | 48.149 secs | 48.149 secs |
| 32g | 32 thread | 10.7734677 secs | 61.643 secs | 
| 32g | 96 thread | 13.1522042 secs |  61.432 secs |
| 64g | 1 thread |  100.583 secs | 100.583 secs |
| 64g | 32 thread | 20.9233744 secs | 134.701 secs | 
| 64g | 96 thread | 26.7374116 secs | 126.080 secs | 
| 128g | 1 thread | 233.843 secs | 233.843 secs |
| 128g | 32 thread | 72.9945768 secs | 207.060 secs |
| 128g | 96 thread | 67.6815929 secs | 336.345 secs |

> **Total** is the total heap dump time, including both phases.
> **STW** is the time of the first phase only.
> For parallel dump, **Total** = **STW** + **Merge**. For serial dump, 
> **Total** = **STW**.
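
As a worked example of these definitions, the small Java snippet below derives the merge time and the pause reduction for the 64g rows of the table (1 thread versus 32 threads). The class and method names are made up for this illustration; the input numbers are taken from the table.

```java
import java.util.Locale;

public class DumpTimes {

    // For a parallel dump, Total = STW + Merge, so Merge = Total - STW.
    public static double mergeSeconds(double total, double stw) {
        return total - stw;
    }

    // Fraction of the serial pause that the parallel dump removes.
    public static double pauseReduction(double serialStw, double parallelStw) {
        return 1.0 - parallelStw / serialStw;
    }

    public static void main(String[] args) {
        double serialStw = 100.583;      // 64g, 1 thread
        double parallelStw = 20.9233744; // 64g, 32 threads
        double parallelTotal = 134.701;  // 64g, 32 threads

        System.out.printf(Locale.ROOT, "merge: %.3f secs%n",
                mergeSeconds(parallelTotal, parallelStw));
        // prints "merge: 113.778 secs"
        System.out.printf(Locale.ROOT, "pause reduction: %.1f%%%n",
                100.0 * pauseReduction(serialStw, parallelStw));
        // prints "pause reduction: 79.2%"
    }
}
```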

![image](https://user-images.githubusercontent.com/5010047/234534654-6f29a3af-dad5-46bc-830b-7449c80b4dec.png)

In actual testing, the two-stage solution can increase the overall heap dump 
time (see the table above). However, considering the reduction in STW time, I 
think it is an acceptable trade-off. Furthermore, there is still room for 
optimization in the second (merge) stage, e.g. using 
sendfile/splice/copy_file_range instead of a read+write combination. Since the 
number of parallel dump threads has a considerable impact on total dump time, I 
added a parameter that allows users to specify the number of parallel dump 
threads they wish to run.
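
To illustrate the kind of merge optimization mentioned above, here is a hedged Java sketch that appends segment files with `FileChannel.transferTo`, which lets the platform move bytes between files without an explicit user-space read+write loop where the OS supports it. It is an analogy only: the patch itself is HotSpot C++ code, and the class and method names here are hypothetical.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.List;

public class ChannelMergeSketch {

    // Appends every segment to target, in order, and returns the bytes copied.
    public static long merge(List<Path> segments, Path target) throws IOException {
        long total = 0;
        try (FileChannel out = FileChannel.open(target,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            for (Path segment : segments) {
                try (FileChannel in = FileChannel.open(segment, StandardOpenOption.READ)) {
                    long size = in.size();
                    long pos = 0;
                    while (pos < size) {
                        // transferTo may move fewer bytes than requested, so loop.
                        long n = in.transferTo(pos, size - pos, out);
                        if (n <= 0) {
                            break; // no progress, e.g. the segment was truncated
                        }
                        pos += n;
                        total += n;
                    }
                }
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("chanmerge");
        Path a = Files.write(dir.resolve("seg0"), "head".getBytes(StandardCharsets.US_ASCII));
        Path b = Files.write(dir.resolve("seg1"), "tail".getBytes(StandardCharsets.US_ASCII));
        Path merged = dir.resolve("merged");
        long copied = merge(Arrays.asList(a, b), merged);
        System.out.println(copied + " bytes: "
                + new String(Files.readAllBytes(merged), StandardCharsets.US_ASCII));
        // prints "8 bytes: headtail"
    }
}
```

Whether such a transfer actually avoids the extra copy depends on the operating system and file system, so any gain would have to be measured.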

##### Open discussion

- Pauseless heap dump solution?
An alternative pauseless solution is to fork a child process, set the parent 
process heap to read-only, and dump the heap in the child process. When the 
parent process writes to the heap, the child process observes the writes via 
userfaultfd and prioritizes the corresponding pages for dumping. I look 
forward to hearing comments and discussion about this solution.

- Client parser support for segmented heap dump
This patch raises the question of whether a heap dump needs to be a single 
complete file: can the VM directly generate a segmented heap dump and let the 
client parser complete the merge process? I look forward to hearing comments 
from the Eclipse MAT community.

-------------

Commit messages:
 - JDK-8306441: Segmented heap dump

Changes: https://git.openjdk.org/jdk/pull/13667/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13667&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8306441
  Stats: 2838 lines in 11 files changed: 1006 ins; 1770 del; 62 mod
  Patch: https://git.openjdk.org/jdk/pull/13667.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/13667/head:pull/13667

PR: https://git.openjdk.org/jdk/pull/13667
