Hi Marc

I think the occupied memory is due to completed checkpoints that are pending removal:
the removal tasks are stored in the work queue of the io-executor [1] used by
ZooKeeperCompletedCheckpointStore [2]. One clue supporting this is that
Executors#newFixedThreadPool creates a ThreadPoolExecutor backed by an unbounded
LinkedBlockingQueue to hold the submitted runnables.
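To illustrate the mechanism, here is a minimal standalone sketch (plain JDK code, not
Flink code; the sizes and sleep are only placeholders) showing how runnables submitted
faster than they complete pile up in the unbounded queue and keep everything they
capture reachable on the heap:

    import java.util.concurrent.Executors;
    import java.util.concurrent.ThreadPoolExecutor;

    public class UnboundedQueueDemo {
        public static void main(String[] args) {
            // Executors#newFixedThreadPool is equivalent to:
            //   new ThreadPoolExecutor(n, n, 0L, TimeUnit.MILLISECONDS,
            //                          new LinkedBlockingQueue<Runnable>())
            // i.e. the work queue is unbounded.
            ThreadPoolExecutor ioExecutor =
                    (ThreadPoolExecutor) Executors.newFixedThreadPool(4);

            for (int i = 0; i < 1_000; i++) {
                // Stands in for data referenced by a to-be-discarded checkpoint.
                byte[] retainedState = new byte[1024 * 1024];
                ioExecutor.execute(() -> {
                    try {
                        // Slow discard work, e.g. remote storage / ZooKeeper round-trips.
                        Thread.sleep(1_000);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                    // retainedState stays reachable from the queued Runnable until it runs.
                    consume(retainedState);
                });
            }

            // The queued Runnables (and whatever they capture) dominate the heap.
            System.out.println("Queued tasks: " + ioExecutor.getQueue().size());
            ioExecutor.shutdown();
        }

        private static void consume(byte[] data) {
            // no-op: only keeps the captured reference in use
        }
    }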

To figure out the root cause, could you please provide the information below:

  1.  How large is your checkpoint metadata? You can check the size of
{checkpoint-dir}/chk-X/_metadata. Please also tell us which state backend you use,
as that helps estimate the metadata size.
  2.  What is your checkpoint interval? A small interval can cause many completed
checkpoints to accumulate for subsumption once a newer checkpoint completes (see
the configuration sketch after this list).
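For reference, a minimal sketch of where the interval and state backend are set in a
job (the interval, pause, and checkpoint path below are only placeholder values; the
number of retained checkpoints is controlled by state.checkpoints.num-retained in
flink-conf.yaml):

    import org.apache.flink.runtime.state.filesystem.FsStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointConfigSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            // Checkpoint interval: a very small value (a few seconds) produces many
            // completed checkpoints that the store later has to subsume and discard.
            env.enableCheckpointing(60_000); // 60 s, example value only

            // Leave some breathing room between checkpoints so discard work can keep up.
            env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000);

            // State backend and checkpoint directory (hypothetical path); the size of
            // {checkpoint-dir}/chk-X/_metadata depends on the backend in use.
            env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"));

            // Trivial pipeline just to make the sketch executable.
            env.fromElements(1, 2, 3).print();
            env.execute("checkpoint-config-sketch");
        }
    }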

[1] 
https://github.com/apache/flink/blob/d7e247209358779b6485062b69965b83043fb59d/flink-runtime/src/main/java/org/apache/flink/runtime/entrypoint/ClusterEntrypoint.java#L260
[2] 
https://github.com/apache/flink/blob/d7e247209358779b6485062b69965b83043fb59d/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/ZooKeeperCompletedCheckpointStore.java#L234

Best
Yun Tang

________________________________
From: Marc LEGER <maleger...@gmail.com>
Sent: Wednesday, April 8, 2020 16:50
To: user@flink.apache.org <user@flink.apache.org>
Subject: Possible memory leak in JobManager (Flink 1.10.0)?

Hello,

I am currently testing Flink 1.10.0 but I am facing memory issues with 
JobManagers deployed in a standalone cluster configured in HA mode with 3 
TaskManagers (and 3 running jobs).
I do not reproduce the same issues using Flink 1.7.2.

Basically, whatever the value of the "jobmanager.heap.size" property (I tried 2 GB,
then 4 GB, and finally 8 GB), the leader JobManager process eventually consumes all
available memory and hangs after a few hours or days (depending on the heap size)
before being disassociated from the cluster.

I am using OpenJ9 JVM with Java 11 on CentOS 7.6 machines:
openjdk version "11.0.6" 2020-01-14
OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.6+10)
Eclipse OpenJ9 VM AdoptOpenJDK (build openj9-0.18.1, JRE 11 Linux amd64-64-Bit 
Compressed

I took a heap dump of the JobManager Java process and generated a "Leak Suspects"
report using Eclipse MAT.
The tool detects one main suspect (cf. attached screenshots):

One instance of "java.util.concurrent.ThreadPoolExecutor" loaded by "<system 
class loader>" occupies 580,468,280 (92.82%) bytes. The instance is referenced 
by org.apache.flink.runtime.highavailability.zookeeper.ZooKeeperHaServices @ 
0x8041fb48 , loaded by "<system class loader>".

Has anyone already faced such an issue?

Best Regards,
Marc
