[jira] [Created] (FLINK-17073) Slow checkpoint cleanup causing OOMs

Till Rohrmann (Jira) Thu, 09 Apr 2020 07:25:22 -0700

Till Rohrmann created FLINK-17073:
-------------------------------------

             Summary: Slow checkpoint cleanup causing OOMs
                 Key: FLINK-17073
                 URL: https://issues.apache.org/jira/browse/FLINK-17073
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Checkpointing, Runtime / Coordination
    Affects Versions: 1.10.0, 1.9.0, 1.8.0, 1.7.3, 1.11.0
            Reporter: Till Rohrmann
             Fix For: 1.9.3, 1.10.1, 1.11.0



A user reported [1] that checkpoint cleanup became slower after upgrading 
from Flink 1.7.2 to 1.10.0. As a result, many cleanup tasks wait in the 
execution queue and occupy memory. Ultimately, the JM process dies with an 
OOM.

Since Flink 1.7.2, we have introduced a dedicated {{ioExecutor}} which is used 
by the {{HighAvailabilityServices}}. Previously, we used the {{AkkaRpcService}} 
thread pool, which was a {{ForkJoinPool}} with a max parallelism of 64. Now it 
is a {{FixedThreadPool}} with as many threads as CPU cores. This change might 
have reduced the throughput at which completed checkpoints are discarded. This 
suspicion needs to be validated before trying to fix it!
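
To make the suspected mechanism concrete, here is a minimal Java sketch. It is 
not Flink code: the class name {{DiscardBacklogSketch}} and the method 
{{queuedAfterSubmit}} are hypothetical, the blocking latch only stands in for a 
slow discard call, and a fixed pool of 64 threads is used as a rough stand-in 
for the old {{ForkJoinPool}} parallelism. It only shows how many tasks pile up 
in an unbounded queue while all workers are blocked.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class DiscardBacklogSketch {

    /**
     * Submits `tasks` blocking jobs to a fixed pool of `threads` workers and
     * returns how many jobs are still waiting in the queue while all workers
     * are busy. Hypothetical helper, for illustration only.
     */
    public static int queuedAfterSubmit(int threads, int tasks) throws InterruptedException {
        // Same shape as Executors.newFixedThreadPool(threads):
        // a fixed number of workers in front of an unbounded queue.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        CountDownLatch release = new CountDownLatch(1);
        try {
            for (int i = 0; i < tasks; i++) {
                pool.execute(() -> {
                    try {
                        // Assumption: stands in for a slow, blocking discard call.
                        release.await();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            // Every queued task is held in memory until a worker frees up.
            return pool.getQueue().size();
        } finally {
            // Unblock the workers and let the pool drain before returning.
            release.countDown();
            pool.shutdown();
            pool.awaitTermination(10, TimeUnit.SECONDS);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        int cores = Runtime.getRuntime().availableProcessors();
        int tasks = 1_000;
        System.out.println("threads=" + cores + " queued=" + queuedAfterSubmit(cores, tasks));
        System.out.println("threads=64 queued=" + queuedAfterSubmit(64, tasks));
    }
}
```

While every worker is blocked, the backlog is the number of submitted tasks 
minus the number of threads, so a pool sized to a small core count holds more 
pending tasks than one with 64 threads. Whether this accounts for the reported 
OOM still has to be validated against a real setup.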

[1] 
https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
