[ https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166310#comment-17166310 ]
Etienne Chauchot commented on FLINK-17073:
------------------------------------------

[~SleePy] Sure, I'll update the Google doc to add the implementation plan.

> Slow checkpoint cleanup causing OOMs
> ------------------------------------
>
>                 Key: FLINK-17073
>                 URL: https://issues.apache.org/jira/browse/FLINK-17073
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0
>            Reporter: Till Rohrmann
>            Assignee: Etienne Chauchot
>            Priority: Major
>             Fix For: 1.12.0
>
>
> A user reported that he sees a decline in checkpoint cleanup speed when
> upgrading from Flink 1.7.2 to 1.10.0. As a result, many cleanup tasks wait
> in the execution queue and occupy memory. Ultimately, the JM process dies
> with an OOM.
> Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is
> used by the {{HighAvailabilityServices}} (FLINK-11851). Before that, we used
> the {{AkkaRpcService}} thread pool, which was a {{ForkJoinPool}} with a max
> parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as
> CPU cores. This change might have caused the decline in completed checkpoint
> discard throughput. This suspicion needs to be validated before trying to fix
> it!
> [1]
> https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E
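
To illustrate the suspected mechanism from the description (not to validate it), here is a minimal sketch of how a fixed-size pool with an unbounded queue accumulates pending tasks when each task is slow. This is not Flink code: the class name {{CleanupBacklogSketch}}, the method {{backlogWhileWorkersBusy}}, and the task count of 1000 are hypothetical, and a blocking latch stands in for a slow checkpoint discard. A 64-thread fixed pool is used only as a rough stand-in for the old parallelism limit; a real {{ForkJoinPool}} schedules work differently.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class CleanupBacklogSketch {

    /**
     * Submits {@code tasks} blocking tasks to a fixed-size pool with {@code threads}
     * workers and returns how many tasks sit in the queue while all workers are busy.
     */
    public static int backlogWhileWorkersBusy(int threads, int tasks) throws InterruptedException {
        // Same shape as Executors.newFixedThreadPool(threads): fixed worker count,
        // unbounded LinkedBlockingQueue, so pending tasks pile up on the heap.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        CountDownLatch release = new CountDownLatch(1);
        CountDownLatch started = new CountDownLatch(Math.min(threads, tasks));
        try {
            for (int i = 0; i < tasks; i++) {
                pool.execute(() -> {
                    started.countDown();
                    try {
                        release.await(); // hypothetical stand-in for a slow discard
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            started.await(); // every worker is now blocked inside a task
            return pool.getQueue().size();
        } finally {
            release.countDown(); // let the queued tasks drain
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println(cores + " threads: backlog = " + backlogWhileWorkersBusy(cores, 1000));
        System.out.println("64 threads: backlog = " + backlogWhileWorkersBusy(64, 1000));
    }
}
```

With fewer workers, more tasks remain queued at any moment, and each queued task holds its references until it runs. Whether this accounts for the reported OOM still needs to be measured on an actual JM.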