[ https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166480#comment-17166480 ]

Etienne Chauchot edited comment on FLINK-17073 at 7/28/20, 4:09 PM:
--------------------------------------------------------------------

[~roman_khachatryan] 

When [~SleePy] and I discussed in [the design 
doc|https://docs.google.com/document/d/1q0y0aWlJMoUWNW7jjsM8uWfHsy2dM6YmmcmhpQzgLMA/edit?usp=sharing], 
the idea was to wait until the last checkpoint was cleaned before accepting 
another one (that is what we called making cleaning part of checkpoint 
processing). Thus, checking only the existing number of pending checkpoints was 
enough (no need for a new queue) to anticipate a flood of checkpoints to clean.
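
For illustration, a minimal sketch of that original idea (hypothetical names, not actual Flink code): a checkpoint would only leave the pending state once its cleanup has finished, so gating new triggers on the pending count alone would also throttle cleanup:

{code:java}
/**
 * Hypothetical sketch of the original design-doc idea: cleanup is part of
 * checkpoint processing, so a checkpoint stays "pending" until its cleanup
 * has finished, and new triggers are gated on the pending count alone.
 */
class PendingCountGate {

    private final int maxConcurrentCheckpoints;
    private int numPendingCheckpoints; // incremented on trigger, decremented only after cleanup

    PendingCountGate(int maxConcurrentCheckpoints) {
        this.maxConcurrentCheckpoints = maxConcurrentCheckpoints;
    }

    synchronized boolean canTriggerCheckpoint() {
        return numPendingCheckpoints < maxConcurrentCheckpoints;
    }

    synchronized void onCheckpointTriggered() {
        numPendingCheckpoints++;
    }

    synchronized void onCheckpointCleanedUp() {
        numPendingCheckpoints--;
    }
}
{code}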

But the solution you propose (managing the queue of checkpoints to clean and 
monitoring its size) seems even simpler to me: it avoids having to sync normal 
checkpointing and checkpoint cleaning.

As you said, when we choose a checkpoint trigger request to execute 
(*CheckpointRequestDecider.chooseRequestToExecute*), we can drop new checkpoint 
requests when there are too many checkpoints to clean and thus regulate the 
whole checkpointing system. You're right that syncing cleaning and 
checkpointing might not be necessary for this regulation.
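
For illustration, a minimal sketch of this regulation (hypothetical names, not the actual *CheckpointRequestDecider* code): the decider drops new trigger requests while the cleanup backlog is above some threshold:

{code:java}
import java.util.Optional;
import java.util.Queue;

/**
 * Hypothetical sketch, not the actual CheckpointRequestDecider: new trigger
 * requests are dropped while the cleanup backlog is above a threshold.
 */
class CleanupAwareDecider<R> {

    private final Queue<Runnable> checkpointsToClean; // cleanup tasks still queued for the ioExecutor
    private final int maxCheckpointsToClean;          // hypothetical threshold

    CleanupAwareDecider(Queue<Runnable> checkpointsToClean, int maxCheckpointsToClean) {
        this.checkpointsToClean = checkpointsToClean;
        this.maxCheckpointsToClean = maxCheckpointsToClean;
    }

    /** Returns the request to execute, or empty to drop it and regulate checkpointing. */
    Optional<R> chooseRequestToExecute(R newRequest) {
        if (checkpointsToClean.size() > maxCheckpointsToClean) {
            return Optional.empty(); // back-pressure instead of piling up more state to discard
        }
        return Optional.of(newRequest);
    }
}
{code}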

If you don't mind, I'll go for this implementation proposal in the design doc.

[~roman_khachatryan] thanks anyway for the suggestions, and please take a look 
at the design doc, where we will have the implementation discussions.



> Slow checkpoint cleanup causing OOMs
> ------------------------------------
>
>                 Key: FLINK-17073
>                 URL: https://issues.apache.org/jira/browse/FLINK-17073
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0
>            Reporter: Till Rohrmann
>            Assignee: Etienne Chauchot
>            Priority: Major
>             Fix For: 1.12.0
>
>
> A user reported [1] that he sees a decline in checkpoint cleanup speed when 
> upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup 
> tasks are waiting in the execution queue, occupying memory. Ultimately, the JM 
> process dies with an OOM.
> Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is 
> used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we used the 
> {{AkkaRpcService}} thread pool, which was a {{ForkJoinPool}} with a max 
> parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as 
> CPU cores. This change might have caused the decline in completed checkpoint 
> discard throughput. This suspicion needs to be validated before trying to fix 
> it!
> [1] 
> https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E
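
For illustration only (hypothetical names, not the actual Flink wiring), the suspected executor change described in the report roughly corresponds to:

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ForkJoinPool;

/** Rough illustration of the suspected change in the pool used for checkpoint discards. */
class DiscardExecutors {
    public static void main(String[] args) {
        // Flink 1.7.2 era: cleanup ran on the AkkaRpcService pool, a ForkJoinPool with max parallelism 64
        ExecutorService before = new ForkJoinPool(64);
        // Since FLINK-11851: a dedicated ioExecutor with as many threads as CPU cores
        ExecutorService after = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        before.shutdown();
        after.shutdown();
    }
}
{code}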


