[ 
https://issues.apache.org/jira/browse/FLINK-5142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Ewen reassigned FLINK-5142:
-----------------------------------

    Assignee: Stephan Ewen

> Resource leak in CheckpointCoordinator
> --------------------------------------
>
>                 Key: FLINK-5142
>                 URL: https://issues.apache.org/jira/browse/FLINK-5142
>             Project: Flink
>          Issue Type: Bug
>          Components: State Backends, Checkpointing
>    Affects Versions: 1.1.1, 1.1.2, 1.1.3
>            Reporter: Frank Lauterwald
>            Assignee: Stephan Ewen
>             Fix For: 1.2.0
>
>
> We run Flink 1.1.3 with a fairly aggressive checkpoint interval and a 
> minimum pause between checkpoints to make sure that some work gets done 
> between checkpoints.
> Over time, the JobManager uses more and more CPU time until it saturates the 
> available cores. It does not show heavy I/O load and the task managers seem 
> to work without problems.
> We see lots of log messages of the form "Trying to trigger another checkpoint 
> while one was queued already" - sometimes multiple in the same millisecond.
> It seems like checkpoints are triggered way too often.
> I suspect there is a resource leak in the CheckpointCoordinator which leads 
> to this behavior:
>         
> // in triggerCheckpoint(long timestamp, long nextCheckpointId), line 414ff
> // introduced as part of FLINK-3492
> if (lastTriggeredCheckpoint + minPauseBetweenCheckpoints > timestamp) {
>         if (currentPeriodicTrigger != null) {
>                 currentPeriodicTrigger.cancel();
>                 currentPeriodicTrigger = null;
>         }
>         ScheduledTrigger trigger = new ScheduledTrigger();
>         timer.scheduleAtFixedRate(trigger, minPauseBetweenCheckpoints, baseInterval);
>         return false;
> }
> The newly created trigger is not assigned to currentPeriodicTrigger, so it 
> cannot be cancelled whenever another rescheduling is required.
> If rescheduling is common (it happens several times per minute for us), the 
> running triggers accumulate until they overwhelm the JobManager.
> Versions up to Flink 1.0.x are unaffected because FLINK-3492 is a Flink 1.1 
> feature.
> The issue seems to be already fixed in master by commit 8854d75c due to 
> (unrelated) work on FLINK-4322.
> Let me know if there's anything else I can do to help.
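
The leak described in the report can be reproduced in isolation with a small sketch built on java.util.Timer. Everything here is hypothetical and not Flink code: the class TriggerRescheduleSketch, the keepReference flag, and the live-trigger counter exist only to show the pattern. With keepReference set to false, the sketch mirrors the quoted snippet, where the new trigger is never stored and so can never be cancelled. With it set to true, the reference is kept, which is one possible way to avoid the accumulation and not necessarily what commit 8854d75c does.

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for the rescheduling logic; not the Flink implementation. */
class TriggerRescheduleSketch {

    private final Timer timer = new Timer(true); // daemon thread, so the JVM can exit
    private final boolean keepReference;         // assumption: toggles leaky vs. fixed behaviour
    private final AtomicInteger live = new AtomicInteger();
    private TimerTask currentPeriodicTrigger;

    TriggerRescheduleSketch(boolean keepReference) {
        this.keepReference = keepReference;
    }

    /** Periodic task that decrements the live counter when it is cancelled. */
    private class CountingTrigger extends TimerTask {
        @Override
        public void run() {
            // A real coordinator would trigger a checkpoint here.
        }

        @Override
        public boolean cancel() {
            boolean wasScheduled = super.cancel();
            if (wasScheduled) {
                live.decrementAndGet();
            }
            return wasScheduled;
        }
    }

    /** Mirrors the branch from the report: cancel the known trigger, then schedule a new one. */
    void reschedule(long minPauseMillis, long baseIntervalMillis) {
        if (currentPeriodicTrigger != null) {
            currentPeriodicTrigger.cancel();
            currentPeriodicTrigger = null;
        }
        TimerTask trigger = new CountingTrigger();
        if (keepReference) {
            // Storing the reference lets the next reschedule cancel this trigger.
            currentPeriodicTrigger = trigger;
        }
        timer.scheduleAtFixedRate(trigger, minPauseMillis, baseIntervalMillis);
        live.incrementAndGet();
    }

    int liveTriggers() {
        return live.get();
    }

    void shutdown() {
        timer.cancel();
    }

    public static void main(String[] args) {
        TriggerRescheduleSketch leaky = new TriggerRescheduleSketch(false);
        TriggerRescheduleSketch fixed = new TriggerRescheduleSketch(true);
        for (int i = 0; i < 5; i++) {
            // Long delays so no task fires while the sketch runs.
            leaky.reschedule(60_000L, 60_000L);
            fixed.reschedule(60_000L, 60_000L);
        }
        System.out.println("leaky: " + leaky.liveTriggers() + " live triggers"); // leaky: 5 live triggers
        System.out.println("fixed: " + fixed.liveTriggers() + " live triggers"); // fixed: 1 live triggers
        leaky.shutdown();
        fixed.shutdown();
    }
}
```

In the leaky variant every reschedule adds another periodic task to the timer, which matches the reported symptom of triggers firing more and more often over time.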



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
