[ https://issues.apache.org/jira/browse/FLINK-5142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Stephan Ewen reassigned FLINK-5142:
-----------------------------------

    Assignee: Stephan Ewen

> Resource leak in CheckpointCoordinator
> --------------------------------------
>
>                 Key: FLINK-5142
>                 URL: https://issues.apache.org/jira/browse/FLINK-5142
>             Project: Flink
>          Issue Type: Bug
>          Components: State Backends, Checkpointing
>    Affects Versions: 1.1.1, 1.1.2, 1.1.3
>            Reporter: Frank Lauterwald
>            Assignee: Stephan Ewen
>             Fix For: 1.2.0
>
>
> We run Flink 1.1.3 with a fairly aggressive time between checkpoints and a
> minimum interval between checkpoints to make sure that some work gets done
> between checkpoints.
> Over time, the JobManager uses more and more CPU time until it saturates the
> available cores. It does not show heavy I/O load and the task managers seem
> to work without problems.
> We see lots of log messages of the form "Trying to trigger another checkpoint
> while one was queued already" - sometimes multiple in the same millisecond.
> It seems like checkpoints are triggered way too often.
> I suspect there is a resource leak in the CheckpointCoordinator which leads
> to this behavior:
>
> // in triggerCheckpoint(long timestamp, long nextCheckpointId), line 414ff
> // introduced as part of FLINK-3492
> if (lastTriggeredCheckpoint + minPauseBetweenCheckpoints > timestamp) {
>     if (currentPeriodicTrigger != null) {
>         currentPeriodicTrigger.cancel();
>         currentPeriodicTrigger = null;
>     }
>     ScheduledTrigger trigger = new ScheduledTrigger();
>     timer.scheduleAtFixedRate(trigger, minPauseBetweenCheckpoints,
>         baseInterval);
>     return false;
> }
>
> The newly created trigger is not assigned to currentPeriodicTrigger, so it
> cannot be cancelled whenever another rescheduling is required.
> If rescheduling is common (it happens several times per minute for us), the
> running triggers accumulate until they overwhelm the JobManager.
> Versions up to Flink 1.0.x are unaffected because FLINK-3492 is a Flink 1.1
> feature.
> The issue seems to be already fixed in master by commit 8854d75c due to
> (unrelated) work on FLINK-4322.
> Let me know if there's anything else I can do to help.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
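For illustration, the leak described in the report can be reproduced in isolation with a small sketch. This is not Flink's code and not the fix from commit 8854d75c. It assumes the `timer` in the quoted snippet behaves like a `java.util.Timer`. The names `RescheduleSketch`, `liveAfter`, and `keepReference` are hypothetical and exist only for this example. The sketch counts how many periodic tasks are still scheduled after several reschedulings, once without storing the new trigger (as in the quoted snippet) and once with the reference stored so that the next rescheduling can cancel it.

```java
import java.util.Timer;
import java.util.TimerTask;

// Hypothetical stand-alone sketch; none of these names come from Flink.
class RescheduleSketch {

    /**
     * Reschedules a periodic task the given number of times and returns how
     * many tasks are still scheduled (not cancelled) afterwards.
     *
     * keepReference == false mimics the quoted snippet: the new trigger is
     * never stored, so the cancel branch is never reached.
     */
    static int liveAfter(boolean keepReference, int reschedules) {
        Timer timer = new Timer(true);          // daemon thread, like a background scheduler
        TimerTask currentPeriodicTrigger = null;
        int live = 0;
        long minPause = 60_000L;                // long delays: nothing fires during the demo
        long baseInterval = 60_000L;

        for (int i = 0; i < reschedules; i++) {
            if (currentPeriodicTrigger != null) {
                // cancel() returns true for a repeating task that was still scheduled
                if (currentPeriodicTrigger.cancel()) {
                    live--;
                }
                currentPeriodicTrigger = null;
            }
            TimerTask trigger = new TimerTask() {
                @Override
                public void run() {
                    // a real trigger would start a checkpoint here
                }
            };
            timer.scheduleAtFixedRate(trigger, minPause, baseInterval);
            live++;
            if (keepReference) {
                // the assignment that the report says is missing
                currentPeriodicTrigger = trigger;
            }
        }
        timer.cancel();                         // clean up everything scheduled by the demo
        return live;
    }

    public static void main(String[] args) {
        System.out.println("without reference: " + liveAfter(false, 5));
        System.out.println("with reference:    " + liveAfter(true, 5));
    }
}
```

Without the stored reference, every rescheduling leaves one more periodic task running. With it, at most one task is scheduled at any time, which matches the behavior the reporter expected.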