Hello,

I have a standalone Flink cluster with JobManager HA.
Last night, the JobManager failed over because of a connection timeout to
ZooKeeper.
The job is running successfully under the new leader JobManager, but when
I look at the old leader JobManager's log, it is still trying to re-submit
the job and getting errors (for almost 24 hours now).

Here is the log.

-----
2016-07-27 20:56:09,218 WARN
org.apache.flink.runtime.jobmanager.JobManager                -
Discard message
LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016
20:56:09     Job execution switched to status RESTARTING.) because the
expected leader session ID None did not equal the received leader
session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4).
2016-07-27 20:56:19,218 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
- Recovering checkpoints from ZooKeeper.
2016-07-27 20:56:19,218 WARN
org.apache.flink.runtime.jobmanager.JobManager                -
Discard message
LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016
20:56:19     Job execution switched to status CREATED.) because the
expected leader session ID None did not equal the received leader
session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4).
2016-07-27 20:56:19,219 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
- Found 1 checkpoints in ZooKeeper.
2016-07-27 20:56:19,221 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
- Initialized with Checkpoint 40229 @ 1469620528216 for
978ef000cca5a3aa6f3461274102f82c. Removing all older checkpoints.
2016-07-27 20:56:19,222 WARN
org.apache.flink.runtime.jobmanager.JobManager                -
Discard message
LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016
20:56:19     Job execution switched to status RUNNING.) because the
expected leader session ID None did not equal the received leader
session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4).
2016-07-27 20:56:19,222 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        -
Source: Custom Source (1/3) (bbdf55db0c19cc881c188bc6925929d0)
switched from CREATED to SCHEDULED
2016-07-27 20:56:19,223 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        -
Source: Custom Source (1/3) (bbdf55db0c19cc881c188bc6925929d0)
switched from SCHEDULED to CANCELED
2016-07-27 20:56:19,223 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        -
Source: Custom Source (2/3) (4c795c671ec7b548b5faac5b141c331c)
switched from CREATED to CANCELED
2016-07-27 20:56:19,223 WARN
org.apache.flink.runtime.jobmanager.JobManager                -
Discard message
LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016
20:56:19     Job execution switched to status FAILING.) because the
expected leader session ID None did not equal the received leader
session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4).
2016-07-27 20:56:19,223 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        -
Source: Custom Source (3/3) (fce3b243e5b25041aafabbd93a266dbc)
switched from CREATED to CANCELED
2016-07-27 20:56:19,223 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        -
Source: Custom Source (1/3) (e1e5154f506901539e12b0fe8c140503)
switched from CREATED to CANCELED
2016-07-27 20:56:19,223 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        -
Source: Custom Source (2/3) (f95eb0ad8fcc50e6bb9046e8700e8778)
switched from CREATED to CANCELED
2016-07-27 20:56:19,223 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        -
Source: Custom Source (3/3) (0e30de47933282533cf6dda3a22e7ddc)
switched from CREATED to CANCELED
2016-07-27 20:56:19,223 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat
Map (1/3) (ea260b7740d4ac8262c6500429b0ee6b) switched from CREATED to
CANCELED
2016-07-27 20:56:19,223 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat
Map (2/3) (cc5ab4fc296238d32078d2b4a8eb5062) switched from CREATED to
CANCELED
2016-07-27 20:56:19,223 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat
Map (3/3) (9694ae32fc12ec416197308f6a8cb3c1) switched from CREATED to
CANCELED
2016-07-27 20:56:19,223 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        -
TriggerWindow(GlobalWindows(),
FoldingStateDescriptor{name=window-contents,
defaultValue=ViewerCountHll(0,0,,com.clearspring.analytics.stream.cardinality.HyperLogLogPlus@1),
serializer=null}, LiveContinuousProcessingTimeTriggerGlobal(10000),
WindowedStream.fold(WindowedStream.java:207)) -> Filter -> Map ->
Filter -> Sink: Unnamed (1/3) (9c6b27873b6ddec58ce3f82f62632152)
switched from CREATED to CANCELED
2016-07-27 20:56:19,223 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        -
TriggerWindow(GlobalWindows(),
FoldingStateDescriptor{name=window-contents,
defaultValue=ViewerCountHll(0,0,,com.clearspring.analytics.stream.cardinality.HyperLogLogPlus@1),
serializer=null}, LiveContinuousProcessingTimeTriggerGlobal(10000),
WindowedStream.fold(WindowedStream.java:207)) -> Filter -> Map ->
Filter -> Sink: Unnamed (2/3) (47442827157e04f7e1936ec1d5c876e9)
switched from CREATED to CANCELED
2016-07-27 20:56:19,223 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        -
TriggerWindow(GlobalWindows(),
FoldingStateDescriptor{name=window-contents,
defaultValue=ViewerCountHll(0,0,,com.clearspring.analytics.stream.cardinality.HyperLogLogPlus@1),
serializer=null}, LiveContinuousProcessingTimeTriggerGlobal(10000),
WindowedStream.fold(WindowedStream.java:207)) -> Filter -> Map ->
Filter -> Sink: Unnamed (3/3) (a1436ef922932ffbb38f5c23684a43ec)
switched from CREATED to CANCELED
2016-07-27 20:56:19,223 INFO
org.apache.flink.runtime.executiongraph.restart.FixedDelayRestartStrategy
 - Delaying retry of job execution for 10000 ms ...
2016-07-27 20:56:19,223 WARN
org.apache.flink.runtime.jobmanager.JobManager                -
Discard message
LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016
20:56:19     Job execution switched to status RESTARTING.) because the
expected leader session ID None did not equal the received leader
session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4).
----

Could anyone advise me why this happens and how I can recover from
this situation? (Should I restart the old JobManager?)

Regards,
Hironori Ogibayashi
