[ https://issues.apache.org/jira/browse/FLINK-21472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17295887#comment-17295887 ]
Peng Zhang commented on FLINK-21472:
------------------------------------

[~fly_in_gis] Thanks! I will try Flink 1.12.2 once it is available as a Docker image.

For more information: in our case the FencingTokenException happened when a JobManager was redeployed to another node by K8s, and the new JobManager could not start the jobs from checkpoints:

{{2021-03-04 17:04:44,928 INFO org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Recovering checkpoints from KubernetesStateHandleStore\{configMapName='stellar-flink-cluster-8ea8bb860bdefc3884cd586f4473295a-jobmanager-leader'}.
2021-03-04 17:04:44,928 INFO org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Recovering checkpoints from KubernetesStateHandleStore\{configMapName='stellar-flink-cluster-8ea8bb860bdefc3884cd586f4473295a-jobmanager-leader'}.
2021-03-04 17:04:44,933 INFO org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Found 1 checkpoints in KubernetesStateHandleStore\{configMapName='stellar-flink-cluster-8ea8bb860bdefc3884cd586f4473295a-jobmanager-leader'}.
2021-03-04 17:04:44,933 INFO org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Trying to fetch 1 checkpoints from storage.
2021-03-04 17:04:44,933 INFO org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Trying to retrieve checkpoint 18.
2021-03-04 17:04:44,963 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Restoring job 8ea8bb860bdefc3884cd586f4473295a from Checkpoint 18 @ 1614877356663 for 8ea8bb860bdefc3884cd586f4473295a located at s3a://zalando-stellar-flink-state-eu-central-1-staging/checkpoints/8ea8bb860bdefc3884cd586f4473295a/chk-18.
2021-03-04 17:04:44,964 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - No master state to restore
2021-03-04 17:04:44,965 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Using failover strategy org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy@530feb4d for BrandCollectionTrackingJob (8ea8bb860bdefc3884cd586f4473295a).
2021-03-04 17:04:44,970 INFO org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl [] - JobManager runner for job BrandCollectionTrackingJob (8ea8bb860bdefc3884cd586f4473295a) was granted leadership with session id ecb717f4-089f-48af-8d82-63333f7d4b17 at akka.tcp://flink@stellar-flink-jobmanager:6123/user/rpc/jobmanager_4.
2021-03-04 17:05:09,618 WARN io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] - Exec Failure
java.net.SocketTimeoutException: sent ping but didn't receive pong within 30000ms (after 1 successful ping/pongs)
2021-03-04 17:05:14,990 ERROR org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler [] - Unhandled exception.
org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token mismatch: Ignoring message LocalFencedMessage(9c31a87cf2ff475d049819f3fb9e4cd7, LocalRpcInvocation(requestMultipleJobDetails(Time))) because the fencing token 9c31a87cf2ff475d049819f3fb9e4cd7 did not match the expected fencing token bbc60d6ee1cc9717561f755149454d94.}}

> FencingTokenException: Fencing token mismatch
> ---------------------------------------------
>
>                 Key: FLINK-21472
>                 URL: https://issues.apache.org/jira/browse/FLINK-21472
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.12.1
>            Reporter: hayden zhou
>            Priority: Major
>         Attachments:
> flink--standalonesession-0-mta-flink-jobmanager-864d6c8cbb-rmsxw.log
>
>
> org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler [] - Unhandled
> exception.
> org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token
> mismatch: Ignoring message
> LocalFencedMessage(8fac01d8e3e3988223a2e5c6e3f04f1e,
> LocalRpcInvocation(requestMultipleJobDetails(Time))) because the fencing
> token 8fac01d8e3e3988223a2e5c6e3f04f1e did not match the expected fencing
> token 8c37414f464bca76144e6cabc946474b.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)