Hello, judging from your error, either the checkpoint file could not be found 
or there was a communication problem. Is your Flink on K8s deployment using 
StatefulSet mode? And can you currently locate the checkpoint storage?
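
To help rule out the communication side, here is a minimal sketch of a TCP 
reachability check you could run from inside the JobManager pod. It assumes 
Python is available in the container (it may not be in the stock Flink image), 
and the helper name can_connect is only illustrative. The address in the usage 
comment is the one from the "Association with remote system" warning in your 
log.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "no route to host", connection refused, and timeouts.
        return False


# Example (address taken from the log in the report):
# print(can_connect("10.2.174.188", 6123))
```

If this returns False for the address in the warning, the pod at that IP is 
probably gone or unreachable, which would point to a stale address rather than 
a missing checkpoint file.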

At 2021-03-09 15:23:00, "Peng Zhang (Jira)" <j...@apache.org> wrote:
>Peng Zhang created FLINK-21685:
>----------------------------------
>
>             Summary: Flink JobManager failed to restart in K8S HA setup
>                 Key: FLINK-21685
>                 URL: https://issues.apache.org/jira/browse/FLINK-21685
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.12.2, 1.12.1
>            Reporter: Peng Zhang
>         Attachments: flink-ha.log
>
>We use a Flink K8S session cluster in HA mode (1 JobManager and 4 
>TaskManagers). When jobs are running and the JobManager restarts, the 
>JobManager fails to recover the job from the checkpoint.
>
> 
>
>{{
>2021-03-08 13:16:42,962 INFO  org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Trying to fetch 1 checkpoints from storage.
>2021-03-08 13:16:42,962 INFO  org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Trying to fetch 1 checkpoints from storage.
>2021-03-08 13:16:42,962 INFO  org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Trying to retrieve checkpoint 1.
>2021-03-08 13:16:43,014 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Restoring job 9a534b2e309b24f78866b65d94082ead from Checkpoint 1 @ 1615208258041 for 9a534b2e309b24f78866b65d94082ead located at s3a://zalando-stellar-flink-state-eu-central-1-staging/checkpoints/9a534b2e309b24f78866b65d94082ead/chk-1.
>2021-03-08 13:16:43,023 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - No master state to restore
>2021-03-08 13:16:43,024 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Using failover strategy org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy@58d927d2 for BrandCollectionTrackingJob (9a534b2e309b24f78866b65d94082ead).
>2021-03-08 13:16:43,046 INFO  org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl      [] - JobManager runner for job BrandCollectionTrackingJob (9a534b2e309b24f78866b65d94082ead) was granted leadership with session id c258d8ce-69d3-49df-8bee-1b748d5bbe74 at akka.tcp://flink@10.2.179.12:6123/user/rpc/jobmanager_2.
>2021-03-08 13:16:43,060 WARN  akka.remote.transport.netty.NettyTransport                   [] - Remote connection to [null] failed with java.net.NoRouteToHostException: No route to host
>2021-03-08 13:16:43,060 WARN  akka.remote.ReliableDeliverySupervisor                       [] - Association with remote system [akka.tcp://flink@10.2.174.188:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@10.2.174.188:6123]] Caused by: [java.net.NoRouteToHostException: No route to host]
>}}
>
> 
>
>Attached is the log, and our configuration.
>
> 
>
>
>
>--
>This message was sent by Atlassian Jira
>(v8.3.4#803005)
