When this happens, one of the workers appears to fail while the rest of the workers continue to run. How can I configure the app so that it recovers completely from the last successful checkpoint when this happens?
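For context, the checkpointing side of the job is set up roughly along the lines of the sketch below. This is a minimal sketch rather than the actual job code; the S3 checkpoint path, class name, and the externalized-checkpoint retention setting are illustrative assumptions, not taken from the quoted message:

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.CheckpointConfig;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointSetupSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // 2-minute checkpoint interval and timeout, as in the quoted message
            env.enableCheckpointing(2 * 60 * 1000, CheckpointingMode.EXACTLY_ONCE);
            env.getCheckpointConfig().setCheckpointTimeout(2 * 60 * 1000);

            // RocksDB state backend checkpointing to S3 (placeholder bucket/path)
            env.setStateBackend(new RocksDBStateBackend("s3://my-bucket/flink-checkpoints"));

            // Retain completed checkpoints so the job can be resumed from the
            // last successful checkpoint after a failure or cancellation
            env.getCheckpointConfig().enableExternalizedCheckpoints(
                    CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

            // ... Kafka source (400 partitions) -> bucketing sink to S3 ...
            env.execute("kafka-to-s3");
        }
    }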
‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Monday, December 3, 2018 11:02 AM, Flink Developer <developer...@protonmail.com> wrote:

> I have a Flink app on 1.5.2 which sources data from a Kafka topic (400
> partitions) and runs with 400 parallelism. The sink is a bucketing sink to S3,
> and the job uses RocksDB. The checkpoint interval is 2 min and the checkpoint
> timeout is 2 min. The checkpoint size is a few MB. After running for a few
> days, I see:
>
> org.apache.flink.runtime.executiongraph.ExecutionGraph - Error in failover
> strategy - falling back to global restart
> java.lang.ClassCastException:
> com.amazonaws.services.s3.model.AmazonS3Exception cannot be cast to
> com.amazonaws.AmazonClientException
>     at org.apache.hadoop.fs.s3a.AWSClientIOException.getCause(AWSClientIOException.java:42)
>     at org.apache.flink.util.SerializedThrowable
>     at org.apache.flink.runtime.executiongraph.ExecutionGraph.notifyJobStatus()
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:247)
>     at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>
> What causes the exception, and why is the Flink job unable to recover? It
> states it is falling back to global restart. How can this be configured to
> recover properly? Is the checkpoint interval/timeout too low? The Flink job's
> configuration shows Restart with fixed delay (0 ms), #2147483647 restart
> attempts.
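For what it's worth, the "Restart with fixed delay (0 ms), #2147483647 restart attempts" shown in the job configuration corresponds to Flink's fixed-delay restart strategy with its default values. A minimal sketch of setting a nonzero delay and a bounded number of attempts instead, so the job is not restarted in a tight loop against S3; the attempt count and delay below are illustrative placeholders, not values taken from this job:

    import org.apache.flink.api.common.restartstrategy.RestartStrategies;
    import org.apache.flink.api.common.time.Time;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class RestartStrategySketch {
        public static void main(String[] args) {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Fixed-delay restart: a bounded number of attempts with a pause between
            // them, instead of Integer.MAX_VALUE attempts with a 0 ms delay
            env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
                    10,                  // illustrative: give up after 10 consecutive failures
                    Time.seconds(30)));  // illustrative: wait 30 s before each restart attempt
        }
    }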