I have a Flink app on 1.5.2 that sources data from a Kafka topic (400 partitions) and runs with a parallelism of 400. The sink is a BucketingSink writing to S3, and the state backend is RocksDB. The checkpoint interval is 2 minutes, the checkpoint timeout is 2 minutes, and checkpoints are a few MB in size. After running for a few days, I see:
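For reference, the checkpointing and sink setup looks roughly like the following (a minimal sketch, not the actual job; the class name and S3 paths are placeholders):

```java
import java.util.concurrent.TimeUnit;

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;

public class JobSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(400);

        // Checkpoint every 2 minutes; fail any checkpoint that runs longer than 2 minutes.
        env.enableCheckpointing(TimeUnit.MINUTES.toMillis(2));
        env.getCheckpointConfig().setCheckpointTimeout(TimeUnit.MINUTES.toMillis(2));

        // RocksDB state backend with checkpoints on S3 (path is a placeholder).
        env.setStateBackend(new RocksDBStateBackend("s3://my-bucket/flink/checkpoints"));

        // BucketingSink writing to S3 (path is a placeholder).
        BucketingSink<String> sink = new BucketingSink<>("s3://my-bucket/flink/output");

        // ... Kafka source, transformations, addSink(sink), env.execute(...) omitted.
    }
}
```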
```
org.apache.flink.runtime.executiongraph.ExecutionGraph - Error in failover strategy - falling back to global restart
java.lang.ClassCastException: com.amazonaws.services.s3.model.AmazonS3Exception cannot be cast to com.amazonaws.AmazonClientException
    at org.apache.hadoop.fs.s3a.AWSClientIOException.getCause(AWSClientIOException.java:42)
    at org.apache.flink.util.SerializedThrowable
    at org.apache.flink.runtime.executiongraph.ExecutionGraph.notifyJobStatus()
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:247)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
```

What causes this exception, and why is the Flink job unable to recover? The log says it is falling back to a global restart. How can the job be configured to recover properly? Are the checkpoint interval/timeout too low? The job's configuration shows "Restart with fixed delay (0 ms), #2147483647 restart attempts".
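For context, I believe that restart setting corresponds to Flink's default fixed-delay restart strategy. Here is a minimal sketch of what setting it explicitly looks like, along with a bounded alternative, assuming the `RestartStrategies` API (the alternative's values are illustrative, not a known fix):

```java
import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RestartStrategySketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // What the job currently reports: retry up to Integer.MAX_VALUE
        // (2147483647) times with a 0 ms delay between attempts.
        env.setRestartStrategy(
                RestartStrategies.fixedDelayRestart(Integer.MAX_VALUE, Time.of(0, TimeUnit.MILLISECONDS)));

        // A bounded alternative with some backoff, in case immediate restarts
        // against S3 are part of the problem (values are illustrative).
        env.setRestartStrategy(
                RestartStrategies.fixedDelayRestart(10, Time.of(30, TimeUnit.SECONDS)));
    }
}
```

Would a bounded strategy with a delay like this be the right way to let the job recover, or is the ClassCastException in the failover path the real issue?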