I have a Flink app on 1.5.2 that sources data from a Kafka topic (400 
partitions) and runs with a parallelism of 400. The sink is a BucketingSink to 
S3, and the state backend is RocksDB. The checkpoint interval is 2 min and the 
checkpoint timeout is 2 min; checkpoint size is a few MB. After running for a 
few days, I see:

org.apache.flink.runtime.executiongraph.ExecutionGraph - Error in failover 
strategy - falling back to global restart
java.lang.ClassCastException: com.amazonaws.services.s3.model.AmazonS3Exception 
cannot be cast to com.amazonaws.AmazonClientException
    at org.apache.hadoop.fs.s3a.AWSClientIOException.getCause(AWSClientIOException.java:42)
    at org.apache.flink.util.SerializedThrowable
    at org.apache.flink.runtime.executiongraph.ExecutionGraph.notifyJobStatus()
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:247)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:234)

What causes this exception, and why is the Flink job unable to recover? The log 
says it is falling back to a global restart. How can the job be configured to 
recover properly? Are the checkpoint interval/timeout too low? The Flink job's 
configuration shows "Restart with fixed delay (0 ms), 2147483647 restart 
attempts".
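For reference, here is roughly what the configuration described above looks like in the Flink 1.5 DataStream API, with the restart strategy changed from "fixed delay (0 ms), Integer.MAX_VALUE attempts" to something less aggressive. The 10-attempt / 30-second values are purely illustrative, not a recommendation:

```java
import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class JobConfigSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpointing as described in the question: 2 min interval, 2 min timeout.
        env.enableCheckpointing(TimeUnit.MINUTES.toMillis(2));
        env.getCheckpointConfig().setCheckpointTimeout(TimeUnit.MINUTES.toMillis(2));

        // Instead of restarting immediately (0 ms delay) an effectively unbounded
        // number of times, add a delay so a flaky S3 endpoint has time to recover.
        env.setRestartStrategy(
            RestartStrategies.fixedDelayRestart(10, Time.of(30, TimeUnit.SECONDS)));
    }
}
```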
