When this happens, one of the workers appears to fail while the rest continue
to run. How can I configure the app so that it recovers completely from the
last successful checkpoint when this happens?
‐‐‐ Original Message ‐‐‐
On Monday, December 3,
I have a Flink app on 1.5.2 that reads from a Kafka topic (400 partitions) and
runs with a parallelism of 400. The sink is a bucketing sink writing to S3, and
the app uses RocksDB. The checkpoint interval is 2 min and the checkpoint
timeout is also 2 min. Checkpoint size is a few MB. After running for a few
days, I see:
Org