Hello Kostas, Thanks for your time.
I started the job from scratch with the checkpoint interval set to 15 minutes. The first 13 checkpoints completed successfully, and failures only began with the 14th. I waited for about 20 more checkpoints, but all of them failed. I then cancelled the job and restored from the last successful checkpoint, after which there were no more issues.

Today I tried again, restoring from the last successful checkpoint from yesterday. Result: I got the same error starting from the first checkpoint after the restore. I cancelled and restored again, and there have been no issues since (35 more checkpoints so far).

Regarding my job: I have 6 different S3 file source streams unioned together and then connected to a 7th S3 file source stream, which is a broadcast stream. The sinks are S3 Parquet files and Elasticsearch. Checkpointing is incremental and uses RocksDB. The broadcast stream is one of the new changes to my job. The previous version, with 4 of those 6 sources, had been running for more than a month without any issue.

The TM/JM logs for today's first run (the failing one) are attached. The YARN/EMR cluster is dedicated to this job.

I have a feeling that the issue comes from the broadcast stream (as mentioned in the documentation, it doesn't use RocksDB), but I'm not sure.

Thanks and regards,
Averell

logs.gz <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1586/logs.gz>