Hi Dan,

You can check the "Fix Versions" field of FLINK-16753 [1]: it shows that this bug was fixed in 1.11.3, so 1.11.1 does not include the fix.
[1] https://issues.apache.org/jira/browse/FLINK-16753

Best,
Yun Tang

________________________________
From: Dan Hill <quietgol...@gmail.com>
Sent: Tuesday, April 27, 2021 7:50
To: Yun Tang <myas...@live.com>
Cc: Robert Metzger <rmetz...@apache.org>; user <user@flink.apache.org>
Subject: Re: Checkpoint error - "The job has failed"

Hey Yun and Robert,

I'm using Flink v1.11.1. Robert, I'll send you a separate email with the logs.

On Mon, Apr 26, 2021 at 12:46 AM Yun Tang <myas...@live.com> wrote:

Hi Dan,

I think you might be using an older version of Flink. This problem has been resolved by FLINK-16753 [1] since Flink 1.10.3.

[1] https://issues.apache.org/jira/browse/FLINK-16753

Best,
Yun Tang

________________________________
From: Robert Metzger <rmetz...@apache.org>
Sent: Monday, April 26, 2021 14:46
To: Dan Hill <quietgol...@gmail.com>
Cc: user <user@flink.apache.org>
Subject: Re: Checkpoint error - "The job has failed"

Hi Dan,

Can you provide me with the JobManager logs so I can take a look as well? (They will also tell me which Flink version you are using.)

On Mon, Apr 26, 2021 at 7:20 AM Dan Hill <quietgol...@gmail.com> wrote:

My Flink job failed to checkpoint with a "The job has failed" error. The logs contained no other recent errors. I keep hitting the error even if I cancel the jobs and restart them. When I restarted my JobManager and TaskManager, the error went away.

What error am I hitting? It looks like there is bad state that lives outside the scope of a job. How often do people restart their JobManagers and TaskManagers to deal with errors like this?
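Yun Tang's suggestion, comparing the running Flink version against the "Fix Versions" of FLINK-16753, can be sketched as a small version check. This is a minimal, hypothetical helper (`FixVersionCheck` and `includesFix` are illustrative names, not part of any Flink API). It assumes plain `major.minor.patch` version strings and uses only the fix versions mentioned in this thread (1.10.3 and 1.11.3); release lines not listed are conservatively reported as unconfirmed rather than guessed.

```java
import java.util.List;

public class FixVersionCheck {

    // Parses "major.minor.patch" into three ints, e.g. "1.11.3" -> {1, 11, 3}.
    static int[] parse(String version) {
        String[] parts = version.split("\\.");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected major.minor.patch: " + version);
        }
        return new int[] {
            Integer.parseInt(parts[0]), Integer.parseInt(parts[1]), Integer.parseInt(parts[2])
        };
    }

    // Returns true only if 'running' is on the same major.minor line as one of
    // the fix versions and its patch number is at least that fix's patch number.
    // Lines not listed in fixVersions return false (not confirmed), because the
    // "Fix Versions" list passed in says nothing about them.
    static boolean includesFix(String running, List<String> fixVersions) {
        int[] r = parse(running);
        for (String fix : fixVersions) {
            int[] f = parse(fix);
            if (r[0] == f[0] && r[1] == f[1] && r[2] >= f[2]) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Assumption: fix versions as stated in this thread for FLINK-16753.
        List<String> fixVersions = List.of("1.10.3", "1.11.3");
        System.out.println("1.11.1 includes fix: " + includesFix("1.11.1", fixVersions));
        System.out.println("1.11.3 includes fix: " + includesFix("1.11.3", fixVersions));
    }
}
```

With Dan's version (1.11.1) the check reports no fix, which matches the conclusion in the latest reply; the authoritative source remains the JIRA issue's "Fix Versions" field.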