Re: Uncaught exception in FatalExitExceptionHandler causing JM crash while canceling job

2021-01-15 Thread Khachatryan Roman
I think you're right, Till, this is the problem. In fact, I opened a duplicate Jira ticket in parallel :) I hope we can fix it in the next version of 1.12. Regards, Roman On Fri, Jan 15, 2021 at 2:09 PM Till Rohrmann wrote: > Thanks for reporting and analyzing this issue Kelly. I think you ar

Re: Uncaught exception in FatalExitExceptionHandler causing JM crash while canceling job

2021-01-15 Thread Till Rohrmann
Thanks for reporting and analyzing this issue, Kelly. I think you are indeed running into a Flink bug. I think the problem is the following: With Flink 1.12.0 [1] we introduced a throttling mechanism for discarding checkpoints. The way it is implemented is that once a checkpoint is discarded it can

Uncaught exception in FatalExitExceptionHandler causing JM crash while canceling job

2021-01-13 Thread Kelly Smith
Hi folks, I recently upgraded to Flink 1.12.0 and I’m hitting an issue where my JM crashes while cancelling a job. This causes the Kubernetes readiness probes to fail and the JM to be restarted; the JM then gets into a bad state while it tries to recover itself using ZK + a checkpoint which no longe