Hi John,
This is definitely not how Flink should behave in this situation and could
indicate a bug. I couldn't pinpoint the problem from the logs. Would it
be possible to get the full TM and JM logs at DEBUG log level? That would
help me debug the problem further.
Cheers,
Till
Is this a known issue? Should I create a Jira ticket? Does anyone have
anything they would like me to try? I’m very lost at this point.
I’ve now seen this issue happen without destroying pods, i.e. the running job
crashes after several hours and fails to recover once all task slots are
consu