Hi Devs! We noticed a very strange failure scenario a few times recently with the Native Kubernetes integration.
The issue is triggered by a heartbeat timeout (a temporary network problem). We observe the following behaviour:

===================================
3 pods (1 JM, 2 TMs), Flink 1.15 (Kubernetes Native Integration):

1. Temporary network problem
   - Heartbeat failure: TM1 loses the JM connection and the JM loses the TM1 connection.
   - Both the JM and TM1 trigger the job failure on their sides and cancel the tasks.
   - The JM releases TM1's slots.

2. While the job is failing/cancelling, the network connection recovers and TM1 reconnects to the JM:

   *TM1: Resolved JobManager address, beginning registration*

3. The JM tries to resubmit the job using TM1 + TM2, but the scheduler keeps failing because it cannot allocate all the required resources:

   *NoResourceAvailableException: Slot request bulk is not fulfillable! Could not allocate the required slot within slot request timeout*

On TM1 we see the following logs repeating (multiple times every few seconds, until the slot request times out after 5 minutes):

   *Receive slot request ... for job ... from resource manager with leader id ...*
   *Allocated slot for ...*
   *Receive slot request ... for job ... from resource manager with leader id ...*
   *Allocated slot for ...*
   *Free slot TaskSlot(index:0, state:ALLOCATED, resource profile: ResourceProfile{...}, allocationId: ..., jobId: ...).*

While all of this is happening on TM1, we don't see any allocation-related INFO logs on TM2.
===================================

Something odd seems to happen when TM1 reconnects after the heartbeat loss. I feel that the JM should probably shut down the TM and create a new one, but instead it gets stuck.

Any ideas what could be happening here?

Thanks,
Gyula
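
P.S. For reference, these are the timeouts involved in the scenario above, as one would set them in flink-conf.yaml. This is only a sketch with the Flink 1.15 defaults as I understand them, in case anyone wants to lower them to reproduce the issue faster:

   heartbeat.interval: 10000       # ms between heartbeats (default)
   heartbeat.timeout: 50000        # ms without a heartbeat before the JM/TM consider each other lost (default)
   slot.request.timeout: 300000    # ms before pending slot requests fail, i.e. the 5 minute NoResourceAvailableException above (default)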