Hi Devs!

We noticed a very strange failure scenario a few times recently with the
Native Kubernetes integration.

The issue is triggered by a heartbeat timeout (a temporary network
problem). We observe the following behaviour:

===================================
3 pods (1 JM, 2 TMs), Flink 1.15 (Kubernetes Native Integration):

1. Temporary network problem
 - Heartbeat failure, TM1 loses JM connection and JM loses TM1 connection.
 - Both the JM and TM1 fail the job on their side and cancel the tasks
 - JM releases TM1 slots

2. While failing/cancelling the job, the network connection recovers and
TM1 reconnects to JM:
*TM1: Resolved JobManager address, beginning registration*

3. JM tries to resubmit the job using TM1 + TM2 but the scheduler keeps
failing as it cannot seem to allocate all the resources:

*NoResourceAvailableException: Slot request bulk is not fulfillable! Could
not allocate the required slot within slot request timeout*
On TM1 we see the following logs repeating (multiple times every few
seconds until the slot request times out after 5 minutes; the relevant
timeouts are sketched after this block):
*Receive slot request ... for job ... from resource manager with leader id
...*
*Allocated slot for ...*
*Receive slot request ... for job ... from resource manager with leader id
...*
*Allocated slot for ....*
*Free slot TaskSlot(index:0, state:ALLOCATED, resource profile:
ResourceProfile{...}, allocationId: ..., jobId: ...).*

While all of this is happening on TM1 we don't see any allocation-related
INFO logs on TM2.
===================================
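
For reference, here is a minimal sketch of the timeouts I think are involved.
The keys are the standard Flink config options (the same ones you would set
in flink-conf.yaml); the values are what I understand to be the 1.15
defaults, so treat them as assumptions rather than our exact setup:

import org.apache.flink.configuration.Configuration;

public class TimeoutSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // JM <-> TM heartbeats: these control how quickly the failure in
        // step 1 is detected after the temporary network problem.
        conf.setString("heartbeat.interval", "10000");
        conf.setString("heartbeat.timeout", "50000");
        // Pending slot requests give up after this, which matches the
        // 5 minutes until the NoResourceAvailableException in step 3.
        conf.setString("slot.request.timeout", "300000");
        System.out.println(conf);
    }
}

The allocate/free loop on TM1 repeats every few seconds for the whole
slot.request.timeout window before the scheduler gives up with the exception
above.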

Something strange seems to happen when TM1 reconnects after the heartbeat
loss. I feel that the JM should probably shut down the TM and create a new
one, but instead it gets stuck.

Any ideas what could be happening here?

Thanks
Gyula
