Re: Job Recovery Time on TM Lost

Lu Niu Tue, 29 Jun 2021 21:50:08 -0700

Hi, Gen

Thanks for replying! The reasoning overall makes sense. But in this case,
when JM sends a cancel request to a killed TM,  why the call timeout 30s
instead of returning "connection refused" immediately?


Best
Lu

On Tue, Jun 29, 2021 at 7:52 PM Gen Luo <luogen...@gmail.com> wrote:

> Hi Lu,
>
> We found almost the same thing when we were trying failover in a large
> scale job. The akka.ask.timeout and heartbeat.timeout were set to 10min for
> the test, and we found that the job would take 10min to recover from TM
> lost.
>
> We reached the conclusion that the behavior is expected in the Flink
> version we used(1.13), and it should also apply to previous versions.
> Here's how we think things are happening.
>
> Flink JM uses heartbeat to check TM status. When a TM is lost, an upstream
> or downstream TM will first sense it, while JM is not aware until the
> heartbeat of the lost TM is timeout. The upstream/downstream TM then sends
> an exception to JM, and JM will start canceling the job by sending cancel
> requests to all TMs. However, the lost TM can't respond to the request, and
> JM has to wait until the cancel request(via akka) is timeout before it can
> mark the TM failed and continue the failover procedure.
>
> We suppose that the phase from CANCELING to CANCELED takes
> min(akka.ask.timeout, heartbeat.timeout), though not confirmed yet.
>
> Hope it helps. Please let me know if there's anything wrong.
>
> Lu Niu <qqib...@gmail.com> 于2021年6月30日周三 上午8:45写道：
>
>> Hi, Flink Users
>>
>> We(Pinterest) are trying to speed up recovery speed when flink jobs hit
>> one-time exceptions. To understand the baseline, the first test we do is to
>> randomly kill one TM container and watch for how fast the flink job can
>> recover. We did such test to multiple jobs and here are some findings:
>>
>>    - The whole recovery stage can break down into two phases:
>>       - Phase 1: job state switched from RUNNING -> RESTARTING ->
>>       RUNNING. All tasks switch from RUNNING -> CANCELING -> CANCELED.
>>       - Phase 2: All tasks switch from CREATED -> SCHEDULED -> DEPLOYING
>>       -> RUNNING.
>>    - Phase 1 always takes around 30s and Phase 2 takes around 10 - 15s.
>>
>> Question:
>> Why does Phase 1 always take about 30s? I shared related logs about 2
>> jobs showing that. Does it have sth to do with akka config?
>>
>> Our setup:
>> flink version 1.11
>> running yarn per-job mode
>> akka.ask.timeout: 30 s
>> akka.lookup.timeout: 30 s
>> akka.tcp.timeout: 30 s
>>
>> Best
>> Lu
>>
>>

Re: Job Recovery Time on TM Lost

Reply via email to