Hi, Gen Thanks for replying! The reasoning overall makes sense. But in this case, when JM sends a cancel request to a killed TM, why the call timeout 30s instead of returning "connection refused" immediately?
Best Lu On Tue, Jun 29, 2021 at 7:52 PM Gen Luo <luogen...@gmail.com> wrote: > Hi Lu, > > We found almost the same thing when we were trying failover in a large > scale job. The akka.ask.timeout and heartbeat.timeout were set to 10min for > the test, and we found that the job would take 10min to recover from TM > lost. > > We reached the conclusion that the behavior is expected in the Flink > version we used(1.13), and it should also apply to previous versions. > Here's how we think things are happening. > > Flink JM uses heartbeat to check TM status. When a TM is lost, an upstream > or downstream TM will first sense it, while JM is not aware until the > heartbeat of the lost TM is timeout. The upstream/downstream TM then sends > an exception to JM, and JM will start canceling the job by sending cancel > requests to all TMs. However, the lost TM can't respond to the request, and > JM has to wait until the cancel request(via akka) is timeout before it can > mark the TM failed and continue the failover procedure. > > We suppose that the phase from CANCELING to CANCELED takes > min(akka.ask.timeout, heartbeat.timeout), though not confirmed yet. > > Hope it helps. Please let me know if there's anything wrong. > > Lu Niu <qqib...@gmail.com> 于2021年6月30日周三 上午8:45写道: > >> Hi, Flink Users >> >> We(Pinterest) are trying to speed up recovery speed when flink jobs hit >> one-time exceptions. To understand the baseline, the first test we do is to >> randomly kill one TM container and watch for how fast the flink job can >> recover. We did such test to multiple jobs and here are some findings: >> >> - The whole recovery stage can break down into two phases: >> - Phase 1: job state switched from RUNNING -> RESTARTING -> >> RUNNING. All tasks switch from RUNNING -> CANCELING -> CANCELED. >> - Phase 2: All tasks switch from CREATED -> SCHEDULED -> DEPLOYING >> -> RUNNING. >> - Phase 1 always takes around 30s and Phase 2 takes around 10 - 15s. >> >> Question: >> Why does Phase 1 always take about 30s? I shared related logs about 2 >> jobs showing that. Does it have sth to do with akka config? >> >> Our setup: >> flink version 1.11 >> running yarn per-job mode >> akka.ask.timeout: 30 s >> akka.lookup.timeout: 30 s >> akka.tcp.timeout: 30 s >> >> Best >> Lu >> >>