Yes, time is the main factor when detecting the TM's liveness. The count-based method checks at fixed intervals, so the attempt count together with the interval effectively bounds the detection time.
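To make that concrete, here is a minimal sketch of such a count-based confirmation. This is not the code of our inner fork; LivenessProbe, probe() and confirmAlive() are made-up names, and a real implementation would of course run asynchronously on the JobMaster side rather than blocking. It only illustrates that the attempt count times the probe interval is the effective detection-time budget:

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TaskManagerLivenessCheck {

    /** Hypothetical hook that sends one probe RPC and completes when the TM answers. */
    interface LivenessProbe {
        CompletableFuture<Void> probe(String taskManagerId);
    }

    private final LivenessProbe probe;
    private final int maxAttempts;        // e.g. 3 attempts ...
    private final Duration probeInterval; // ... every 10 s => 30 s effective time budget

    TaskManagerLivenessCheck(LivenessProbe probe, int maxAttempts, Duration probeInterval) {
        this.probe = probe;
        this.maxAttempts = maxAttempts;
        this.probeInterval = probeInterval;
    }

    /**
     * Called when a task reports that its connection to the given TM was lost.
     * Returns true if the TM answered any probe (the loss was transient), false if
     * every attempt failed (the caller should release the TM and fail its tasks).
     */
    boolean confirmAlive(String taskManagerId) throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                // Each attempt is bounded by the probe interval, so the whole check
                // takes at most maxAttempts * probeInterval.
                probe.probe(taskManagerId).get(probeInterval.toMillis(), TimeUnit.MILLISECONDS);
                return true;
            } catch (TimeoutException e) {
                // The probe already waited a full interval; retry immediately.
            } catch (ExecutionException e) {
                // The probe failed fast (e.g. connection refused); wait out the
                // interval before the next attempt.
                Thread.sleep(probeInterval.toMillis());
            }
        }
        return false; // All attempts failed: treat the TaskManager as lost.
    }
}
```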
Gen Luo <luogen...@gmail.com> 于2021年7月9日周五 上午10:37写道: > @刘建刚 > Welcome to join the discuss and thanks for sharing your experience. > > I have a minor question. In my experience, network failures in a certain > cluster usually takes a time to recovery, which can be measured as p99 to > guide configuring. So I suppose it would be better to use time than attempt > count as the configuration for confirming TM liveness. How do you think > about this? Or is the premise right according to your experience? > > @Lu Niu <qqib...@gmail.com> > > Does that mean the akka timeout situation we talked above doesn't apply > to flink 1.11? > > I suppose it's true. According to the reply from Till in FLINK-23216 > <https://issues.apache.org/jira/browse/FLINK-23216>, it should be > confirmed that the problem is introduced by declarative resource > management, which is introduced to Flink in 1.12. > > In previous versions, although JM still uses heartbeat to check TMs > status, RM will tell JM about TM lost once it is noticed by Yarn. This is > much faster than JM's heartbeat mechanism, if one uses default heartbeat > configurations. However, after 1.12 with declarative resource management, > RM will no longer tell this to JM, since it doesn't have a related > AllocationID. So the heartbeat mechanism becomes the only way JM can know > about TM lost. > > On Fri, Jul 9, 2021 at 6:34 AM Lu Niu <qqib...@gmail.com> wrote: > >> Thanks everyone! This is a great discussion! >> >> 1. Restarting takes 30s when throwing exceptions from application code >> because the restart delay is 30s in config. Before lots of related config >> are 30s which lead to the confusion. I redo the test with config: >> >> FixedDelayRestartBackoffTimeStrategy(maxNumberRestartAttempts=2147483647, >> backoffTimeMS=1000) >> heartbeat.timeout: 500000 >> akka.ask.timeout 30 s >> akka.lookup.timeout 30 s >> akka.tcp.timeout 30 s >> akka.watch.heartbeat.interval 30 s >> akka.watch.heartbeat.pause 120 s >> >> Now Phase 1 drops down to 2s now and phase 2 takes 13s. The whole >> restart takes 14s. Does that mean the akka timeout situation we talked >> above doesn't apply to flink 1.11? >> >> 2. About flaky connection between TMs, we did notice sometimes exception >> as follows: >> ``` >> TaskFoo switched from RUNNING to FAILED on >> container_e02_1599158147594_156068_01_000038 @ >> xenon-prod-001-20200316-data-slave-prod-0a0201a4.ec2.pin220.com >> (dataPort=40957). >> org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: >> Connection unexpectedly closed by remote task manager ' >> xenon-prod-001-20200316-data-slave-prod-0a020f7a.ec2.pin220.com/10.2.15.122:33539'. >> This might indicate that the remote task manager was lost. 
>> at >> org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.channelInactive(CreditBasedPartitionRequestClientHandler.java:144) >> at >> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257) >> at >> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243) >> at >> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:236) >> at >> org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81) >> at >> org.apache.flink.runtime.io.network.netty.NettyMessageClientDecoderDelegate.channelInactive(NettyMessageClientDecoderDelegate.java:97) >> at >> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257) >> at >> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243) >> at >> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:236) >> at >> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1416) >> at >> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257) >> at >> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243) >> at >> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:912) >> at >> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:816) >> at >> org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) >> at >> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:416) >> at >> org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:331) >> at >> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918) >> at >> org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) >> at java.lang.Thread.run(Thread.java:748) >> ``` >> 1. It's a bit inconvenient to debug such an exception because it doesn't >> report the exact container id. Right now we have to look for ` >> xenon-prod-001-20200316-data-slave-prod-0a020f7a.ec2.pin220.com/10.2.15.122:33539` >> <http://xenon-prod-001-20200316-data-slave-prod-0a020f7a.ec2.pin220.com/10.2.15.122:33539> >> in JobMananger log to find that. >> 2. The task manager log doesn't show anything suspicious. Also, no major >> GC. So it might imply a flack connection in this case. >> 3. Is there any short term workaround we can try? any config tuning? >> Also, what's the long term solution? >> >> Best >> Lu >> >> >> >> >> On Tue, Jul 6, 2021 at 11:45 PM 刘建刚 <liujiangangp...@gmail.com> wrote: >> >>> It is really helpful to find the lost container quickly. 
In our inner >>> flink version, we optimize it by task's report and jobmaster's probe. When >>> a task fails because of the connection, it reports to the jobmaster. The >>> jobmaster will try to confirm the liveness of the unconnected >>> taskmanager for certain times by config. If the jobmaster find the >>> taskmanager unconnected or dead, it releases the taskmanger. This will work >>> for most cases. For an unstable environment, config needs adjustment. >>> >>> Gen Luo <luogen...@gmail.com> 于2021年7月6日周二 下午8:41写道: >>> >>>> Yes, I have noticed the PR and commented there with some consideration >>>> about the new option. We can discuss further there. >>>> >>>> On Tue, Jul 6, 2021 at 6:04 PM Till Rohrmann <trohrm...@apache.org> >>>> wrote: >>>> >>>> > This is actually a very good point Gen. There might not be a lot to >>>> gain >>>> > for us by implementing a fancy algorithm for figuring out whether a >>>> TM is >>>> > dead or not based on failed heartbeat RPCs from the JM if the TM <> TM >>>> > communication does not tolerate failures and directly fails the >>>> affected >>>> > tasks. This assumes that the JM and TM run in the same environment. >>>> > >>>> > One simple approach could be to make the number of failed heartbeat >>>> RPCs >>>> > until a target is marked as unreachable configurable because what >>>> > represents a good enough criterion in one user's environment might >>>> produce >>>> > too many false-positives in somebody else's environment. Or even >>>> simpler, >>>> > one could say that one can disable reacting to a failed heartbeat RPC >>>> as it >>>> > is currently the case. >>>> > >>>> > We currently have a discussion about this on this PR [1]. Maybe you >>>> wanna >>>> > join the discussion there and share your insights. >>>> > >>>> > [1] https://github.com/apache/flink/pull/16357 >>>> > >>>> > Cheers, >>>> > Till >>>> > >>>> > On Tue, Jul 6, 2021 at 4:37 AM Gen Luo <luogen...@gmail.com> wrote: >>>> > >>>> >> I know that there are retry strategies for akka rpc frameworks. I was >>>> >> just considering that, since the environment is shared by JM and >>>> TMs, and >>>> >> the connections among TMs (using netty) are flaky in unstable >>>> environments, >>>> >> which will also cause the job failure, is it necessary to build a >>>> >> strongly guaranteed connection between JM and TMs, or it could be as >>>> flaky >>>> >> as the connections among TMs? >>>> >> >>>> >> As far as I know, connections among TMs will just fail on their first >>>> >> connection loss, so behaving like this in JM just means "as flaky as >>>> >> connections among TMs". In a stable environment it's good enough, >>>> but in an >>>> >> unstable environment, it indeed increases the instability. IMO, >>>> though a >>>> >> single connection loss is not reliable, a double check should be good >>>> >> enough. But since I'm not experienced with an unstable environment, >>>> I can't >>>> >> tell whether that's also enough for it. >>>> >> >>>> >> On Mon, Jul 5, 2021 at 5:59 PM Till Rohrmann <trohrm...@apache.org> >>>> >> wrote: >>>> >> >>>> >>> I think for RPC communication there are retry strategies used by the >>>> >>> underlying Akka ActorSystem. So a RpcEndpoint can reconnect to a >>>> remote >>>> >>> ActorSystem and resume communication. Moreover, there are also >>>> >>> reconciliation protocols in place which reconcile the states >>>> between the >>>> >>> components because of potentially lost RPC messages. 
So the main >>>> question >>>> >>> would be whether a single connection loss is good enough for >>>> triggering the >>>> >>> timeout or whether we want a more elaborate mechanism to reason >>>> about the >>>> >>> availability of the remote system (e.g. a couple of lost heartbeat >>>> >>> messages). >>>> >>> >>>> >>> Cheers, >>>> >>> Till >>>> >>> >>>> >>> On Mon, Jul 5, 2021 at 10:00 AM Gen Luo <luogen...@gmail.com> >>>> wrote: >>>> >>> >>>> >>>> As far as I know, a TM will report connection failure once its >>>> >>>> connected TM is lost. I suppose JM can believe the report and fail >>>> the >>>> >>>> tasks in the lost TM if it also encounters a connection failure. >>>> >>>> >>>> >>>> Of course, it won't work if the lost TM is standalone. But I >>>> suppose we >>>> >>>> can use the same strategy as the connected scenario. That is, >>>> consider it >>>> >>>> possibly lost on the first connection loss, and fail it if double >>>> check >>>> >>>> also fails. The major difference is the senders of the probes are >>>> the same >>>> >>>> one rather than two different roles, so the results may tend to be >>>> the same. >>>> >>>> >>>> >>>> On the other hand, the fact also means that the jobs can be >>>> fragile in >>>> >>>> an unstable environment, no matter whether the failover is >>>> triggered by TM >>>> >>>> or JM. So maybe it's not that worthy to introduce extra >>>> configurations for >>>> >>>> fault tolerance of heartbeat, unless we also introduce some retry >>>> >>>> strategies for netty connections. >>>> >>>> >>>> >>>> >>>> >>>> On Fri, Jul 2, 2021 at 9:34 PM Till Rohrmann <trohrm...@apache.org >>>> > >>>> >>>> wrote: >>>> >>>> >>>> >>>>> Could you share the full logs with us for the second experiment, >>>> Lu? I >>>> >>>>> cannot tell from the top of my head why it should take 30s unless >>>> you have >>>> >>>>> configured a restart delay of 30s. >>>> >>>>> >>>> >>>>> Let's discuss FLINK-23216 on the JIRA ticket, Gen. >>>> >>>>> >>>> >>>>> I've now implemented FLINK-23209 [1] but it somehow has the >>>> problem >>>> >>>>> that in a flakey environment you might not want to mark a >>>> TaskExecutor dead >>>> >>>>> on the first connection loss. Maybe this is something we need to >>>> make >>>> >>>>> configurable (e.g. introducing a threshold which admittedly is >>>> similar to >>>> >>>>> the heartbeat timeout) so that the user can configure it for her >>>> >>>>> environment. On the upside, if you mark the TaskExecutor dead on >>>> the first >>>> >>>>> connection loss (assuming you have a stable network environment), >>>> then it >>>> >>>>> can now detect lost TaskExecutors as fast as the heartbeat >>>> interval. >>>> >>>>> >>>> >>>>> [1] https://issues.apache.org/jira/browse/FLINK-23209 >>>> >>>>> >>>> >>>>> Cheers, >>>> >>>>> Till >>>> >>>>> >>>> >>>>> On Fri, Jul 2, 2021 at 9:33 AM Gen Luo <luogen...@gmail.com> >>>> wrote: >>>> >>>>> >>>> >>>>>> Thanks for sharing, Till and Yang. >>>> >>>>>> >>>> >>>>>> @Lu >>>> >>>>>> Sorry but I don't know how to explain the new test with the log. >>>> >>>>>> Let's wait for others' reply. >>>> >>>>>> >>>> >>>>>> @Till >>>> >>>>>> It would be nice if JIRAs could be fixed. Thanks again for >>>> proposing >>>> >>>>>> them. >>>> >>>>>> >>>> >>>>>> In addition, I was tracking an issue that RM keeps allocating and >>>> >>>>>> freeing slots after a TM lost until its heartbeat timeout, when >>>> I found the >>>> >>>>>> recovery costing as long as heartbeat timeout. That should be a >>>> minor bug >>>> >>>>>> introduced by declarative resource management. 
I have created a >>>> JIRA about >>>> >>>>>> the problem [1] and we can discuss it there if necessary. >>>> >>>>>> >>>> >>>>>> [1] https://issues.apache.org/jira/browse/FLINK-23216 >>>> >>>>>> >>>> >>>>>> Lu Niu <qqib...@gmail.com> 于2021年7月2日周五 上午3:13写道: >>>> >>>>>> >>>> >>>>>>> Another side question, Shall we add metric to cover the complete >>>> >>>>>>> restarting time (phase 1 + phase 2)? Current metric >>>> jm.restartingTime only >>>> >>>>>>> covers phase 1. Thanks! >>>> >>>>>>> >>>> >>>>>>> Best >>>> >>>>>>> Lu >>>> >>>>>>> >>>> >>>>>>> On Thu, Jul 1, 2021 at 12:09 PM Lu Niu <qqib...@gmail.com> >>>> wrote: >>>> >>>>>>> >>>> >>>>>>>> Thanks TIll and Yang for help! Also Thanks Till for a quick >>>> fix! >>>> >>>>>>>> >>>> >>>>>>>> I did another test yesterday. In this test, I intentionally >>>> throw >>>> >>>>>>>> exception from the source operator: >>>> >>>>>>>> ``` >>>> >>>>>>>> if (runtimeContext.getIndexOfThisSubtask() == 1 >>>> >>>>>>>> && errorFrenquecyInMin > 0 >>>> >>>>>>>> && System.currentTimeMillis() - lastStartTime >= >>>> >>>>>>>> errorFrenquecyInMin * 60 * 1000) { >>>> >>>>>>>> lastStartTime = System.currentTimeMillis(); >>>> >>>>>>>> throw new RuntimeException( >>>> >>>>>>>> "Trigger expected exception at: " + lastStartTime); >>>> >>>>>>>> } >>>> >>>>>>>> ``` >>>> >>>>>>>> In this case, I found phase 1 still takes about 30s and Phase 2 >>>> >>>>>>>> dropped to 1s (because no need for container allocation). Why >>>> phase 1 still >>>> >>>>>>>> takes 30s even though no TM is lost? >>>> >>>>>>>> >>>> >>>>>>>> Related logs: >>>> >>>>>>>> ``` >>>> >>>>>>>> 2021-06-30 00:55:07,463 INFO >>>> >>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph >>>> - Source: >>>> >>>>>>>> >>>> USER_MATERIALIZED_EVENT_SIGNAL-user_context-frontend_event_source -> ... >>>> >>>>>>>> java.lang.RuntimeException: Trigger expected exception at: >>>> 1625014507446 >>>> >>>>>>>> 2021-06-30 00:55:07,509 INFO >>>> >>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph >>>> - Job >>>> >>>>>>>> NRTG_USER_MATERIALIZED_EVENT_SIGNAL_user_context_staging >>>> >>>>>>>> (35c95ee7141334845cfd49b642fa9f98) switched from state RUNNING >>>> to >>>> >>>>>>>> RESTARTING. >>>> >>>>>>>> 2021-06-30 00:55:37,596 INFO >>>> >>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph >>>> - Job >>>> >>>>>>>> NRTG_USER_MATERIALIZED_EVENT_SIGNAL_user_context_staging >>>> >>>>>>>> (35c95ee7141334845cfd49b642fa9f98) switched from state >>>> RESTARTING to >>>> >>>>>>>> RUNNING. >>>> >>>>>>>> 2021-06-30 00:55:38,678 INFO >>>> >>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph >>>> (time when >>>> >>>>>>>> all tasks switch from CREATED to RUNNING) >>>> >>>>>>>> ``` >>>> >>>>>>>> Best >>>> >>>>>>>> Lu >>>> >>>>>>>> >>>> >>>>>>>> >>>> >>>>>>>> On Thu, Jul 1, 2021 at 12:06 PM Lu Niu <qqib...@gmail.com> >>>> wrote: >>>> >>>>>>>> >>>> >>>>>>>>> Thanks TIll and Yang for help! Also Thanks Till for a quick >>>> fix! >>>> >>>>>>>>> >>>> >>>>>>>>> I did another test yesterday. 
In this test, I intentionally >>>> throw >>>> >>>>>>>>> exception from the source operator: >>>> >>>>>>>>> ``` >>>> >>>>>>>>> if (runtimeContext.getIndexOfThisSubtask() == 1 >>>> >>>>>>>>> && errorFrenquecyInMin > 0 >>>> >>>>>>>>> && System.currentTimeMillis() - lastStartTime >= >>>> >>>>>>>>> errorFrenquecyInMin * 60 * 1000) { >>>> >>>>>>>>> lastStartTime = System.currentTimeMillis(); >>>> >>>>>>>>> throw new RuntimeException( >>>> >>>>>>>>> "Trigger expected exception at: " + lastStartTime); >>>> >>>>>>>>> } >>>> >>>>>>>>> ``` >>>> >>>>>>>>> In this case, I found phase 1 still takes about 30s and Phase >>>> 2 >>>> >>>>>>>>> dropped to 1s (because no need for container allocation). >>>> >>>>>>>>> >>>> >>>>>>>>> Some logs: >>>> >>>>>>>>> ``` >>>> >>>>>>>>> ``` >>>> >>>>>>>>> >>>> >>>>>>>>> >>>> >>>>>>>>> On Thu, Jul 1, 2021 at 6:28 AM Till Rohrmann < >>>> trohrm...@apache.org> >>>> >>>>>>>>> wrote: >>>> >>>>>>>>> >>>> >>>>>>>>>> A quick addition, I think with FLINK-23202 it should now >>>> also be >>>> >>>>>>>>>> possible to improve the heartbeat mechanism in the general >>>> case. We can >>>> >>>>>>>>>> leverage the unreachability exception thrown if a remote >>>> target is no >>>> >>>>>>>>>> longer reachable to mark an heartbeat target as no longer >>>> reachable [1]. >>>> >>>>>>>>>> This can then be considered as if the heartbeat timeout has >>>> been triggered. >>>> >>>>>>>>>> That way we should detect lost TaskExecutors as fast as our >>>> heartbeat >>>> >>>>>>>>>> interval is. >>>> >>>>>>>>>> >>>> >>>>>>>>>> [1] https://issues.apache.org/jira/browse/FLINK-23209 >>>> >>>>>>>>>> >>>> >>>>>>>>>> Cheers, >>>> >>>>>>>>>> Till >>>> >>>>>>>>>> >>>> >>>>>>>>>> On Thu, Jul 1, 2021 at 1:46 PM Yang Wang < >>>> danrtsey...@gmail.com> >>>> >>>>>>>>>> wrote: >>>> >>>>>>>>>> >>>> >>>>>>>>>>> Since you are deploying Flink workloads on Yarn, the Flink >>>> >>>>>>>>>>> ResourceManager should get the container >>>> >>>>>>>>>>> completion event after the heartbeat of Yarn NM->Yarn >>>> RM->Flink >>>> >>>>>>>>>>> RM, which is 8 seconds by default. >>>> >>>>>>>>>>> And Flink ResourceManager will release the dead TaskManager >>>> >>>>>>>>>>> container once received the completion event. >>>> >>>>>>>>>>> As a result, Flink will not deploy tasks onto the dead >>>> >>>>>>>>>>> TaskManagers. >>>> >>>>>>>>>>> >>>> >>>>>>>>>>> >>>> >>>>>>>>>>> I think most of the time cost in Phase 1 might be >>>> cancelling the >>>> >>>>>>>>>>> tasks on the dead TaskManagers. >>>> >>>>>>>>>>> >>>> >>>>>>>>>>> >>>> >>>>>>>>>>> Best, >>>> >>>>>>>>>>> Yang >>>> >>>>>>>>>>> >>>> >>>>>>>>>>> >>>> >>>>>>>>>>> Till Rohrmann <trohrm...@apache.org> 于2021年7月1日周四 下午4:49写道: >>>> >>>>>>>>>>> >>>> >>>>>>>>>>>> The analysis of Gen is correct. Flink currently uses its >>>> >>>>>>>>>>>> heartbeat as the primary means to detect dead >>>> TaskManagers. This means that >>>> >>>>>>>>>>>> Flink will take at least `heartbeat.timeout` time before >>>> the system >>>> >>>>>>>>>>>> recovers. Even if the cancellation happens fast (e.g. by >>>> having configured >>>> >>>>>>>>>>>> a low akka.ask.timeout), then Flink will still try to >>>> deploy tasks onto the >>>> >>>>>>>>>>>> dead TaskManager until it is marked as dead and its slots >>>> are released >>>> >>>>>>>>>>>> (unless the ResourceManager does not get a signal from the >>>> underlying >>>> >>>>>>>>>>>> resource management system that a container/pod has died). 
>>>> One way to >>>> >>>>>>>>>>>> improve the situation is to introduce logic which can >>>> react to a >>>> >>>>>>>>>>>> ConnectionException and then black lists or releases a >>>> TaskManager, for >>>> >>>>>>>>>>>> example. This is currently not implemented in Flink, >>>> though. >>>> >>>>>>>>>>>> >>>> >>>>>>>>>>>> Concerning the cancellation operation: Flink currently >>>> does not >>>> >>>>>>>>>>>> listen to the dead letters of Akka. This means that the >>>> `akka.ask.timeout` >>>> >>>>>>>>>>>> is the primary means to fail the future result of a rpc >>>> which could not be >>>> >>>>>>>>>>>> sent. This is also an improvement we should add to Flink's >>>> RpcService. I've >>>> >>>>>>>>>>>> created a JIRA issue for this problem [1]. >>>> >>>>>>>>>>>> >>>> >>>>>>>>>>>> [1] https://issues.apache.org/jira/browse/FLINK-23202 >>>> >>>>>>>>>>>> >>>> >>>>>>>>>>>> Cheers, >>>> >>>>>>>>>>>> Till >>>> >>>>>>>>>>>> >>>> >>>>>>>>>>>> On Wed, Jun 30, 2021 at 6:33 PM Lu Niu <qqib...@gmail.com> >>>> >>>>>>>>>>>> wrote: >>>> >>>>>>>>>>>> >>>> >>>>>>>>>>>>> Thanks Gen! cc flink-dev to collect more inputs. >>>> >>>>>>>>>>>>> >>>> >>>>>>>>>>>>> Best >>>> >>>>>>>>>>>>> Lu >>>> >>>>>>>>>>>>> >>>> >>>>>>>>>>>>> On Wed, Jun 30, 2021 at 12:55 AM Gen Luo < >>>> luogen...@gmail.com> >>>> >>>>>>>>>>>>> wrote: >>>> >>>>>>>>>>>>> >>>> >>>>>>>>>>>>>> I'm also wondering here. >>>> >>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>> In my opinion, it's because the JM can not confirm >>>> whether >>>> >>>>>>>>>>>>>> the TM is lost or it's a temporary network trouble and >>>> will recover soon, >>>> >>>>>>>>>>>>>> since I can see in the log that akka has got a >>>> Connection refused but JM >>>> >>>>>>>>>>>>>> still sends a heartbeat request to the lost TM until it >>>> reaches heartbeat >>>> >>>>>>>>>>>>>> timeout. But I'm not sure if it's indeed designed like >>>> this. >>>> >>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>> I would really appreciate it if anyone who knows more >>>> details >>>> >>>>>>>>>>>>>> could answer. Thanks. >>>> >>>>>>>>>>>>>> >>>> >>>>>>>>>>>>> >>>> >>>
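For anyone who finds this thread later and wants a short-term mitigation on a stock Flink build (without a fork or the FLINK-23209 change), these are the settings discussed above that bound how long detection and restart take. The keys are standard Flink options; the values are only illustrative examples, not recommendations, and lowering the heartbeat settings trades faster detection for more false positives in an unstable environment:

```
# JM <-> TM heartbeat: a lost TM is detected after at most heartbeat.timeout (values in ms)
# Defaults are 10000 / 50000; lower them to detect lost TMs sooner.
heartbeat.interval: 10000
heartbeat.timeout: 50000

# RPC timeout that bounds how long cancellation/deployment RPCs may hang
akka.ask.timeout: 30 s

# Delay before the job restarts after a failure (part of "phase 1" above)
restart-strategy: fixed-delay
restart-strategy.fixed-delay.delay: 1 s
restart-strategy.fixed-delay.attempts: 2147483647
```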