Yes, time is the main factor when detecting the TM's liveness. The
count-based method still checks at fixed intervals, so it effectively
amounts to a time threshold.
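A minimal, illustrative sketch of that equivalence (not Flink's
implementation; the class and names below are made up):
```
import java.time.Duration;

// Illustrative only: counting missed heartbeats at a fixed probe interval
// declares the peer dead after roughly maxMissed * interval, i.e. it is a
// time threshold expressed indirectly.
class CountBasedLiveness {
    private final Duration probeInterval;
    private final int maxMissed;
    private int missed = 0;

    CountBasedLiveness(Duration probeInterval, int maxMissed) {
        this.probeInterval = probeInterval;
        this.maxMissed = maxMissed;
    }

    // Called once per probe tick; returns true when the peer counts as dead.
    boolean onProbeTick(boolean heartbeatReceived) {
        missed = heartbeatReceived ? 0 : missed + 1;
        return missed >= maxMissed;
    }

    // The equivalent time-based threshold.
    Duration effectiveTimeout() {
        return probeInterval.multipliedBy(maxMissed);
    }
}
```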
Gen Luo wrote on Fri, Jul 9, 2021, at 10:37 AM:
> @刘建刚
> Welcome to the discussion, and thanks for sharing your experience.
>
> I have a minor question. In my experience, network failures in a certain
> cluster …
Gen is right with his explanation of why dead TM discovery can be faster
with Flink < 1.12.
Concerning flaky TaskManager connections:
2.1 I think the problem is that the receiving TM does not know the
container ID of the sending TM. It only knows its address. But this is
something one could improve …
@刘建刚
Welcome to the discussion, and thanks for sharing your experience.
I have a minor question. In my experience, network failures in a certain
cluster usually take some time to recover, which can be measured as a p99
to guide configuration. So I suppose it would be better to use time than
attempt count …
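As a rough illustration of sizing a timeout from such a measurement
(purely a sketch; the method, the sample data, and the 1.5x safety factor
are assumptions, not anything from this thread):
```
import java.time.Duration;
import java.util.Arrays;

// Illustrative sketch: derive a heartbeat timeout from observed recovery
// times so that transient network blips below the p99 do not get a healthy
// TaskManager declared dead.
class TimeoutSizing {
    // recoveryMillis: observed durations of past transient network failures.
    static Duration suggestedHeartbeatTimeout(long[] recoveryMillis, double safetyFactor) {
        long[] sorted = recoveryMillis.clone();
        Arrays.sort(sorted);
        int p99Index = Math.max((int) Math.ceil(0.99 * sorted.length) - 1, 0);
        long p99 = sorted[p99Index];
        return Duration.ofMillis((long) (p99 * safetyFactor));
    }

    public static void main(String[] args) {
        long[] samples = {800, 1200, 2500, 4000, 15000}; // made-up sample data
        System.out.println(suggestedHeartbeatTimeout(samples, 1.5)); // PT22.5S
    }
}
```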
Thanks everyone! This is a great discussion!
1. Restarting takes 30s when throwing exceptions from application code
because the restart delay is set to 30s in the config. Previously, lots of
related config values were 30s, which led to the confusion. I redid the
test with config:
FixedDelayRestartBackoffTimeStrategy(maxN…
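For context, a minimal sketch of configuring a 30 s fixed-delay restart
strategy via the DataStream API (the attempt count and the toy pipeline are
placeholders, not the poster's actual settings):
```
import java.util.concurrent.TimeUnit;
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RestartStrategyExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Fixed-delay restart: up to 3 attempts, waiting 30 s between attempts.
        // The 30 s delay is what shows up as the restarting gap discussed above.
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.of(30, TimeUnit.SECONDS)));

        env.fromElements(1, 2, 3).print();
        env.execute("restart-strategy-example");
    }
}
```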
It is really helpful to find the lost container quickly. In our internal
Flink version, we optimize this with the task's report and the jobmaster's
probe. When a task fails because of a connection problem, it reports this
to the jobmaster. The jobmaster will then try to confirm the liveness of
the unconnected taskmanager for cer…
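A very rough sketch of that report-and-probe pattern (illustrative only;
this is not the internal implementation being described, and all names
below are invented):
```
import java.util.concurrent.CompletableFuture;

// Illustrative only: when a task reports a connection failure to a peer TM,
// the jobmaster actively probes that TM a few times before declaring it
// dead, instead of waiting for the full heartbeat timeout.
class ProbeOnReportSketch {

    interface TaskManagerGateway {
        CompletableFuture<Void> probe(String taskManagerId); // e.g. a lightweight RPC ping
    }

    private final TaskManagerGateway gateway;
    private final int probeAttempts;

    ProbeOnReportSketch(TaskManagerGateway gateway, int probeAttempts) {
        this.gateway = gateway;
        this.probeAttempts = probeAttempts;
    }

    // Called when a task reports that it lost its connection to suspectTmId.
    boolean confirmDead(String suspectTmId) {
        for (int i = 0; i < probeAttempts; i++) {
            try {
                gateway.probe(suspectTmId).get(); // wait for the probe to answer or fail
                return false;                     // the TM answered, treat it as alive
            } catch (Exception e) {
                // probe failed or timed out, try again
            }
        }
        return true; // no probe succeeded, consider the TM lost
    }
}
```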
Yes, I have noticed the PR and commented there with some consideration
about the new option. We can discuss further there.
On Tue, Jul 6, 2021 at 6:04 PM Till Rohrmann wrote:
> This is actually a very good point, Gen. There might not be a lot to gain
> for us by implementing a fancy algorithm for …
This is actually a very good point, Gen. There might not be a lot to gain
for us by implementing a fancy algorithm for figuring out whether a TM is
dead or not based on failed heartbeat RPCs from the JM if the TM <> TM
communication does not tolerate failures and directly fails the affected
tasks. …
I know that there are retry strategies in the Akka RPC framework. I was
just considering that, since the environment is shared by the JM and TMs,
and the connections among TMs (using Netty) are flaky in unstable
environments, which will also cause job failures, is it necessary to build
a strongly guaranteed …
I think for RPC communication there are retry strategies used by the
underlying Akka ActorSystem. So an RpcEndpoint can reconnect to a remote
ActorSystem and resume communication. Moreover, there are also
reconciliation protocols in place which reconcile the states between the
components because of …
As far as I know, a TM will report a connection failure once a TM it is
connected to is lost. I suppose the JM can believe the report and fail the
tasks in the lost TM if it also encounters a connection failure.
Of course, it won't work if the lost TM is standalone. But I suppose we can
use the same strategy a…
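Purely as an illustration of that decision rule (not Flink code; the types
below are invented), the JM would fail the tasks on a suspect TM only when
the peer's report and its own connectivity check agree:
```
// Illustrative sketch of "trust the report only if the JM also cannot reach the TM".
class SuspectTmPolicy {

    interface ConnectivityCheck {
        boolean jobMasterCanReach(String taskManagerId);
    }

    private final ConnectivityCheck check;

    SuspectTmPolicy(ConnectivityCheck check) {
        this.check = check;
    }

    // A peer TM reported that it lost its connection to suspectTmId.
    // Fail the tasks on that TM only if the JM cannot reach it either;
    // otherwise treat it as a flaky TM<->TM link and keep the tasks running.
    boolean shouldFailTasksOn(String suspectTmId) {
        return !check.jobMasterCanReach(suspectTmId);
    }
}
```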
Could you share the full logs with us for the second experiment, Lu? I
cannot tell off the top of my head why it should take 30s unless you have
configured a restart delay of 30s.
Let's discuss FLINK-23216 on the JIRA ticket, Gen.
I've now implemented FLINK-23209 [1] but it somehow has the problem …
Thanks for sharing, Till and Yang.
@Lu
Sorry, but I don't know how to explain the new test with the log. Let's
wait for others' replies.
@Till
It would be nice if the JIRAs could be fixed. Thanks again for proposing
them.
In addition, I was tracking an issue where the RM keeps allocating and
freeing slots a…
Another side question: shall we add a metric to cover the complete
restarting time (phase 1 + phase 2)? The current metric jm.restartingTime
only covers phase 1. Thanks!
Best
Lu
On Thu, Jul 1, 2021 at 12:09 PM Lu Niu wrote:
> Thanks Till and Yang for the help! Also thanks Till for the quick fix!
>
> I did …
Thanks Till and Yang for the help! Also thanks Till for the quick fix!
I did another test yesterday. In this test, I intentionally throw an
exception from the source operator:
```
if (runtimeContext.getIndexOfThisSubtask() == 1
        && errorFrenquecyInMin > 0
        && System.currentTimeMillis() - last…
```
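The snippet above is cut off in the archive. A possible reconstruction of
such an error-injecting source, purely as a sketch (the lastErrorTimeMs
field, the exact condition, and the exception message are guesses, not the
original code):
```
import org.apache.flink.streaming.api.functions.source.RichParallelSourceFunction;

// Sketch of a source that throws an exception from subtask 1 roughly every
// errorFrenquecyInMin minutes, to exercise the failover path.
public class ErrorInjectingSource extends RichParallelSourceFunction<Long> {

    private final long errorFrenquecyInMin;
    private long lastErrorTimeMs = System.currentTimeMillis();
    private volatile boolean running = true;

    public ErrorInjectingSource(long errorFrenquecyInMin) {
        this.errorFrenquecyInMin = errorFrenquecyInMin;
    }

    @Override
    public void run(SourceContext<Long> ctx) throws Exception {
        long counter = 0;
        while (running) {
            if (getRuntimeContext().getIndexOfThisSubtask() == 1
                    && errorFrenquecyInMin > 0
                    && System.currentTimeMillis() - lastErrorTimeMs
                            > errorFrenquecyInMin * 60_000L) {
                lastErrorTimeMs = System.currentTimeMillis();
                throw new RuntimeException("Intentional failure for failover testing");
            }
            synchronized (ctx.getCheckpointLock()) {
                ctx.collect(counter++);
            }
            Thread.sleep(10);
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}
```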
A quick addition: I think with FLINK-23202 it should now also be possible
to improve the heartbeat mechanism in the general case. We can leverage the
unreachability exception thrown if a remote target is no longer reachable
to mark a heartbeat target as no longer reachable [1]. This can then be
co…
Since you are deploying Flink workloads on Yarn, the Flink ResourceManager
should get the container completion event after the heartbeat chain of
Yarn NM -> Yarn RM -> Flink RM, which is 8 seconds by default.
And the Flink ResourceManager will release the dead TaskManager container
once it receives the completion e…
Gen's analysis is correct. Flink currently uses its heartbeat as the
primary means to detect dead TaskManagers. This means that Flink will take
at least `heartbeat.timeout` time before the system recovers. Even if the
cancellation happens fast (e.g. by having configured a low
akka.ask.timeout) …
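For reference, a hedged sketch of how the timeouts mentioned here could be
lowered for a local test (the concrete values are illustrative assumptions,
not recommendations):
```
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class HeartbeatTuningExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Heartbeat options (milliseconds): a TM is only declared dead after
        // heartbeat.timeout (50000 ms by default), so recovery takes at least
        // this long when the heartbeat is the only liveness signal.
        conf.setString("heartbeat.interval", "5000");
        conf.setString("heartbeat.timeout", "20000");

        // RPC ask timeout: bounds how long individual RPCs (e.g. cancellation
        // calls) wait before failing.
        conf.setString("akka.ask.timeout", "10 s");

        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(conf);

        env.fromElements("a", "b", "c").print();
        env.execute("heartbeat-tuning-example");
    }
}
```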
Thanks Gen! cc flink-dev to collect more inputs.
Best
Lu
On Wed, Jun 30, 2021 at 12:55 AM Gen Luo wrote:
> I'm also wondering here.
>
> In my opinion, it's because the JM cannot confirm whether the TM is lost
> or it's just temporary network trouble that will recover soon, since I
> can see in the …
I'm also wondering here.
In my opinion, it's because the JM cannot confirm whether the TM is lost
or it's just temporary network trouble that will recover soon, since I can
see in the log that Akka has got a "Connection refused" but the JM still
sends heartbeat requests to the lost TM until it reaches heartbeat.timeout …
Hi Gen,
Thanks for replying! The reasoning overall makes sense. But in this case,
when the JM sends a cancel request to a killed TM, why does the call time
out after 30s instead of returning "connection refused" immediately?
Best
Lu
On Tue, Jun 29, 2021 at 7:52 PM Gen Luo wrote:
> Hi Lu,
>
> We found almost …
Hi Lu,
We found almost the same thing when we were testing failover in a
large-scale job. The akka.ask.timeout and heartbeat.timeout were set to
10min for the test, and we found that the job would take 10min to recover
from a TM loss.
We reached the conclusion that the behavior is expected in the Flink …