Thanks, Till. There are many reasons to reduce the heartbeat interval and timeout, but I am not sure which values are suitable. In our case, GC time and big jobs can be relevant factors. Since most Flink jobs are pipelined and a full failover can take considerable time, we should tolerate some stop-the-world pauses. Also, I think FLINK-23216 should be solved so that lost containers are detected quickly and reacted to. From my side, I suggest reducing the values gradually.
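To make "gradually" concrete, an illustrative schedule in flink-conf.yaml could look like the following (these steps are only a sketch of the idea, not validated recommendations; each step should run for a while, watching for spurious timeouts, before tightening further):

    # step 1: current defaults
    heartbeat.timeout: 50s
    heartbeat.interval: 10s

    # step 2: intermediate values, observe for false positives
    heartbeat.timeout: 30s
    heartbeat.interval: 6s

    # step 3: FLIP-185 target
    heartbeat.timeout: 15s
    heartbeat.interval: 3s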
Till Rohrmann <trohrm...@apache.org> wrote on Thursday, July 22, 2021 at 5:33 PM:

> Thanks for your inputs Gen and Arnaud.
>
> I do agree with you, Gen, that we need better guidance for our users on
> when to change the heartbeat configuration. I think this should happen in
> any case. I am, however, not so sure whether we can give a hard threshold
> like 5000 tasks, for example, because as Arnaud said it strongly depends
> on the workload. Maybe we can explain it based on symptoms a user might
> experience and what to do then.
>
> Concerning your workloads, Arnaud, I'd be interested to learn a bit more.
> The user code runs in its own thread. This means that its operation won't
> block the main thread/heartbeat. The only thing that can happen is that
> the user code starves the heartbeat in terms of CPU cycles or causes a
> lot of GC pauses. If you are observing the former problem, then we might
> think about changing the priorities of the respective threads. This
> should then improve Flink's stability for these workloads, and a shorter
> heartbeat timeout should be possible.
>
> Also, for the RAM-cached repositories, what exactly is causing the
> heartbeat to time out? Is it because you have a lot of GC, or that the
> heartbeat thread does not get enough CPU cycles?
>
> Cheers,
> Till
>
> On Thu, Jul 22, 2021 at 9:16 AM LINZ, Arnaud <al...@bouyguestelecom.fr> wrote:
>
> > Hello,
> >
> > From a user perspective: we have some (rare) use cases where we use
> > "coarse grain" datasets, with big beans and tasks that do lengthy
> > operations (such as ML training). In these cases we had to increase the
> > timeout to huge values (heartbeat.timeout: 500000) so that our app is
> > not killed.
> >
> > I'm aware this is not the way Flink was meant to be used, but it's a
> > convenient way to distribute our workload on datanodes without having
> > to use another concurrency framework (such as M/R) that would require
> > recoding our sources and sinks.
> >
> > In some other (more common) cases, our tasks do some R/W accesses to
> > RAM-cached repositories backed by a key-value store such as Kudu (or
> > HBase). While most of those calls are very fast, sometimes when the
> > system is under heavy load they may block for more than a few seconds,
> > and having our app killed because of a short timeout is not an option.
> >
> > That's why I'm not in favor of very short timeouts, because in my
> > experience it really depends on what the user code does in the tasks.
> > (I understand that normally, as user code is not a JVM-blocking
> > activity such as a GC, it should have no impact on heartbeats, but from
> > experience, it really does.)
> >
> > Cheers,
> > Arnaud
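As a side note on the starvation point Till and Arnaud discuss above, the effect is easy to reproduce outside Flink. The following is a toy, self-contained Java sketch (not Flink code; the thread names and the 3s period are invented for illustration, and the observed lag depends heavily on the OS scheduler and core count) showing how high-priority busy "user" threads can delay a low-priority "heartbeat" thread:

    // Toy demo (not Flink code): a low-priority "heartbeat" thread next to
    // CPU-saturating high-priority "user" threads.
    public class HeartbeatStarvationDemo {
        public static void main(String[] args) {
            int cores = Runtime.getRuntime().availableProcessors();
            // Saturate all cores with high-priority busy loops ("user code").
            for (int i = 0; i < cores; i++) {
                Thread busy = new Thread(() -> { while (true) { } }, "user-" + i);
                busy.setPriority(Thread.MAX_PRIORITY);
                busy.setDaemon(true); // let the JVM exit when the demo is killed
                busy.start();
            }
            final long periodMillis = 3_000; // mimics heartbeat.interval: 3s
            Thread heartbeat = new Thread(() -> {
                long next = System.currentTimeMillis();
                while (true) {
                    // How late is this tick relative to its schedule?
                    System.out.println("heartbeat lag: "
                            + (System.currentTimeMillis() - next) + " ms");
                    next += periodMillis;
                    long sleep = next - System.currentTimeMillis();
                    try {
                        if (sleep > 0) Thread.sleep(sleep);
                    } catch (InterruptedException e) {
                        return;
                    }
                }
            }, "heartbeat");
            heartbeat.setPriority(Thread.MIN_PRIORITY);
            heartbeat.start();
        }
    }

On many operating systems Java thread priorities have little effect by default, which is consistent with Arnaud's observation that starvation shows up "from experience" rather than predictably.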
> > From: Gen Luo <luogen...@gmail.com>
> > Sent: Thursday, July 22, 2021 05:46
> > To: Till Rohrmann <trohrm...@apache.org>
> > Cc: Yang Wang <danrtsey...@gmail.com>; dev <dev@flink.apache.org>; user <u...@flink.apache.org>
> > Subject: Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values
> >
> > Hi,
> >
> > Thanks for driving this, @Till Rohrmann <trohrm...@apache.org>. I would
> > give +1 on reducing the heartbeat timeout and interval, though I'm not
> > sure whether 15s and 3s would be enough either.
> >
> > IMO, except for the standalone cluster, where Flink's heartbeat
> > mechanism is fully relied upon, reducing the heartbeat can also help
> > the JM find out faster about TaskExecutors in abnormal conditions that
> > cannot respond to heartbeat requests, e.g., under continuous Full GC,
> > where the TaskExecutor process is alive but may not be known to the
> > deployment system. Since there are cases that can benefit from this
> > change, I think it could be done if it won't break the experience in
> > other scenarios.
> >
> > If we can identify what blocks the main threads from processing
> > heartbeats, or what enlarges the GC cost, we can try to get rid of
> > those to achieve a more predictable heartbeat response time, or give
> > some advice to users whose jobs may encounter these issues. For
> > example, as far as I know the JM of a large-scale job will be busier
> > and may not be able to process heartbeats in time, so we could advise
> > users running jobs larger than 5000 tasks to enlarge their heartbeat
> > interval to 10s and timeout to 50s. (The numbers here are just
> > examples.)
> >
> > As for the issue in FLINK-23216, I think it should be fixed but may
> > not be a main concern for this case.
> >
> > On Wed, Jul 21, 2021 at 6:26 PM Till Rohrmann <trohrm...@apache.org> wrote:
> >
> > Thanks for sharing these insights.
> >
> > I think it is no longer true that the ResourceManager notifies the
> > JobMaster about lost TaskExecutors. See FLINK-23216 [1] for more
> > details.
> >
> > Given the GC pauses, would you then be ok with decreasing the
> > heartbeat timeout to 20 seconds? This should give enough time to do
> > the GC and then still send/receive a heartbeat request.
> >
> > I also wanted to add that we are about to get rid of one big cause of
> > blocking I/O operations from the main thread. With FLINK-22483 [2] we
> > will get rid of filesystem accesses to retrieve completed checkpoints.
> > This leaves us with one additional file system access from the main
> > thread, which is the one completing a pending checkpoint. I think it
> > should be possible to get rid of this access because, as Stephan said,
> > it only writes information to disk that has already been written
> > before. Maybe solving these two issues could ease concerns about long
> > pauses of unresponsiveness in Flink.
> >
> > [1] https://issues.apache.org/jira/browse/FLINK-23216
> > [2] https://issues.apache.org/jira/browse/FLINK-22483
> >
> > Cheers,
> > Till
> >
> > On Wed, Jul 21, 2021 at 4:58 AM Yang Wang <danrtsey...@gmail.com> wrote:
> >
> > Thanks @Till Rohrmann <trohrm...@apache.org> for starting this
> > discussion.
> >
> > Firstly, I try to understand the benefit of a shorter heartbeat
> > timeout. IIUC, it will make the JobManager aware of a lost TaskManager
> > faster. However, it seems that only the standalone cluster could
> > benefit from this. For Yarn and native Kubernetes deployments, the
> > Flink ResourceManager gets the TaskManager-lost event in a very short
> > time:
> >
> > * About 8 seconds on Yarn: 3s for Yarn NM -> Yarn RM, 5s for Yarn RM -> Flink RM
> > * Less than 1 second on Kubernetes: the Flink RM has a watch on all the TaskManager pods
> >
> > Secondly, I am not very confident about decreasing the timeout to 15s.
> > I have quickly checked the TaskManager GC logs of our internal Flink
> > workloads from the past week and found more than 100 Full GC pauses of
> > around 10 seconds, though none longer than 15s. We are using CMS GC
> > for the old generation.
> >
> > Best,
> > Yang
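The kind of check Yang describes can be scripted. Below is a minimal sketch (my own, not from the thread) that scans a JDK 8-style GC log for long stop-the-world pauses. It assumes the log was produced with -XX:+PrintGCApplicationStoppedTime, whose lines look like "Total time for which application threads were stopped: 10.1234567 seconds"; the file name and the 10s threshold are placeholders:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Scan a JDK 8 GC log for stop-the-world pauses above a threshold.
    public class GcPauseScan {
        private static final Pattern STOPPED = Pattern.compile(
            "Total time for which application threads were stopped: ([0-9.]+) seconds");

        public static void main(String[] args) throws IOException {
            double thresholdSeconds = 10.0; // pauses that endanger a 15s timeout
            Files.lines(Paths.get(args.length > 0 ? args[0] : "gc.log"))
                .map(STOPPED::matcher)
                .filter(Matcher::find)
                .mapToDouble(m -> Double.parseDouble(m.group(1)))
                .filter(p -> p >= thresholdSeconds)
                .forEach(p -> System.out.printf("stop-the-world pause: %.2f s%n", p));
        }
    }

Running this over a week of TaskManager logs, as Yang did, gives a concrete distribution of worst-case pauses to compare against any candidate timeout.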
> > Till Rohrmann <trohrm...@apache.org> wrote on Saturday, July 17, 2021 at 1:05 AM:
> >
> > Hi everyone,
> >
> > Since Flink 1.5 we have had the same heartbeat timeout and interval
> > default values, defined as heartbeat.timeout: 50s and
> > heartbeat.interval: 10s. These values were mainly chosen to compensate
> > for lengthy GC pauses and blocking operations that were executed in
> > the main threads of Flink's components. Since then, there have been
> > quite some advancements w.r.t. the JVM's GCs and we have also gotten
> > rid of a lot of blocking calls that were executed in the main thread.
> > Moreover, a long heartbeat.timeout causes long recovery times in case
> > of a TaskManager loss because the system can only properly recover
> > after the dead TaskManager has been removed from the scheduler. Hence,
> > I wanted to propose changing the timeout and interval to:
> >
> > heartbeat.timeout: 15s
> > heartbeat.interval: 3s
> >
> > Since there is no perfect solution that fits all use cases, I would
> > really like to hear from you what you think about it and how you
> > configure these heartbeat options. Based on your experience we might
> > actually come up with better default values that allow us to be
> > resilient but also to detect failed components fast. FLIP-185 can be
> > found here [1].
> >
> > [1] https://cwiki.apache.org/confluence/x/GAoBCw
> >
> > Cheers,
> > Till
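For anyone who wants to try the proposed values before any defaults change: they can go into flink-conf.yaml as written above, or be set programmatically for a local experiment. A minimal sketch, assuming a Flink version where StreamExecutionEnvironment.createLocalEnvironment(Configuration) exists and the HeartbeatManagerOptions keys take milliseconds as longs (both may differ across versions):

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.configuration.HeartbeatManagerOptions;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    // Run a trivial local job with the proposed FLIP-185 heartbeat values.
    public class ShortHeartbeatExperiment {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set(HeartbeatManagerOptions.HEARTBEAT_INTERVAL, 3_000L);  // heartbeat.interval: 3s
            conf.set(HeartbeatManagerOptions.HEARTBEAT_TIMEOUT, 15_000L);  // heartbeat.timeout: 15s

            StreamExecutionEnvironment env =
                StreamExecutionEnvironment.createLocalEnvironment(conf);
            env.fromElements(1, 2, 3).print();
            env.execute("short-heartbeat-experiment");
        }
    }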