RE: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values

2021-07-23 Thread LINZ, Arnaud
Thanks for your inputs Gen and Arnaud. I do agree with you, Gen, that we need better guidance for our users on when to change the heartbeat configuration. I think this should happen in any case. I am, however

Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values

2021-07-22 Thread Gen Luo
> HBase). Although most of those calls are very fast, when the system is under heavy load they may sometimes block for more than a few seconds, and having our app killed because of a short timeout is not an option.

Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values

2021-07-22 Thread Chesnay Schepler
> Arnaud > From: Gen Luo <luogen...@gmail.com> > Sent: Thursday, 22 July 2021 05:46 > To: Till Rohrmann <trohrm...@apache.org> > Cc: Yang Wang <danrtsey...@gmail.com>;

Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values

2021-07-22 Thread 刘建刚
should have no impact on heartbeats, but from experience, it really does) > > Cheers, > > Arnaud > > From: Gen Luo > > Sent: Thursday, 22 July 2021 05:46 > > To: Ti

Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values

2021-07-22 Thread Till Rohrmann
> (I understand that normally, as user code is not a JVM-blocking activity such as a GC, it should have no impact on heartbeats, but from experience, it really does) > Cheers, > Arnaud > From: Gen Luo > Sent: Thursday, 22 July 20

RE: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values

2021-07-22 Thread LINZ, Arnaud
) Cheers, Arnaud From: Gen Luo Sent: Thursday, 22 July 2021 05:46 To: Till Rohrmann Cc: Yang Wang; dev; user Subject: Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values Hi, Thanks for driving this @Till Rohrmann. I would give

Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values

2021-07-21 Thread Gen Luo
Hi, Thanks for driving this @Till Rohrmann. I would give +1 on reducing the heartbeat timeout and interval, though I'm not sure if 15s and 3s would be enough either. IMO, except for the standalone cluster, where the heartbeat mechanism in Flink is relied on entirely, reducing the heartbeat can also

Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values

2021-07-21 Thread Till Rohrmann
Thanks for sharing these insights. I think it is no longer true that the ResourceManager notifies the JobMaster about lost TaskExecutors. See FLINK-23216 [1] for more details. Given the GC pauses, would you then be ok with decreasing the heartbeat timeout to 20 seconds? This should give enough ti

Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values

2021-07-20 Thread Yang Wang
Thanks @Till Rohrmann for starting this discussion. Firstly, I want to understand the benefit of a shorter heartbeat timeout. IIUC, it will make the JobManager aware of a lost TaskManager faster. However, it seems that only the standalone cluster could benefit from this. For Yarn and native Kubernetes depl

Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values

2021-07-20 Thread Robert Metzger
+1 to this change! When I was working on the reactive mode blog post [1] I also ran into this issue, leading to a poor "out of the box" experience when scaling down. For my experiments, I've chosen a timeout of 8 seconds, and the cluster has been running for 76 days (so far) on Kubernetes. I also

[DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values

2021-07-16 Thread Till Rohrmann
Hi everyone, Since Flink 1.5 we have had the same default values for the heartbeat timeout and interval, defined as heartbeat.timeout: 50s and heartbeat.interval: 10s. These values were mainly chosen to compensate for lengthy GC pauses and blocking operations that were executed in the main threads of
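
To put the numbers from this thread side by side, here is a minimal Python sketch that computes how many heartbeat intervals fit into one timeout, for the current defaults (50s/10s) and for the 15s/3s values questioned in Gen Luo's reply. The function name `intervals_per_timeout` and the labels in `SETTINGS` are hypothetical and chosen only for illustration. This is plain arithmetic and does not model how Flink's heartbeat implementation actually behaves.

```python
# Heartbeat settings mentioned in this thread, in seconds.
# The labels are illustrative; only the numbers come from the messages.
SETTINGS = {
    "current default": {"interval": 10, "timeout": 50},
    "15s/3s from the discussion": {"interval": 3, "timeout": 15},
}


def intervals_per_timeout(interval_s: int, timeout_s: int) -> int:
    """Return how many whole heartbeat intervals fit into one timeout."""
    if interval_s <= 0 or timeout_s <= 0:
        raise ValueError("interval and timeout must be positive")
    return timeout_s // interval_s


for name, cfg in SETTINGS.items():
    n = intervals_per_timeout(cfg["interval"], cfg["timeout"])
    print(
        f"{name}: timeout={cfg['timeout']}s, "
        f"interval={cfg['interval']}s, {n} intervals per timeout"
    )
```

Both settings keep the same timeout-to-interval ratio of 5, so in this simplified view the shorter values only reduce the absolute waiting time before a timeout fires.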