Thanks for proposing this FLIP. Experiments have shown that it significantly enhances the real-time query experience. +1 for this.

Best,
Weihua

On Mon, Jan 8, 2024 at 5:19 PM Rui Fan <1996fan...@gmail.com> wrote:

> Thanks Xiangyu for the quick update!
>
> LGTM
>
> Best,
> Rui
>
> On Mon, Jan 8, 2024 at 4:27 PM xiangyu feng <xiangyu...@gmail.com> wrote:
>
> > Hi Rui and Yong,
> >
> > Thanks for your reply.
> >
> > My initial intention here was that, for short-lived jobs under high QPS,
> > a fixed-delay retry strategy wastes extra resources and is not flexible
> > enough, while an exponential-backoff strategy might significantly
> > increase the query latency since the interval grows too fast. An
> > incremental-delay strategy could strike a balance between resource
> > consumption and short-query latency.
> >
> > On second thought, an exponential-delay retry strategy with a
> > configurable multiplier option can also achieve this goal. By setting
> > the default value of the multiplier to 1, we can stay consistent with
> > the original behavior and reduce the number of configuration items at
> > the same time.
> >
> > I've updated this FLIP accordingly and look forward to your feedback.
> >
> > Regards,
> > Xiangyu Feng
> >
> > On Mon, Jan 8, 2024 at 15:29, Rui Fan <1996fan...@gmail.com> wrote:
> >
> >> Only one strategy is fine with me.
> >>
> >> When the multiplier is set to 1, the exponential-delay strategy
> >> becomes fixed-delay, so fixed-delay may not be needed.
> >>
> >> Best,
> >> Rui
> >>
> >> On Mon, Jan 8, 2024 at 2:17 PM Yong Fang <zjur...@gmail.com> wrote:
> >>
> >> > I agree with @Rui that the current configuration for Flink Client is
> >> > a little complex. Can we just provide one strategy with fewer
> >> > configuration items for all scenarios?
> >> >
> >> > Best,
> >> > Fang Yong
> >> >
> >> > On Mon, Jan 8, 2024 at 11:19 AM Rui Fan <1996fan...@gmail.com> wrote:
> >> >
> >> > > Thanks xiangyu for driving this proposal! And sorry for the
> >> > > late reply.
> >> > >
> >> > > Overall it looks good to me. I only have some minor questions:
> >> > >
> >> > > 1. Do we need to introduce 3 collect strategies in the first
> >> > > version?
> >> > >
> >> > > Large and comprehensive configuration items bring additional
> >> > > learning and usage costs to users. I tend to provide users with
> >> > > out-of-the-box parameters, and 2 collect strategies may be enough
> >> > > for users.
> >> > >
> >> > > IIUC, there is no big difference between exponential-delay and
> >> > > incremental-delay, especially with the default parameters provided.
> >> > > I wonder whether we could provide a multiplier for the
> >> > > exponential-delay strategy and remove the incremental-delay
> >> > > strategy?
> >> > >
> >> > > Of course, if you think the multiplier option is not needed based
> >> > > on your production experience, that's totally fine with me. Simple
> >> > > is better.
> >> > >
> >> > > 2. Which strategy do you think is best in large-scale production?
> >> > >
> >> > > I'm working on FLIP-364[1], which is related to the Flink failover
> >> > > restart strategy. IIUC, when a cluster only has a few Flink jobs,
> >> > > fixed-delay is fine. It guarantees minimal latency without too
> >> > > much stress. But if a cluster has too many jobs, fixed-delay
> >> > > may not be stable.
> >> > >
> >> > > Do you think exponential-delay is better than fixed-delay in this
> >> > > scenario? And which strategy is used in your production for now?
> >> > > Would you mind sharing it?
> >> > >
> >> > > Looking forward to your opinion~
> >> > >
> >> > > Best,
> >> > > Rui
> >> > >
> >> > > On Sat, Jan 6, 2024 at 5:54 PM xiangyu feng <xiangyu...@gmail.com>
> >> > > wrote:
> >> > >
> >> > > > Hi all,
> >> > > >
> >> > > > Thanks for the comments.
> >> > > >
> >> > > > If there are no further comments, we will open the voting thread
> >> > > > next week.
> >> > > >
> >> > > > Regards,
> >> > > > Xiangyu
> >> > > >
> >> > > > On Wed, Jan 3, 2024 at 16:46, Zhanghao Chen
> >> > > > <zhanghao.c...@outlook.com> wrote:
> >> > > >
> >> > > > > Thanks for driving this effort on improving the interactive
> >> > > > > use experience of Flink. The proposal overall looks good to me.
> >> > > > >
> >> > > > > Best,
> >> > > > > Zhanghao Chen
> >> > > > > ________________________________
> >> > > > > From: xiangyu feng <xiangyu...@gmail.com>
> >> > > > > Sent: Tuesday, December 26, 2023 16:51
> >> > > > > To: dev@flink.apache.org <dev@flink.apache.org>
> >> > > > > Subject: [Discuss] FLIP-407: Improve Flink Client performance
> >> > > > > in interactive scenarios
> >> > > > >
> >> > > > > Hi devs,
> >> > > > >
> >> > > > > I'm opening this thread to discuss FLIP-407: Improve Flink
> >> > > > > Client performance in interactive scenarios. The POC test
> >> > > > > results and design doc can be found at: FLIP-407
> >> > > > > <https://cwiki.apache.org/confluence/display/FLINK/FLIP-407%3A+Improve+Flink+Client+performance+when+interacting+with+dedicated+Flink+Session+Clusters>.
> >> > > > >
> >> > > > > Currently, Flink Client is mainly designed for one-time
> >> > > > > interaction with the Flink Cluster. All the resources (HTTP
> >> > > > > connections, threads, HA services) and instances
> >> > > > > (ClusterDescriptor, ClusterClient, RestClient) are created and
> >> > > > > recycled for each interaction. This works well when users do
> >> > > > > not need to interact frequently with the Flink Cluster, and it
> >> > > > > also saves resource usage since resources are recycled
> >> > > > > immediately after each use.
> >> > > > >
> >> > > > > However, in OLAP or StreamingWarehouse scenarios, users might
> >> > > > > submit interactive jobs to a dedicated Flink Session Cluster
> >> > > > > very often. In this case, we find that short queries that can
> >> > > > > finish in less than 1s in the Flink Cluster still have an E2E
> >> > > > > latency greater than 2s. Hence, we propose this FLIP to improve
> >> > > > > the Flink Client performance in this scenario. This could also
> >> > > > > improve the user experience when using session debug mode.
> >> > > > >
> >> > > > > The major change in this FLIP is a newly introduced option,
> >> > > > > *'execution.interactive-client'*. When this option is enabled,
> >> > > > > Flink Client will reuse all the necessary resources to improve
> >> > > > > interactive performance, including HA services, HTTP
> >> > > > > connections, threads and all kinds of instances related to a
> >> > > > > long-running Flink Cluster. The default value of this option
> >> > > > > will be false, in which case Flink Client behaves as before.
> >> > > > >
> >> > > > > Also, this FLIP proposes a configurable RetryStrategy for
> >> > > > > fetching results from the Flink Cluster on the client side. In
> >> > > > > interactive scenarios, this can save more than 15% of TM CPU
> >> > > > > usage without performance degradation.
> >> > > > >
> >> > > > > Looking forward to your feedback, thanks.
> >> > > > >
> >> > > > > Best regards,
> >> > > > > Xiangyu
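
To make the point from the thread concrete (an exponential-delay strategy with a multiplier of 1 behaves like fixed-delay, while a multiplier between 1 and 2 grows more gently than classic doubling), here is a minimal sketch of how such a delay schedule could be computed. It is not the FLIP's implementation: the function name and the parameters `initial_delay_ms`, `multiplier`, `max_delay_ms` and `attempts` are hypothetical and are not Flink option keys, and the cap on the delay is an assumption added for the illustration.

```python
from typing import List


def retry_delays_ms(initial_delay_ms: int, multiplier: float,
                    max_delay_ms: int, attempts: int) -> List[int]:
    """Return the wait (in ms) before each retry of an exponential-delay schedule.

    All names here are hypothetical and only illustrate the idea discussed
    in the thread; they do not correspond to actual Flink options.
    """
    if initial_delay_ms < 0 or max_delay_ms < 0 or attempts < 0:
        raise ValueError("delays and attempts must be non-negative")
    if multiplier < 1:
        raise ValueError("multiplier must be >= 1")

    delays = []
    delay = float(initial_delay_ms)
    for _ in range(attempts):
        # Never wait longer than the configured cap (assumed for this sketch).
        delays.append(int(min(delay, max_delay_ms)))
        # Grow the next delay geometrically; a multiplier of 1 keeps it constant.
        delay = min(delay * multiplier, float(max_delay_ms))
    return delays


# multiplier = 1 degenerates to a fixed-delay schedule.
fixed = retry_delays_ms(100, 1.0, 1000, 5)
# multiplier = 2 is classic doubling, capped at max_delay_ms.
backoff = retry_delays_ms(100, 2.0, 1000, 5)
# A multiplier between 1 and 2 grows more slowly than doubling.
gentle = retry_delays_ms(100, 1.5, 1000, 5)

print(fixed)
print(backoff)
print(gentle)
```

With a single multiplier knob, one strategy covers both the fixed-delay case and slower-growing backoff, which is the reason given in the thread for dropping the separate incremental-delay strategy.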