Re: [DISCUSS] FLIP-315: Support Operator Fusion Codegen for Flink SQL

Aitozi Wed, 07 Jun 2023 08:33:00 -0700

Hi Ron,
    Sorry for the late reply after the voting process. I just have a few
questions.


> Traverse the ExecNode DAG and create a FusionExecNode  for physical
operators that can be fused together.
Which kinds of operators can be fused together? Are they the operators within
an operator chain? Is this optimization comparable to Spark's whole-stage
codegen?
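
To make the question concrete, here is a minimal sketch of what I understand
by operator fusion, in the spirit of whole-stage codegen. All class and method
names are hypothetical and come from neither Flink nor the FLIP. It only
contrasts a chain of operators linked by virtual calls with the single loop
that a code generator might emit for the same pipeline.

```java
/** Hypothetical illustration only; none of these names exist in Flink or the FLIP. */
final class FusionIdeaSketch {

    /** Operator-at-a-time execution: one virtual call per operator per row. */
    interface Operator {
        void processElement(int value);
    }

    /** Keeps even values and forwards them downstream. */
    static final class Filter implements Operator {
        private final Operator next;
        Filter(Operator next) { this.next = next; }
        @Override public void processElement(int value) {
            if (value % 2 == 0) {
                next.processElement(value);
            }
        }
    }

    /** Multiplies each value by 3 and forwards it downstream. */
    static final class Project implements Operator {
        private final Operator next;
        Project(Operator next) { this.next = next; }
        @Override public void processElement(int value) {
            next.processElement(value * 3);
        }
    }

    /** Terminal operator that accumulates a sum. */
    static final class SumSink implements Operator {
        long sum;
        @Override public void processElement(int value) {
            sum += value;
        }
    }

    /** Runs Filter -> Project -> SumSink as three separate operator objects. */
    static long runChained(int[] rows) {
        SumSink sink = new SumSink();
        Operator pipeline = new Filter(new Project(sink));
        for (int row : rows) {
            pipeline.processElement(row);
        }
        return sink.sum;
    }

    /** The same pipeline as one fused loop, with no calls between operators. */
    static long runFused(int[] rows) {
        long sum = 0;
        for (int row : rows) {
            if (row % 2 == 0) {
                sum += row * 3;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] rows = {1, 2, 3, 4, 5, 6};
        // Both variants compute the same result; only the call structure differs.
        System.out.println(runChained(rows) + " " + runFused(rows)); // prints "36 36"
    }
}
```

Whether the FLIP fuses exactly the operators of one chain, or a different set,
is the open question above; the sketch does not assume either answer.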

> If any member operator does not support codegen, generate a
> Transformation DAG based on the topological relationship of member ExecNode
> and jump to step 8.
> Step 8: Generate a FusionTransformation, setting the parallelism and managed
> memory for the fused operator.

Does "support codegen" here mean fusion codegen? And why do we generate a
FusionTransformation when a member operator does not support codegen? IMO
it should fall back to the current behavior.
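
As a minimal sketch of the fallback I have in mind (not the FLIP's algorithm):
the code below walks a simplified linear chain instead of a real ExecNode DAG,
fuses only runs of at least two adjacent operators that support fusion codegen,
and leaves every other operator exactly as it is today. The class names, the
supportsFusionCodegen flag, and marking Sort as unsupported are all assumptions
made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration only; not the planner code proposed in the FLIP. */
final class FusionFallbackSketch {

    /** A simplified stand-in for a physical operator in the plan. */
    static final class Node {
        final String name;
        final boolean supportsFusionCodegen;

        Node(String name, boolean supportsFusionCodegen) {
            this.name = name;
            this.supportsFusionCodegen = supportsFusionCodegen;
        }
    }

    /**
     * Returns the resulting plan as labels: a fused group is rendered as
     * "Fused[a+b]", while an operator that is not fused keeps its own name.
     */
    static List<String> plan(List<Node> chain) {
        List<String> result = new ArrayList<>();
        List<String> run = new ArrayList<>();
        for (Node node : chain) {
            if (node.supportsFusionCodegen) {
                run.add(node.name);
            } else {
                // An unsupported operator ends the current run and is kept as-is.
                flush(run, result);
                result.add(node.name);
            }
        }
        flush(run, result);
        return result;
    }

    /** Emits the pending run: fused if it has two or more members, unchanged otherwise. */
    static void flush(List<String> run, List<String> result) {
        if (run.size() >= 2) {
            result.add("Fused[" + String.join("+", run) + "]");
        } else {
            result.addAll(run);
        }
        run.clear();
    }

    public static void main(String[] args) {
        List<Node> chain = Arrays.asList(
                new Node("Calc", true),
                new Node("HashJoin", true),
                new Node("Sort", false), // assumed unsupported, for illustration
                new Node("HashAgg", true));
        System.out.println(plan(chain)); // prints "[Fused[Calc+HashJoin], Sort, HashAgg]"
    }
}
```

With this behavior no fused transformation is created for a group that contains
an unsupported member, which is the point of the question above.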

Finally, I share Lincoln's view on performance benchmarks.
Currently the Flink community's flink-benchmarks only covers areas such as
scheduling, state, and DataStream operator performance.
A good benchmark harness for SQL operators would benefit SQL optimizer
work and make improvements easier to observe.

Thanks,
Aitozi.


liu ron <[email protected]> wrote on Tue, Jun 6, 2023 at 19:30:

> Hi dev
>
> Thanks for all the feedback. It seems that there are no more comments, so I
> will start a vote on FLIP-315 [1] later. Thanks again.
>
> [1]: https://cwiki.apache.org/confluence/display/FLINK/FLIP-315+Support+Operator+Fusion+Codegen+for+Flink+SQL
>
> Best,
> Ron
>
> liu ron <[email protected]> wrote on Mon, Jun 5, 2023 at 16:01:
>
> > Hi Yun, Jingsong, Benchao,
> >
> > Thanks for your valuable input about this FLIP.
> >
> > First of all, let me emphasize that, from a technical implementation
> > point of view, this design is feasible in both stream and batch scenarios,
> > so I consider both modes in the FLIP. In the stream scenario, for stateful
> > operators, our business experience is that the bottleneck is usually state
> > access, so the benefit of OFCG for streaming will not be particularly
> > noticeable, and we will not prioritize supporting it for now. In the batch
> > scenario, by contrast, CPU is the bottleneck, so this optimization pays off.
> >
> > Taking the above into account, the design can support optimization in both
> > stream and batch mode, but we will give priority to supporting batch
> > operators. As Benchao said, when we find a suitable streaming business
> > scenario in the future, we can consider doing this optimization. Back to
> > Yun's question: the design will break state compatibility in stream mode,
> > as [1] did, and SQL version upgrades will not support OFCG. As mentioned
> > earlier, we will not support this feature in stream mode in the short term.
> >
> > Also, thanks to Benchao's suggestion, I will state in the FLIP that the
> > current goal of this optimization is scoped to batch mode.
> >
> > Best,
> > Ron
> >
> > liu ron <[email protected]> wrote on Mon, Jun 5, 2023 at 15:04:
> >
> >> Hi, Lincoln
> >>
> >> Thanks for your appreciation of this design. Regarding your question:
> >>
> >> > do we consider adding a benchmark for the operators to intuitively
> >> > understand the improvement brought by each improvement?
> >>
> >> I think it makes sense to add a benchmark; Spark also has such a benchmark
> >> framework. But introducing a benchmark framework in Flink is a separate
> >> topic, and we would need to start a new discussion for that work.
> >>
> >> > for the implementation plan, mentioned in the FLIP that 1.18 will
> >> > support Calc, HashJoin and HashAgg, then what will be the next step? and
> >> > which operators do we ultimately expect to cover (all or specific ones)?
> >>
> >> Our ultimate goal is to support all operators in batch mode, but we
> >> prioritize them according to their usage. Operators like Calc, HashJoin,
> >> and HashAgg are more commonly used, so we will support them first. Later
> >> we will support the rest of the operators step by step. Given the time
> >> and development workload, we can only support Calc, HashJoin, and HashAgg
> >> in 1.18. In 1.19 or 1.20, we will complete the remaining work.
> >> I will make this clear in the FLIP.
> >>
> >> Best,
> >> Ron
> >>
> >> Jingsong Li <[email protected]> wrote on Mon, Jun 5, 2023 at 14:15:
> >>
> >>> > For the state compatibility section, it seems that the checkpoint
> >>> > compatibility would be broken just like [1] did. Could FLIP-190 [2]
> >>> > still be helpful in this case for SQL version upgrades?
> >>>
> >>> I guess this is only for batch processing. Streaming should be another
> >>> story?
> >>>
> >>> Best,
> >>> Jingsong
> >>>
> >>> On Mon, Jun 5, 2023 at 2:07 PM Yun Tang <[email protected]> wrote:
> >>> >
> >>> > Hi Ron,
> >>> >
> >>> > I think this FLIP would help to improve the performance, looking
> >>> > forward to its completion in Flink!
> >>> >
> >>> > For the state compatibility section, it seems that the checkpoint
> >>> > compatibility would be broken just like [1] did. Could FLIP-190 [2]
> >>> > still be helpful in this case for SQL version upgrades?
> >>> >
> >>> >
> >>> > [1] https://docs.google.com/document/d/1qKVohV12qn-bM51cBZ8Hcgp31ntwClxjoiNBUOqVHsI/edit#heading=h.fri5rtpte0si
> >>> > [2] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=191336489
> >>> >
> >>> > Best
> >>> > Yun Tang
> >>> >
> >>> > ________________________________
> >>> > From: Lincoln Lee <[email protected]>
> >>> > Sent: Monday, June 5, 2023 10:56
> >>> > To: [email protected] <[email protected]>
> >>> > Subject: Re: [DISCUSS] FLIP-315: Support Operator Fusion Codegen for
> >>> > Flink SQL
> >>> >
> >>> > Hi Ron
> >>> >
> >>> > OFCG looks like an exciting optimization, looking forward to its
> >>> > completion in Flink!
> >>> > A small question, do we consider adding a benchmark for the operators
> >>> > to intuitively understand the improvement brought by each improvement?
> >>> > In addition, for the implementation plan, mentioned in the FLIP that
> >>> > 1.18 will support Calc, HashJoin and HashAgg, then what will be the
> >>> > next step? and which operators do we ultimately expect to cover (all
> >>> > or specific ones)?
> >>> >
> >>> > Best,
> >>> > Lincoln Lee
> >>> >
> >>> >
> >>> > liu ron <[email protected]> wrote on Mon, Jun 5, 2023 at 09:40:
> >>> >
> >>> > > Hi, Jark
> >>> > >
> >>> > > Thanks for your feedback. According to my initial assessment, the
> >>> > > work effort is relatively large.
> >>> > >
> >>> > > I will also add the test results for all queries to the FLIP.
> >>> > >
> >>> > > Best,
> >>> > > Ron
> >>> > >
> >>> > > Jark Wu <[email protected]> wrote on Thu, Jun 1, 2023 at 20:45:
> >>> > >
> >>> > > > Hi Ron,
> >>> > > >
> >>> > > > Thanks a lot for the great proposal. The FLIP looks good to me in
> >>> > > > general. It does not look like easy work, but the performance sounds
> >>> > > > promising, so I think it's worth doing.
> >>> > > >
> >>> > > > Besides, if there were a complete test graph covering all TPC-DS
> >>> > > > queries, the effect of this FLIP would be more intuitive.
> >>> > > >
> >>> > > > Best,
> >>> > > > Jark
> >>> > > >
> >>> > > >
> >>> > > >
> >>> > > > On Wed, 31 May 2023 at 14:27, liu ron <[email protected]> wrote:
> >>> > > >
> >>> > > > > Hi, Jingsong
> >>> > > > >
> >>> > > > > Thanks for your valuable suggestions.
> >>> > > > >
> >>> > > > > Best,
> >>> > > > > Ron
> >>> > > > >
> >>> > > > > Jingsong Li <[email protected]> wrote on Tue, May 30, 2023 at 13:22:
> >>> > > > >
> >>> > > > > > Thanks Ron for your information.
> >>> > > > > >
> >>> > > > > > I suggest adding it to the Motivation section of the FLIP.
> >>> > > > > >
> >>> > > > > > Best,
> >>> > > > > > Jingsong
> >>> > > > > >
> >>> > > > > > On Tue, May 30, 2023 at 9:57 AM liu ron <[email protected]> wrote:
> >>> > > > > > >
> >>> > > > > > > Hi, Jingsong
> >>> > > > > > >
> >>> > > > > > > Thanks for your review. We have tested it on TPC-DS and got
> >>> > > > > > > a 12% gain overall when supporting only the Calc, HashJoin,
> >>> > > > > > > and HashAgg operators. In some queries we even get more than
> >>> > > > > > > a 30% gain, so it looks like an effective approach.
> >>> > > > > > >
> >>> > > > > > > Best,
> >>> > > > > > > Ron
> >>> > > > > > >
> >>> > > > > > > Jingsong Li <[email protected]> wrote on Mon, May 29, 2023 at 14:33:
> >>> > > > > > >
> >>> > > > > > > > Thanks Ron for the proposal.
> >>> > > > > > > >
> >>> > > > > > > > Do you have some benchmark results for the performance
> >>> > > > > > > > improvement? I am more concerned about the improvement on
> >>> > > > > > > > Flink than the data in other papers.
> >>> > > > > > > >
> >>> > > > > > > > Best,
> >>> > > > > > > > Jingsong
> >>> > > > > > > >
> >>> > > > > > > > On Mon, May 29, 2023 at 2:16 PM liu ron <[email protected]> wrote:
> >>> > > > > > > > >
> >>> > > > > > > > > Hi, dev
> >>> > > > > > > > >
> >>> > > > > > > > > I'd like to start a discussion about FLIP-315: Support
> >>> > > > > > > > > Operator Fusion Codegen for Flink SQL [1].
> >>> > > > > > > > >
> >>> > > > > > > > > As main memory grows, query performance is increasingly
> >>> > > > > > > > > determined by the raw CPU cost of query processing itself.
> >>> > > > > > > > > This is because query processing techniques based on
> >>> > > > > > > > > interpreted execution perform poorly on modern CPUs due to
> >>> > > > > > > > > lack of locality and frequent instruction misprediction.
> >>> > > > > > > > > Therefore, the industry is also researching how to improve
> >>> > > > > > > > > engine performance by increasing operator execution
> >>> > > > > > > > > efficiency. In addition, while optimizing Flink's
> >>> > > > > > > > > performance for TPC-DS queries, we found that a significant
> >>> > > > > > > > > amount of CPU time was spent on virtual function calls,
> >>> > > > > > > > > framework collector calls, and invalid calculations, which
> >>> > > > > > > > > can be optimized to improve overall engine performance.
> >>> > > > > > > > > After some investigation, we found that Operator Fusion
> >>> > > > > > > > > Codegen, proposed by Thomas Neumann in the paper [2], can
> >>> > > > > > > > > address these problems. I have finished a PoC [3] to verify
> >>> > > > > > > > > its feasibility and validity.
> >>> > > > > > > > >
> >>> > > > > > > > > Looking forward to your feedback.
> >>> > > > > > > > >
> >>> > > > > > > > > [1]: https://cwiki.apache.org/confluence/display/FLINK/FLIP-315+Support+Operator+Fusion+Codegen+for+Flink+SQL
> >>> > > > > > > > > [2]: http://www.vldb.org/pvldb/vol4/p539-neumann.pdf
> >>> > > > > > > > > [3]: https://github.com/lsyldliu/flink/tree/OFCG
> >>> > > > > > > > >
> >>> > > > > > > > > Best,
> >>> > > > > > > > > Ron
> >>> > > > > > > >
> >>> > > > > >
> >>> > > > >
> >>> > > >
> >>> > >
> >>>
> >>
>
