Re: [DISCUSS] Introduce Hash Lookup Join

Lincoln Lee Tue, 28 Dec 2021 18:01:18 -0800

Hi Jing,
    Thanks for bringing up this discussion!  Agree that this join hints
should benefit both bounded and unbounded cases as Martin mentioned.
I also agree that implementing the query hint is the right way for a more
general purpose since the dynamic table options has a limited scope.
   Some points I'd like to share are:
1. Regarding the hint name ‘USE_HASH’, could we consider more candidates?
Things are a little different from RDBMS in the distributed world, and we
also aim to solve the data skew problem, so all these incoming hints names
should be considered together.
2. As you mentioned in the flip, this solution depends on future changes to
calcite (and also upgrading calcite would be another possible big change:
at least calicite-1.30 vs 1.26, are we preparing to accept this big
change?). Is there another possible way to minimize the change in calcite?
As I know there're more limitations than `Correlate`.


Best,
Lincoln Lee


Jing Zhang <beyond1...@gmail.com> 于2021年12月28日周二 23:04写道：

> Hi Martijn,
> Thanks a lot for your attention.
> I'm sorry I didn't explain the motivation clearly. I would like to explain
> it in detail, and then give response on your questions.
> A lookup join is typically used to enrich a table with data that is queried
> from an external system. Many Lookup table sources introduce cache in order
> to reduce the RPC call, such as JDBC, CSV, HBase connectors.
> For those connectors, we could raise cache hit ratio by routing the same
> lookup keys to the same task instance. This is the purpose of
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-204%3A+Introduce+Hash+Lookup+Join
> .
> Other cases might benefit from Hash distribution, such as batch hash join
> as you mentioned. It is a cool idea, however it is not the purpose of this
> FLIP, we could discuss this in FLINK-20670
> <https://issues.apache.org/jira/browse/FLINK-20670>.😀
>
> > - When I was reading about this topic [1] I was wondering if this feature
> would be more beneficial for bounded use cases and not so much for
> unbounded use cases. What do you think?
>
> As mentioned before, the purpose of Hash Lookup Join is to increase the
> cache hit ratio which is different from Oracle Hash Join. However we could
> use the similar hint syntax.
>
> > - If I look at the current documentation for SQL Hints in Flink [2], I
> notice that all of the hints there are located at the end of the SQL
> statement. In the FLIP, the use_hash is defined directly after the 'SELECT'
> keyword. Can we somehow make this consistent for the user? Or should the
> user be able to specify hints anywhere in its SQL statement?
>
> Calcite supports hints in two locations [3]:
> Query Hint: right after the SELECT keyword;
> Table Hint: right after the referenced table name.
> Now Flink has supported dynamic table options based on the Hint framework
> of Calcite which is mentioned in doc[2].
> Besides, query hints are also important, it could give a hint for
> optimizers to choose a better plan. Almost all popular databases and
> big-data engines support sql query hints, such as oracle, hive, spark and
> so on.
> I think using query hints in this case is more natural for users, WDYT?
>
> I have updated the motivation part in the FLIP,
> Thanks for the feedback!
>
> [1] https://logicalread.com/oracle-11g-hash-joins-mc02/#.V5Wm4_mnoUI
> [2]
>
> https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/table/sql/queries/hints/
> [3] https://calcite.apache.org/docs/reference.html#sql-hints
>
> Best,
> Jing Zhang
>
> Martijn Visser <mart...@ververica.com> 于2021年12月28日周二 22:02写道：
>
> > Hi Jing,
> >
> > Thanks a lot for the explanation and the FLIP. I definitely learned
> > something when reading more about `use_hash`. My interpretation would be
> > that the primary benefit of a hash lookup join would be improved
> > performance by allowing the user to explicitly optimise the planner.
> >
> > I have a couple of questions:
> >
> > - When I was reading about this topic [1] I was wondering if this feature
> > would be more beneficial for bounded use cases and not so much for
> > unbounded use cases. What do you think?
> > - If I look at the current documentation for SQL Hints in Flink [2], I
> > notice that all of the hints there are located at the end of the SQL
> > statement. In the FLIP, the use_hash is defined directly after the
> 'SELECT'
> > keyword. Can we somehow make this consistent for the user? Or should the
> > user be able to specify hints anywhere in its SQL statement?
> >
> > Best regards,
> >
> > Martijn
> >
> > [1] https://logicalread.com/oracle-11g-hash-joins-mc02/#.V5Wm4_mnoUI
> > [2]
> >
> >
> https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/table/sql/queries/hints/
> >
> >
> > On Tue, 28 Dec 2021 at 08:17, Jing Zhang <beyond1...@gmail.com> wrote:
> >
> > > Hi everyone,
> > > Look up join
> > > <
> > >
> >
> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/queries/joins/#lookup-join
> > > >[1]
> > > is
> > > commonly used feature in Flink SQL. We have received many optimization
> > > requirements on look up join. For example:
> > > 1. Enforces left side of lookup join do a hash partitioner to raise
> cache
> > > hint ratio
> > > 2. Solves the data skew problem after introduces hash lookup join
> > > 3. Enables mini-batch optimization to reduce RPC call
> > >
> > > Next we will solve these problems one by one. Firstly,  we would focus
> on
> > > point 1, and continue to discuss point 2 and point 3 later.
> > >
> > > There are many similar requirements from user mail list and JIRA about
> > hash
> > > Lookup Join, for example:
> > > 1. FLINK-23687 <https://issues.apache.org/jira/browse/FLINK-23687> -
> > > Introduce partitioned lookup join to enforce input of LookupJoin to
> hash
> > > shuffle by lookup keys
> > > 2. FLINK-25396 <https://issues.apache.org/jira/browse/FLINK-25396> -
> > > lookupjoin source table for pre-partitioning
> > > 3. FLINK-25262 <https://issues.apache.org/jira/browse/FLINK-25262> -
> > > Support to send data to lookup table for KeyGroupStreamPartitioner way
> > for
> > > SQL.
> > >
> > > In this FLIP, I would like to start a discussion about Hash Lookup
> Join.
> > > The core idea is introducing a 'USE_HASH' hint in query.  This syntax
> is
> > > directly user-oriented and therefore requires careful design.
> > > There are two ways about how to propagate this hint to LookupJoin in
> > > optimizer. We need further discussion to do final decide. Anyway, the
> > > difference between the two solution is only about the internal
> > > implementation and has no impact on the user.
> > >
> > > For more detail on the proposal:
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-204%3A+Introduce+Hash+Lookup+Join
> > >
> > >
> > > Looking forward to your feedback, thanks.
> > >
> > > Best,
> > > Jing Zhang
> > >
> > > [1]
> > >
> > >
> >
> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/queries/joins/#lookup-join
> > >
> >
>

Re: [DISCUSS] Introduce Hash Lookup Join

Reply via email to