Thanks Jing,
Looks good~
Best,
Jingsong
On Fri, Jan 21, 2022 at 2:00 PM Lincoln Lee wrote:
>
> Hi, Jing
> Glad to hear the agreement on the hint syntax, let's keep going!
>
> Best,
> Lincoln Lee
>
>
> Jing Zhang 于2022年1月20日周四 16:52写道:
>
> > Hi Jingsong,
> > Thanks for the feedback.
> >
> > >
Hi, Jing
Glad to hear the agreement on the hint syntax, let's keep going!
Best,
Lincoln Lee
Jing Zhang 于2022年1月20日周四 16:52写道:
> Hi Jingsong,
> Thanks for the feedback.
>
> > Is there a conclusion about naming here? (Maybe I missed something?)
> Use USE_HASH or some other names. Slightly conf
Hi Jingsong,
Thanks for the feedback.
> Is there a conclusion about naming here? (Maybe I missed something?)
Use USE_HASH or some other names. Slightly confusing in the FLIP.
'SHUFFLE_HASH' is final hint name, 'USE_HASH' is rejected. I've updated the
FLIP.
> And the problem of what to write insi
Hi Lincoln,
Thanks for the feedback.
> The previous discussion for the hint syntax
left a minor difference there: whether to use both sides of join table
names or just one 'build' side table name only. I would prefer the later
one.
Users only need to pass the `build` side table(usually the smalle
Hi Jing,
About `SHUFFLE_HASH(left_table, right_table)`, one case can be shared:
SELECT * FROM left_t
JOIN right_1 ON ...
JOIN right_2 ON ...
JOIN right_3 ON ...
What if we want to use shuffle_hash for all three joints?
SELECT /*+ SHUFFLE_HASH('left_t', 'right_1', 'right_2', 'right_3') */
Hi Jing,
Sorry for the late reply!
Is there a conclusion about naming here? (Maybe I missed something?)
Use USE_HASH or some other names. Slightly confusing in the FLIP.
And the problem of what to write inside the hint, as mentioned by lincoln.
I think maybe we can list the grammars of other di
Hi, Jing,
Sorry for the late reply! The previous discussion for the hint syntax
left a minor difference there: whether to use both sides of join table
names or just one 'build' side table name only. I would prefer the later
one.
Users only need to pass the `build` side table(usually the smalle
Hi all,
Thanks for all the feedback so far.
If there is no more suggestions, I would like to drive a vote in Tuesday
next week (18 Jan).
Best,
Jing Zhang
Jing Zhang 于2022年1月5日周三 11:33写道:
> Hi Francesco,
> Thanks a lot for the feedback.
>
> > does it makes sense for a lookup join to use hash dis
Hi Francesco,
Thanks a lot for the feedback.
> does it makes sense for a lookup join to use hash distribution whenever
is possible by default?
I prefer to enable the hash lookup join only find the hint in the query for
the following reason:
1. Plan compatibility
Add a hash shuffle by default w
Hi Jing,
Thanks for the FLIP. I'm not very knowledgeable about the topic, but going
through both the FLIP and the discussion here, I wonder, does it makes
sense for a lookup join to use hash distribution whenever is possible by
default?
The point you're explaining here:
> Many Lookup table sourc
Hi Lincoln,
Thanks for the feedback.
> 1. For the hint name, +1 for WenLong's proposal.
I've added add 'SHUFFLE_HASH' to other alternatives in FLIP. Let's waiting
for more voices here.
> Regarding the `SKEW` hint, agree with you that it can be used widely, and
I
prefer to treat it as a metadata
Hi Martijn,
Thanks for the feedback.
Glad to hear that we reached a consensus on the first and second point.
About whether to use `use_hash` as a term, I think your concern makes sense.
Although the hash lookup join is similar to Hash join in oracle that they
all require hash distribution on inpu
Hi Jing,
Thanks for your explanation!
1. For the hint name, +1 for WenLong's proposal. I think the `SHUFFLE`
keyword is important in a classic distributed computing system,
a hash-join usually means there's a shuffle stage(include shuffle
hash-join, broadcast hash-join). Users only need to pas
Hi Jing,
Thanks for explaining this in more detail and also to others
participating.
> I think using query hints in this case is more natural for users, WDYT?
Yes, I agree. As long as we properly explain in our documentation that we
support both Query Hints and Table Hints, what's the difference
Hi Jian gang,
Thanks for the feedback.
> When it comes to hive, how do you load partial data instead of the
whole data? Any change related with hive?
The question is same as Yuan mentioned before.
I prefer to drive another FLIP on this topic to further discussion
individually because this poin
Hi Wenlong,
Thanks for the feedback.
I've checked similar syntax in other systems, they are all different from
each other. It seems to be without consensus.
As mentioned in FLIP-204, oracle uses a query hint, the hint name is
'use_hash' [1].
Spark also uses a query hint, its name is 'SHUFFLE_HASH'
Thank you for the proposal, Jing. I like the idea to partition data by some
key to improve the cache hit. I have some questions:
1. When it comes to hive, how do you load partial data instead of the
whole data? Any change related with hive?
2. How to define the cache configuration? For ex
Hi, Jing, thanks for driving the discussion.
Have you made some investigation on the syntax of join hint?
Why do you choose USE_HASH from oracle instead of the style of spark
SHUFFLE_HASH, they are quite different.
People in the big data world may be more familiar with spark/hive, if we
need to ch
Hi, Lincoln
Thanks a lot for the feedback.
> Regarding the hint name ‘USE_HASH’, could we consider more candidates?
Things are a little different from RDBMS in the distributed world, and we
also aim to solve the data skew problem, so all these incoming hints names
should be considered together.
Hi Yuan and Lincoln,
thanks a lot for the attention. I would answer the email one by one.
To Yuan
> How shall we deal with CDC data? If there is CDC data in the pipeline,
IMHO, shuffle by join key will cause CDC data disorder. Will it be better
to use primary key in this case?
Good question.
The
Hi Jing,
Thanks for bringing up this discussion! Agree that this join hints
should benefit both bounded and unbounded cases as Martin mentioned.
I also agree that implementing the query hint is the right way for a more
general purpose since the dynamic table options has a limited scope.
Som
Hi Jing,
Thanks very much for your FLIP. I have some points:
- How shall we deal with CDC data? If there is CDC data in the pipeline, IMHO,
shuffle by join key will cause CDC data disorder. Will it be better to use
primary key in this case?
- If the shuffle keys can be customized when user
Hi Martijn,
Thanks a lot for your attention.
I'm sorry I didn't explain the motivation clearly. I would like to explain
it in detail, and then give response on your questions.
A lookup join is typically used to enrich a table with data that is queried
from an external system. Many Lookup table sour
Hi Jing,
Thanks a lot for the explanation and the FLIP. I definitely learned
something when reading more about `use_hash`. My interpretation would be
that the primary benefit of a hash lookup join would be improved
performance by allowing the user to explicitly optimise the planner.
I have a coup
24 matches
Mail list logo