[ 
https://issues.apache.org/jira/browse/CALCITE-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913391#comment-16913391
 ] 

Ruben Quesada Lopez commented on CALCITE-2973:
----------------------------------------------

[~hhlai1990], thanks for this PR, I think it generally looks in a good shape. I 
have one small concern though regarding the new implementation of 
{{EnumerableDefaults#hashJoin_}}.
TLDR; I think this change may impact performance of RIGHT / FULL hash equi 
joins.

The addition of the new predicate to support all types of join conditions (and 
not just equi-joins) requires to change the {{Set<TKey> unmatchedKeys}} into a 
{{List<TInner> innersUnmatched}}. If I understand correctly, it is required to 
do so because with this change we may have two inner results with the same TKey 
(which is based on the equi-condition), one being a match and the other not 
being an actual match due to the new extra (non-equi) predicate. This 
{{innersUnmatched}} list will be used in RIGHT / FULL joins to keep the results 
from the right input which had no match, in order to finally generate the 
results with null on the left. In case of INNER / LEFT join, this unmatch List 
(or Set in the previous version) will not be used, so what I am about to say is 
not really relevant. 
The problem with the new {{List<TInner> innersUnmatched}} for the RIGHT / FULL 
joins is that it will be pre-filled with ALL the right input (inner) results, 
and the process will remove the ones which actually had a match. Previously, 
this removal was implemented via a {{HashSet#remove(TKey)}}, which has a better 
performance than the new {{ArrayList#removeAll(List<TInner>)}}, specially in 
cases of RIGHT / FULL joins on big, big tables. This shall happen even if the 
new predicate is null (i.e. we are dealing with a pure equi-join). What I am 
trying to say is that, with this change, it can be expected a drop in the 
performance of RIGHT / FULL hash equi joins, compared to previous calcite 
versions, specially on big data volumes.
I have not make any measurements, so I'm not sure if the performance impact 
will be relevant or not, so maybe I am making a big deal out of nothing. 
A possible solution (although I'm not 100% convinced) could be keeping the 
"old" EnumerableDefaults method for hash joins (maybe rename it as 
{{hashEquiJoin_}}) and use it for pure equi joins. The new method {{hashJoin_}} 
in the PR with the extra predicate will implement any type of join condition 
(and will redirect to the previous one if predicate is null, i.e. if this is 
actually an equi join). Pros: same performance for hash equi joins is 
guaranteed. Cons: code duplicated, complex maintainability, etc.


> Allow theta joins that have equi conditions to be executed using a hash join 
> algorithm
> --------------------------------------------------------------------------------------
>
>                 Key: CALCITE-2973
>                 URL: https://issues.apache.org/jira/browse/CALCITE-2973
>             Project: Calcite
>          Issue Type: New Feature
>          Components: core
>    Affects Versions: 1.19.0
>            Reporter: Lai Zhou
>            Priority: Minor
>              Labels: pull-request-available
>          Time Spent: 6.5h
>  Remaining Estimate: 0h
>
> Now the EnumerableMergeJoinRule only supports an inner and equi join.
> If users make a theta-join query  for a large dataset (such as 10000*10000), 
> the nested-loop join process will take dozens of time than the sort-merge 
> join process .
> So if we can apply merge-join or hash-join rule for a theta join, it will 
> improve the performance greatly.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

Reply via email to