[ 
https://issues.apache.org/jira/browse/IGNITE-8732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vyacheslav Koptilin updated IGNITE-8732:
----------------------------------------
    Fix Version/s: 2.11

> SQL: REPLICATED cache cannot be left-joined to PARTITIONED
> ----------------------------------------------------------
>
>                 Key: IGNITE-8732
>                 URL: https://issues.apache.org/jira/browse/IGNITE-8732
>             Project: Ignite
>          Issue Type: Improvement
>          Components: sql
>    Affects Versions: 2.5
>            Reporter: Vladimir Ozerov
>            Assignee: Stanislav Lukyanov
>            Priority: Major
>              Labels: sql-engine
>             Fix For: 2.11
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> *Steps to reproduce*
>  # Run 
> {{org.apache.ignite.sqltests.ReplicatedSqlTest#testLeftJoinReplicatedPartitioned}}
>  # Observe that we have 2x results on 2-node cluster
> *Root Cause*
>  {{left LEFT JOIN right ON cond}} operation assumes full scan of of a left 
> expression. Currently we perform this scan on every node and then simply 
> merge results on reducer. Two nodes, two scans of {{REPLICATED}} cache, 2x 
> results.
> *Potential Solutions*
>  We may consider several solutions. Deeper analysis is required to understand 
> which is the right one.
>  # Perform deduplication on reducer - this most prospective and general 
> technique, described in more details below
>  # Treat {{REPLICATED}} cache as {{PARTITIONED}}. Essentially, we just need 
> to pass proper backup filter. But what if {{REPLICATED}} cache spans more 
> nodes than {{PARTITIONED}}? We cannot rely on primary/backup in this case
>  # Implement additional execution phase as follows:
> {code:java}
> SELECT left.cols, right.cols FROM left INNER JOIN right ON cond;              
>             // Get "inner join" part
> UNION
> UNICAST SELECT left.cols, [NULL].cols FROM left WHERE left.id NOT IN ([ids 
> from the first phase]) // Get "outer join" part
> {code}
> *Reducer Deduplication*
>  The idea is to get all data locally and then perform final deduplication. 
> This may incur high network overhead, because of lot of duplicated left parts 
> would be transferred. However, this could be optimized greatly with the 
> following techniques applied one after another
>  # Semi-jions: {{left}} is {{joined}} on mapper node, but instead of sending 
> {{(left, right)}} relation, we send {{(left) + (right)}}
>  # In case {{left}} part is known to be idempotent (i.e. it produces the same 
> result set on all nodes), only one node will send {{(left) + (right)}}, other 
> nodes will send {{(right)}} only
>  # Merge {{left}} results with if needed (i.e. if idempotence-related opto 
> was not applicable)
>  # Join {{left}} and {{right}} parts on reducer
> *UPDATE*
> After a few attempts at the implementation, the solution of treating 
> REPLICATED cache as PARTITIONED looks the most practical. The solution works 
> in a limited case:
>  * REPLICATED and PARTITIONED both have the same affinity function, number of 
> partitions, node filter
>  ** Note that REPLICATED has a different number of partitions by default
>  * The JOIN is done on an affinity column of both caches
>  ** Note that users often don’t create affinity keys for REPLICATED caches 
> today
>  * distributedJoins=false (distributed joins aren’t supported for now)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to