Maciek Kocon created HIVE-9523:
----------------------------------

             Summary: when columns on which tables are partitioned are used in 
the join condition same join optimizations as for bucketed tables should be 
applied
                 Key: HIVE-9523
                 URL: https://issues.apache.org/jira/browse/HIVE-9523
             Project: Hive
          Issue Type: Improvement
          Components: Logical Optimizer, Physical Optimizer, SQL
    Affects Versions: 0.13.1, 0.14.0, 0.13.0
            Reporter: Maciek Kocon


For JOIN conditions where partitioning criteria are used respectively:
            ⋮ 
FROM TabA JOIN TabB
   ON TabA.partCol1 = TabB.partCol2
   AND TabA.partCol2 = TabB.partCol2

the optimizer could/should choose to treat it the same way as with bucketed 
tables: ⋮ 
FROM TabC
  JOIN TabD
     ON TabC.clusteredByCol1 = TabD.clusteredByCol2
   AND TabC.clusteredByCol2 = TabD.clusteredByCol2

and use either Bucket Map Join or better, the Sort Merge Bucket Map Join.

This is based on fact that same way as buckets translate to separate files, the 
partitions essentially provide the same mapping.
When data locality is known the optimizer could focus only on joining 
corresponding partitions rather than whole data sets.

#side notes:
⦿ Currently Table DDL Syntax where Partitioning and Bucketing defined at the 
same time is allowed:
CREATE TABLE
 ⋮
PARTITIONED BY(…) CLUSTERED BY(…) INTO … BUCKETS;

But in this case optimizer never chooses to use Bucket Map Join or Sort Merge 
Bucket Map Join which defeats the purpose of creating BUCKETed tables in such 
scenarios. Should that be raised as a separate BUG?

⦿ Currently partitioning and bucketing are two separate things but serve same 
purpose - shouldn't the concept be merged (explicit/implicit partitions?)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to