Maciek Kocon created HIVE-9523: ---------------------------------- Summary: when columns on which tables are partitioned are used in the join condition same join optimizations as for bucketed tables should be applied Key: HIVE-9523 URL: https://issues.apache.org/jira/browse/HIVE-9523 Project: Hive Issue Type: Improvement Components: Logical Optimizer, Physical Optimizer, SQL Affects Versions: 0.13.1, 0.14.0, 0.13.0 Reporter: Maciek Kocon
For JOIN conditions where partitioning criteria are used respectively: ⋮ FROM TabA JOIN TabB ON TabA.partCol1 = TabB.partCol2 AND TabA.partCol2 = TabB.partCol2 the optimizer could/should choose to treat it the same way as with bucketed tables: ⋮ FROM TabC JOIN TabD ON TabC.clusteredByCol1 = TabD.clusteredByCol2 AND TabC.clusteredByCol2 = TabD.clusteredByCol2 and use either Bucket Map Join or better, the Sort Merge Bucket Map Join. This is based on fact that same way as buckets translate to separate files, the partitions essentially provide the same mapping. When data locality is known the optimizer could focus only on joining corresponding partitions rather than whole data sets. #side notes: ⦿ Currently Table DDL Syntax where Partitioning and Bucketing defined at the same time is allowed: CREATE TABLE ⋮ PARTITIONED BY(…) CLUSTERED BY(…) INTO … BUCKETS; But in this case optimizer never chooses to use Bucket Map Join or Sort Merge Bucket Map Join which defeats the purpose of creating BUCKETed tables in such scenarios. Should that be raised as a separate BUG? ⦿ Currently partitioning and bucketing are two separate things but serve same purpose - shouldn't the concept be merged (explicit/implicit partitions?) -- This message was sent by Atlassian JIRA (v6.3.4#6332)