[jira] [Created] (HIVE-18049) Enable Hive on Tez to provide globally sorted clustered table

LingXiao Lan (JIRA) Sun, 12 Nov 2017 20:29:18 -0800

LingXiao Lan created HIVE-18049:
-----------------------------------

             Summary: Enable Hive on Tez to provide globally sorted clustered table
                 Key: HIVE-18049
                 URL: https://issues.apache.org/jira/browse/HIVE-18049
             Project: Hive
          Issue Type: Improvement
          Components: Hive, Tez
            Reporter: LingXiao Lan
             Fix For: 2.1.1



CREATE TABLE `test`(
   `time` int,
   `userid` bigint)
 CLUSTERED BY (
   userid)
 SORTED BY (
   userid ASC)
 INTO 4 BUCKETS
 ;
When data is inserted into this table, it is automatically distributed into 4 
buckets and sorted within each bucket. But because Hive uses a hash partitioner 
by default, the data is sorted only within each bucket, not across buckets. 
Sometimes we need the data to be globally sorted, for example to optimize 
indexing.
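
To make the problem concrete, here is a minimal Python sketch of hash 
bucketing. It is an illustration only, not Hive code: hash_bucket and 
write_hash_buckets are hypothetical names, and a plain modulo stands in for 
Hive's real bucketing hash (an assumption that is only meant to mimic the 
behaviour for small positive integer keys).

```python
def hash_bucket(userid: int, num_buckets: int = 4) -> int:
    # Assumption: modulo stands in for the hash partitioner on integer keys.
    return userid % num_buckets


def write_hash_buckets(userids, num_buckets=4):
    """Distribute keys by hash, then sort inside each bucket (SORTED BY)."""
    buckets = [[] for _ in range(num_buckets)]
    for uid in userids:
        buckets[hash_bucket(uid, num_buckets)].append(uid)
    return [sorted(b) for b in buckets]


userids = [17, 4, 9, 22, 1, 12, 7, 30]
buckets = write_hash_buckets(userids)

# Reading the bucket files back in bucket order.
concatenated = [uid for b in buckets for uid in b]

print(buckets)       # [[4, 12], [1, 9, 17], [22, 30], [7]]
print(concatenated)  # [4, 12, 1, 9, 17, 22, 30, 7]
```

Every bucket is sorted, but reading the buckets in order does not yield a 
sorted sequence, because neighbouring keys are scattered across buckets.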

If we sample the table first and use TotalOrderPartitioner, this can be done. 
The difficulty is deciding automatically when to use TotalOrderPartitioner and 
when not to, because an insert query can be complex, which results in a complex 
DAG in Tez.
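
The sample-then-range-partition idea can be sketched in Python as follows. 
This is a simplified illustration, not the Hadoop or Tez implementation: 
choose_split_points, range_bucket and write_range_buckets are hypothetical 
names, and the sample size and seed are arbitrary assumptions.

```python
import bisect
import random


def choose_split_points(sample, num_buckets):
    """Pick num_buckets - 1 ascending boundaries from a sample of the keys."""
    ordered = sorted(sample)
    step = len(ordered) / num_buckets
    return [ordered[int(step * i)] for i in range(1, num_buckets)]


def range_bucket(key, split_points):
    # Keys below the first boundary go to bucket 0; a key equal to a
    # boundary goes to the bucket on its right.
    return bisect.bisect_right(split_points, key)


def write_range_buckets(keys, split_points):
    """Distribute keys by range, then sort inside each bucket."""
    buckets = [[] for _ in range(len(split_points) + 1)]
    for key in keys:
        buckets[range_bucket(key, split_points)].append(key)
    return [sorted(b) for b in buckets]


userids = [17, 4, 9, 22, 1, 12, 7, 30]
sample = random.Random(0).sample(userids, 6)   # assumed sample size
split_points = choose_split_points(sample, 4)
buckets = write_range_buckets(userids, split_points)

# Bucket i only holds keys no larger than any key in bucket i + 1, so
# reading the buckets in order yields a globally sorted sequence.
concatenated = [uid for b in buckets for uid in b]
print(concatenated == sorted(userids))
```

The sample only affects how evenly the buckets are filled. Global order 
holds for any ascending list of split points.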

I have implemented a temporary version. It uses a custom partitioner that 
combines the hash partitioner and the total-order partitioner. A physical 
optimizer is added to Hive to choose which partitioner to use. But to reduce 
the workload, this version modifies the Tez source code, which should not 
actually be necessary.
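
The issue does not include the implementation, so the following Python 
sketch is only a guess at the general shape of such a combined partitioner, 
not the reporter's code. HybridPartitioner and its parameters are 
hypothetical. The assumption is that the optimizer passes split points when 
it wants total ordering and passes nothing otherwise.

```python
import bisect


class HybridPartitioner:
    """Hypothetical partitioner: range-based if split points are given,
    hash-based otherwise."""

    def __init__(self, num_partitions, split_points=None):
        if split_points is not None and len(split_points) != num_partitions - 1:
            raise ValueError("need exactly num_partitions - 1 split points")
        self.num_partitions = num_partitions
        self.split_points = split_points

    def get_partition(self, key):
        if self.split_points is not None:
            # Total-order mode: the partition index follows the key range.
            return bisect.bisect_right(self.split_points, key)
        # Hash mode. Assumption: modulo stands in for the real hash.
        return key % self.num_partitions


hash_mode = HybridPartitioner(4)
range_mode = HybridPartitioner(4, split_points=[5, 10, 20])

print(hash_mode.get_partition(22))   # 2
print(range_mode.get_partition(22))  # 3
```

Keeping the choice in one constructor argument means the decision can be 
made once per DAG edge at planning time, which is where the difficulty 
described above lies.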

I'm wondering if we can implement a more general version that addresses this 
issue.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
