[jira] [Updated] (HIVE-18049) Enable Hive on Tez to provide globally sorted clustered table

LingXiao Lan (JIRA) Sun, 12 Nov 2017 23:22:49 -0800

     [ 
https://issues.apache.org/jira/browse/HIVE-18049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


LingXiao Lan updated HIVE-18049:
--------------------------------
    Attachment: CombinedPartitioner.txt
                tez-0.8.5.txt

> Enable Hive on Tez to provide globally sorted clustered table
> -------------------------------------------------------------
>
>                 Key: HIVE-18049
>                 URL: https://issues.apache.org/jira/browse/HIVE-18049
>             Project: Hive
>          Issue Type: Improvement
>          Components: Hive, Tez
>            Reporter: LingXiao Lan
>             Fix For: 2.1.1
>
>         Attachments: CombinedPartitioner.txt, HIVE-18049.1.patch, 
> tez-0.8.5.txt
>
>
> {code:sql}
> CREATE TABLE `test`(
>    `time` int,
>    `userid` bigint)
>  CLUSTERED BY (
>    userid)
>  SORTED BY (
>    userid ASC)
>  INTO 4 BUCKETS
>  ;
> {code}
> When data is inserted into this table, it is automatically sorted into 4 
> buckets. However, because Hive uses a hash partitioner by default, the data 
> is sorted only within each bucket, not across buckets. 
> Sometimes we need the data to be globally sorted, for example to optimize 
> indexing.
> If we could sample the table first and use TotalOrderPartitioner, this 
> could be done. The difficulty is deciding automatically when to use 
> TotalOrderPartitioner and when not to, because an insertion query can be 
> complex, which results in a complex DAG in Tez.
> I have implemented a temporary version. It uses a custom partitioner that 
> combines the hash partitioner and the total-order partitioner. A physical 
> optimizer is added to Hive to decide which partitioner to use. However, to 
> reduce the workload, this version changes the Tez source code, which is not 
> strictly necessary.
> I'm wondering if we can implement a more general version that addresses 
> this issue.
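
The difference between the two partitioning strategies described above can be illustrated with a small, self-contained sketch. This is not the attached patch or Hive's or Tez's actual code: the function names, the modulo hash, and the sampling scheme are all assumptions chosen for illustration. It only shows why hash partitioning yields buckets that are sorted internally but not across buckets, while split points taken from a sample give a total order.

```python
import bisect
import random


def hash_partition(key, num_buckets):
    # Illustrative stand-in for a hash partitioner (assumption: plain modulo;
    # Hive's real bucket hash function may differ).
    return key % num_buckets


def sample_split_points(keys, num_buckets, sample_size, seed=0):
    # Draw a random sample, sort it, and pick num_buckets - 1 evenly spaced
    # split points, in the spirit of a total-order partitioner.
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    step = len(sample) / num_buckets
    return [sample[int(step * i)] for i in range(1, num_buckets)]


def range_partition(key, split_points):
    # Bucket i receives the keys between split point i-1 and split point i.
    return bisect.bisect_right(split_points, key)


def bucketize(keys, num_buckets, assign):
    # Assign each key to a bucket, then sort inside each bucket (SORTED BY).
    buckets = [[] for _ in range(num_buckets)]
    for k in keys:
        buckets[assign(k)].append(k)
    return [sorted(b) for b in buckets]


def is_globally_sorted(buckets):
    # True if reading bucket 0, 1, 2, ... in order yields sorted data.
    flat = [k for b in buckets for k in b]
    return flat == sorted(flat)


# Hypothetical userid values, shuffled to mimic unordered input.
userids = list(range(100, 140))
random.Random(1).shuffle(userids)

hashed = bucketize(userids, 4, lambda k: hash_partition(k, 4))
splits = sample_split_points(userids, 4, sample_size=20)
ranged = bucketize(userids, 4, lambda k: range_partition(k, splits))

print(is_globally_sorted(hashed))  # False: sorted only within each bucket
print(is_globally_sorted(ranged))  # True: buckets follow the key order
```

With range partitioning the bucket sizes depend on how well the sample reflects the key distribution, which is one reason the sampling step matters.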



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
