[ https://issues.apache.org/jira/browse/HIVE-18049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
LingXiao Lan updated HIVE-18049:
--------------------------------
    Description:
{code:sql}
CREATE TABLE `test`(
  `time` int,
  `userid` bigint)
CLUSTERED BY (
  userid)
SORTED BY (
  userid ASC)
INTO 4 BUCKETS
;
{code}

When data is inserted into this table, it is automatically sorted into 4 buckets. But because Hive uses a hash partitioner by default, the data is sorted only within each bucket, not across buckets.

Sometimes we need the data to be globally sorted, to optimize indexing, for example.

If we can sample the table first and use TotalOrderPartitioner, this can be done. The difficulty is deciding automatically when to use TotalOrderPartitioner and when not to, because an insertion query can be complex, which results in a complex DAG in Tez.

I have implemented a temporary version. It uses a custom partitioner that combines the hash partitioner and the total-order partitioner. A physical optimizer added to Hive decides which partitioner to use. But to reduce the workload, this version touches the Tez source code, which should not actually be necessary.

I'm wondering if we can implement a more general version that addresses this issue.

  was:
CREATE TABLE `test`(
  `time` int,
  `userid` bigint)
CLUSTERED BY (
  userid)
SORTED BY (
  userid ASC)
INTO 4 BUCKETS
;

When data is inserted into this table, it is automatically sorted into 4 buckets. But because Hive uses a hash partitioner by default, the data is sorted only within each bucket, not across buckets.

Sometimes we need the data to be globally sorted, to optimize indexing, for example.

If we can sample the table first and use TotalOrderPartitioner, this can be done. The difficulty is deciding automatically when to use TotalOrderPartitioner and when not to, because an insertion query can be complex, which results in a complex DAG in Tez.

I have implemented a temporary version.
It uses a custom partitioner that combines the hash partitioner and the total-order partitioner. A physical optimizer added to Hive decides which partitioner to use. But to reduce the workload, this version touches the Tez source code, which should not actually be necessary.

I'm wondering if we can implement a more general version that addresses this issue.


> Enable Hive on Tez to provide globally sorted clustered table
> -------------------------------------------------------------
>
>                 Key: HIVE-18049
>                 URL: https://issues.apache.org/jira/browse/HIVE-18049
>             Project: Hive
>          Issue Type: Improvement
>          Components: Hive, Tez
>            Reporter: LingXiao Lan
>             Fix For: 2.1.1
>
>
> {code:sql}
> CREATE TABLE `test`(
>   `time` int,
>   `userid` bigint)
> CLUSTERED BY (
>   userid)
> SORTED BY (
>   userid ASC)
> INTO 4 BUCKETS
> ;
> {code}
> When data is inserted into this table, it is automatically sorted into 4
> buckets. But because Hive uses a hash partitioner by default, the data is
> sorted only within each bucket, not across buckets.
> Sometimes we need the data to be globally sorted, to optimize indexing, for
> example.
> If we can sample the table first and use TotalOrderPartitioner, this can be
> done. The difficulty is deciding automatically when to use
> TotalOrderPartitioner and when not to, because an insertion query can be
> complex, which results in a complex DAG in Tez.
> I have implemented a temporary version. It uses a custom partitioner that
> combines the hash partitioner and the total-order partitioner. A physical
> optimizer added to Hive decides which partitioner to use. But to reduce the
> workload, this version touches the Tez source code, which should not
> actually be necessary.
> I'm wondering if we can implement a more general version that addresses this
> issue.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
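
For readers unfamiliar with the distinction drawn in the description, the following is a minimal Python sketch of hash partitioning versus sample-based range partitioning (the idea behind TotalOrderPartitioner). It is not Hive or Tez code and does not reflect the temporary implementation mentioned in the issue. All function names are hypothetical, and `key % num_buckets` is a simplified stand-in for Hive's actual bucket hashing.

```python
import bisect
import random


def hash_partition(key, num_buckets):
    # Simplified stand-in for a hash partitioner (assumption: not Hive's real hash).
    return key % num_buckets


def sample_split_points(keys, num_buckets, sample_size, seed=0):
    # Draw a random sample of the keys, sort it, and pick num_buckets - 1
    # evenly spaced cut points from the sorted sample.
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    return [sample[(i * len(sample)) // num_buckets] for i in range(1, num_buckets)]


def range_partition(key, split_points):
    # Bucket index = number of split points less than or equal to the key,
    # so every key in bucket i is smaller than every key in bucket i + 1.
    return bisect.bisect_right(split_points, key)


def write_buckets(keys, num_buckets, partition):
    # Route each key to a bucket, then sort within each bucket
    # (mirroring SORTED BY inside a bucket).
    buckets = [[] for _ in range(num_buckets)]
    for k in keys:
        buckets[partition(k)].append(k)
    return [sorted(b) for b in buckets]


def globally_sorted(buckets):
    # True if reading the buckets in order yields a fully sorted sequence.
    flat = [k for b in buckets for k in b]
    return flat == sorted(flat)


# Hypothetical data: 100 distinct user ids in random order, 4 buckets.
user_ids = list(range(1000, 1100))
random.Random(1).shuffle(user_ids)

hashed = write_buckets(user_ids, 4, lambda k: hash_partition(k, 4))
splits = sample_split_points(user_ids, 4, sample_size=20)
ranged = write_buckets(user_ids, 4, lambda k: range_partition(k, splits))

print("hash partitioning globally sorted:", globally_sorted(hashed))
print("range partitioning globally sorted:", globally_sorted(ranged))
```

In this sketch both layouts are sorted within each bucket, but only the range-partitioned layout is sorted when the buckets are read in order. The price is the extra sampling pass, and bucket sizes are only as balanced as the sample is representative.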