[ 
https://issues.apache.org/jira/browse/HIVE-2775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiaoyu wang updated HIVE-2775:
------------------------------

    Description: 
Currently, a Hive bucketed table requires the number of files to match the
bucket number in order to sample correctly. This is very restrictive: for
example, we can only populate the table using a fixed number of reducers,
which can be a bottleneck.

The idea is to introduce the concepts of "physical bucket" and "logical
bucket". The "physical bucket" count is the number of files on disk, and the
"logical bucket" count is the number of buckets stored in the metadata for the
bucketed table. By allowing the "physical bucket" count to be a multiple of
the "logical bucket" count, we can sample correctly and still scale up the
number of reducers.


  was:
Currently, hive bucketed table requires the number of files to match the bucket 
number in order to for correct sampling. This is very restrictive. e.g. we can 
only populate the table using a fix number of reducer, which can be a 
bottleneck. 

The idea is to introduce this "physical bucket" and "logical bucket" concept. 
"physical bucket" is the number of files and "logical bucket" is the number of 
bucket stored in meda-data for bucketed table. By allowing "physical bucket" to 
be a multiple of "logical bucket", we can do correct sampling as well as 
scaling up. 


    
> allow the number of files to be a multiple of bucketed table
> ------------------------------------------------------------
>
>                 Key: HIVE-2775
>                 URL: https://issues.apache.org/jira/browse/HIVE-2775
>             Project: Hive
>          Issue Type: New Feature
>          Components: Metastore
>            Reporter: xiaoyu wang
>
> Currently, a Hive bucketed table requires the number of files to match the
> bucket number in order to sample correctly. This is very restrictive: for
> example, we can only populate the table using a fixed number of reducers,
> which can be a bottleneck.
> The idea is to introduce the concepts of "physical bucket" and "logical
> bucket". The "physical bucket" count is the number of files on disk, and the
> "logical bucket" count is the number of buckets stored in the metadata for
> the bucketed table. By allowing the "physical bucket" count to be a multiple
> of the "logical bucket" count, we can sample correctly and still scale up
> the number of reducers.
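
The relationship between physical and logical buckets described above can be
illustrated with a small sketch. This is not Hive code and not the proposed
implementation. It assumes that each row is written to the file whose index is
hash % num_physical, and that the logical bucket is hash % num_logical. The
function names and parameters are hypothetical. Under those assumptions, when
num_physical is a multiple of num_logical, every row in a given file belongs
to the same logical bucket, so sampling one logical bucket means reading a
fixed set of files.

```python
def physical_file_of(hash_value, num_physical):
    """Assumed write rule: a row lands in file (hash % num_physical)."""
    return hash_value % num_physical


def logical_bucket_of(hash_value, num_logical):
    """Assumed sampling rule: a row belongs to bucket (hash % num_logical)."""
    return hash_value % num_logical


def physical_files_for_bucket(logical_bucket, num_logical, num_physical):
    """Return the file indices that together hold one logical bucket.

    Hypothetical helper for illustration only.
    """
    if num_logical <= 0 or num_physical <= 0:
        raise ValueError("bucket counts must be positive")
    if num_physical % num_logical != 0:
        raise ValueError("physical count must be a multiple of logical count")
    if not 0 <= logical_bucket < num_logical:
        raise ValueError("logical bucket out of range")
    # Because num_logical divides num_physical,
    # (h % num_physical) % num_logical == h % num_logical,
    # so file f holds only rows of logical bucket f % num_logical.
    return list(range(logical_bucket, num_physical, num_logical))
```

For example, with 4 logical buckets and 12 files on disk,
physical_files_for_bucket(1, 4, 12) returns [1, 5, 9]: logical bucket 1 is
spread over three files, and 12 reducers could populate the table instead of 4.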


        
