[ 
https://issues.apache.org/jira/browse/HIVE-3502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13461708#comment-13461708
 ] 

Namit Jain commented on HIVE-3502:
----------------------------------

A very useful follow-up optimization would be:

For any Hive query that requires more than one MR job, the second MR job
typically has an identity mapper, and most of the work is done in the
reducer. If the output of the first MR job can be bucketized based on the
requirements of the second MR job, the second MR job does not need a reducer
at all.
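
To make the idea concrete, here is a minimal Python sketch of why a
pre-bucketized first-stage output removes the need for a shuffle. It is an
illustration only, not Hive code: the names bucket_of, first_job_output and
second_job_map_only are hypothetical, CRC32 stands in for whatever hash Hive
uses, and the second job is assumed to be a simple per-key sum.

```python
import zlib
from collections import defaultdict


def bucket_of(key, num_buckets):
    # Stable hash so every task agrees on the bucket (CRC32 is an assumption,
    # not the hash function Hive itself uses).
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def first_job_output(rows, num_buckets):
    # Output of the first MR job, bucketized on the key that the
    # second job groups by.
    buckets = [[] for _ in range(num_buckets)]
    for key, value in rows:
        buckets[bucket_of(key, num_buckets)].append((key, value))
    return buckets


def second_job_map_only(bucket):
    # One task per bucket. Every row for a given key is already in this
    # bucket, so the aggregation can finish here without a reducer.
    totals = defaultdict(int)
    for key, value in bucket:
        totals[key] += value
    return dict(totals)


rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
results = [second_job_map_only(b) for b in first_job_output(rows, 3)]
```

Because no key appears in more than one bucket, the per-bucket results can
simply be concatenated. The sketch aggregates in memory; a real map-only
task would also need the bucket to fit a hash aggregation or to be sorted.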
                
> design efficient bucketing techniques
> -------------------------------------
>
>                 Key: HIVE-3502
>                 URL: https://issues.apache.org/jira/browse/HIVE-3502
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Namit Jain
>
> Currently, the bucketing techniques are fairly expensive: the bucketing keys
> have to be the same as the reduction keys, and bucketization requires a
> full-blown map-reduce job.
> It should be possible to perform map-side bucketization. The high-level idea
> is to shard the data based on the number of buckets and create a
> sub-directory for each bucket. Then the data from all the mappers (in the
> same sub-directory) can be merged.
> So, instead of having 1 file per bucket, it would lead to 1 directory per
> bucket.
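
The map-side scheme described in the issue could look roughly like the
following Python sketch. It is a sketch under stated assumptions, not the
proposed Hive implementation: the bucket_NNNNN/mapper_NNNNN directory layout,
the tab-separated row format, the CRC32 hash and the function names are all
made up for illustration.

```python
import os
import zlib


def bucket_for(key, num_buckets):
    # Stable hash so all mappers shard identically (CRC32 is an assumption).
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def map_side_bucketize(mapper_id, rows, num_buckets, out_dir):
    # Each mapper writes its rows into one file per bucket sub-directory:
    #   out_dir/bucket_<b>/mapper_<id>   (hypothetical layout)
    handles = {}
    try:
        for key, value in rows:
            b = bucket_for(key, num_buckets)
            if b not in handles:
                bucket_dir = os.path.join(out_dir, f"bucket_{b:05d}")
                os.makedirs(bucket_dir, exist_ok=True)
                path = os.path.join(bucket_dir, f"mapper_{mapper_id:05d}")
                handles[b] = open(path, "w")
            handles[b].write(f"{key}\t{value}\n")
    finally:
        for handle in handles.values():
            handle.close()


def read_bucket(out_dir, b):
    # A bucket is the union of all mapper files in its sub-directory.
    bucket_dir = os.path.join(out_dir, f"bucket_{b:05d}")
    rows = []
    if not os.path.isdir(bucket_dir):
        return rows
    for name in sorted(os.listdir(bucket_dir)):
        with open(os.path.join(bucket_dir, name)) as f:
            for line in f:
                key, value = line.rstrip("\n").split("\t", 1)
                rows.append((key, value))
    return rows
```

No reducer is involved: each mapper only appends to its own file, and the
optional merge step is a concatenation of the files inside one sub-directory.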

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira