[ 
https://issues.apache.org/jira/browse/HIVE-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13028556#comment-13028556
 ] 

Siying Dong commented on HIVE-2146:
-----------------------------------

For 2), the more likely reason the input can't be sampled is that 
CombineHiveInputFormat.getSplits() ends up calling super.getSplits() for 
some reason. In those cases, the data is not sampled at all. 
Another possibility is that, for example, two aliases of the MapReduce job 
include the same directory. We can't sample it then.

For 1) and 3), I thought about it more. I'll remove the extra bytesPerReducer 
that was added. The worst case is that we run one fewer reducer, which 
shouldn't be too bad.
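
To make the "one fewer reducer" point concrete, here is a small arithmetic sketch. It assumes the reducer count is a ceiling division of the estimated input size by bytesPerReducer and that the "extra" is one additional bytesPerReducer added to that size; both are assumptions for illustration, and the helper names are hypothetical, not taken from the patch.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b for positive b."""
    return -(-a // b)


def reducers_for(input_bytes: int, bytes_per_reducer: int, padding: int = 0) -> int:
    # Hypothetical helper: at least one reducer, otherwise one reducer
    # per bytes_per_reducer of (optionally padded) input.
    return max(1, ceil_div(input_bytes + padding, bytes_per_reducer))


bpr = 1_000_000_000  # assumed bytesPerReducer value (1 GB), for illustration only
sizes = (1, 500_000_000, 1_000_000_000, 2_500_000_000)

for size in sizes:
    padded = reducers_for(size, bpr, padding=bpr)  # with the extra bytesPerReducer
    plain = reducers_for(size, bpr)                # with the extra removed
    print(size, padded, plain)
```

Under these assumptions, dropping the padding lowers the count by exactly one for any non-empty input, and never below one reducer.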

> Block Sampling should adjust number of reducers accordingly to make it useful
> -----------------------------------------------------------------------------
>
>                 Key: HIVE-2146
>                 URL: https://issues.apache.org/jira/browse/HIVE-2146
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Siying Dong
>            Assignee: Siying Dong
>         Attachments: HIVE-2146.1.patch
>
>
> Currently the number of reducers is not adjusted for block sampling, so 
> queries like:
> select c from tab tablesample(1 percent) group by c;
> can generate a huge number of reducers even though the input is sampled 
> down to a small size.
> We need to shrink the number of reducers to make block sampling more useful.
> Since the number of reducers is currently determined before the splits are 
> computed, the way to do it probably won't be very clean, but we can make a 
> good guess.
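
The adjustment described in the issue can be sketched roughly as follows: scale the total input size by the sampling percentage before estimating the number of reducers. This is an illustrative guess, not the code in HIVE-2146.1.patch; the function name and parameters are hypothetical, and bytes_per_reducer and max_reducers are assumed to play the role of Hive's hive.exec.reducers.bytes.per.reducer and hive.exec.reducers.max settings.

```python
import math


def estimate_reducers(total_input_bytes: int,
                      sample_percent: float,
                      bytes_per_reducer: int,
                      max_reducers: int) -> int:
    """Guess a reducer count from the expected sampled input size."""
    # Expected size after block sampling; only an estimate, because the
    # real sampled size is not known until the splits are computed.
    sampled_bytes = total_input_bytes * sample_percent / 100.0
    reducers = math.ceil(sampled_bytes / bytes_per_reducer)
    # Keep the result between one reducer and the configured maximum.
    return max(1, min(max_reducers, reducers))


one_tb = 10**12
# Unsampled input versus a 1 percent block sample of the same table.
print(estimate_reducers(one_tb, 100, 10**9, 999))
print(estimate_reducers(one_tb, 1, 10**9, 999))
```

With these assumed values, the unsampled estimate hits the 999 cap while the 1 percent sample needs about 10 reducers. If the input turns out not to be sampled at all, an estimate like this would be too low, so it should be treated as a guess.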

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
