[jira] [Commented] (HIVE-10880) The bucket number is not respected in insert overwrite.

Yongzhi Chen (JIRA) Wed, 10 Jun 2015 05:44:26 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-10880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14580470#comment-14580470
 ]


Yongzhi Chen commented on HIVE-10880:
-------------------------------------

It is a Hadoop issue. Hive uses hadoop-core-1.2.1.jar, which has a bug. As a 
test, I renamed a 2.6 version hadoop-core jar to hadoop-core-1.2.1.jar and ran 
Hive 1.2; the data was distributed to all the buckets. 
However, I cannot use hadoop-core 2.6 to build Hive, because it makes Hive Shims 
Common (and maybe more) fail. So we need the lower version to support hadoop-1 
(shims 0.20s). 
[~xuefuz], you said that you had read something about a Hadoop bug where a local 
job does not respect the number of reducers. Could you find the JIRA number? 
Thanks.
For now I think my fix is a good workaround for the issue. The buckets are not 
balanced, but they are fully functional. 
What do you think? 
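
To illustrate why a single reducer leaves only one bucket file, here is a 
minimal Python sketch. It is not Hive code. It assumes that a string column is 
hashed like Java's String.hashCode over its bytes, that the bucket is the 
non-negative hash modulo the bucket count, and that each reducer that receives 
rows writes exactly one file. The function names are hypothetical.

```python
def java_style_hash(value: str) -> int:
    """31-based rolling hash over bytes, kept as an unsigned 32-bit value.

    Assumption: matches Java's String.hashCode for ASCII input.
    """
    h = 0
    for b in value.encode("ascii"):
        h = (31 * h + b) & 0xFFFFFFFF
    return h


def bucket_for(value: str, num_buckets: int) -> int:
    """Bucket index: non-negative hash modulo the declared bucket count."""
    return (java_style_hash(value) & 0x7FFFFFFF) % num_buckets


def files_written(rows, num_buckets: int, num_reducers: int) -> int:
    """Count reducers that receive at least one row (one file per reducer).

    Simplification: a row for bucket b is routed to reducer b % num_reducers.
    """
    return len({bucket_for(r, num_buckets) % num_reducers for r in rows})


# The eight rows selected by: data like 'first%'
rows = [f"firstinsert{i}" for i in range(1, 9)]

print(files_written(rows, num_buckets=2, num_reducers=2))  # requested reducers
print(files_written(rows, num_buckets=2, num_reducers=1))  # local job, one reducer
```

Under these assumptions, two reducers produce two files, while a local job 
that runs only one reducer produces a single file, which matches numFiles=1 in 
the log quoted below.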


> The bucket number is not respected in insert overwrite.
> -------------------------------------------------------
>
>                 Key: HIVE-10880
>                 URL: https://issues.apache.org/jira/browse/HIVE-10880
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 1.2.0
>            Reporter: Yongzhi Chen
>            Assignee: Yongzhi Chen
>            Priority: Blocker
>         Attachments: HIVE-10880.1.patch, HIVE-10880.2.patch, 
> HIVE-10880.3.patch
>
>
> When hive.enforce.bucketing is true, the bucket number defined in the table 
> is no longer respected in current master and 1.2. This is a regression.
> Reproduce:
> {noformat}
> CREATE TABLE IF NOT EXISTS buckettestinput( 
> data string 
> ) 
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
> CREATE TABLE IF NOT EXISTS buckettestoutput1( 
> data string 
> )CLUSTERED BY(data) 
> INTO 2 BUCKETS 
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
> CREATE TABLE IF NOT EXISTS buckettestoutput2( 
> data string 
> )CLUSTERED BY(data) 
> INTO 2 BUCKETS 
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
> Then I inserted the following data into the "buckettestinput" table
> firstinsert1 
> firstinsert2 
> firstinsert3 
> firstinsert4 
> firstinsert5 
> firstinsert6 
> firstinsert7 
> firstinsert8 
> secondinsert1 
> secondinsert2 
> secondinsert3 
> secondinsert4 
> secondinsert5 
> secondinsert6 
> secondinsert7 
> secondinsert8
> set hive.enforce.bucketing = true; 
> set hive.enforce.sorting=true;
> insert overwrite table buckettestoutput1 
> select * from buckettestinput where data like 'first%';
> set hive.auto.convert.sortmerge.join=true; 
> set hive.optimize.bucketmapjoin = true; 
> set hive.optimize.bucketmapjoin.sortedmerge = true; 
> select * from buckettestoutput1 a join buckettestoutput2 b on (a.data=b.data);
> Error: Error while compiling statement: FAILED: SemanticException [Error 
> 10141]: Bucketed table metadata is not correct. Fix the metadata or don't use 
> bucketed mapjoin, by setting hive.enforce.bucketmapjoin to false. The number 
> of buckets for table buckettestoutput1 is 2, whereas the number of files is 1 
> (state=42000,code=10141)
> {noformat}
> The debug information related to insert overwrite:
> {noformat}
> 0: jdbc:hive2://localhost:10000> insert overwrite table buckettestoutput1 
> select * from buckettestinput where data like 'first%';
> INFO  : Number of reduce tasks determined at compile time: 2
> INFO  : In order to change the average load for a reducer (in bytes):
> INFO  :   set hive.exec.reducers.bytes.per.reducer=<number>
> INFO  : In order to limit the maximum number of reducers:
> INFO  :   set hive.exec.reducers.max=<number>
> INFO  : In order to set a constant number of reducers:
> INFO  :   set mapred.reduce.tasks=<number>
> INFO  : Job running in-process (local Hadoop)
> INFO  : 2015-06-01 11:09:29,650 Stage-1 map = 86%,  reduce = 100%
> INFO  : Ended Job = job_local107155352_0001
> INFO  : Loading data to table default.buckettestoutput1 from 
> file:/user/hive/warehouse/buckettestoutput1/.hive-staging_hive_2015-06-01_11-09-28_166_3109203968904090801-1/-ext-10000
> INFO  : Table default.buckettestoutput1 stats: [numFiles=1, numRows=4, 
> totalSize=52, rawDataSize=48]
> No rows affected (1.692 seconds)
> {noformat}
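
The mismatch reported by error 10141 can also be seen in the stats line of the 
log above. The following Python sketch (a hypothetical helper, not part of 
Hive) extracts numFiles from that line and compares it with the bucket count 
declared in the CREATE TABLE statement.

```python
import re


def num_files_from_stats(line: str):
    """Return the numFiles value from a Hive table stats log line, or None."""
    match = re.search(r"numFiles=(\d+)", line)
    return int(match.group(1)) if match else None


# Stats line copied from the log above (joined onto one line).
stats_line = (
    "INFO  : Table default.buckettestoutput1 stats: "
    "[numFiles=1, numRows=4, totalSize=52, rawDataSize=48]"
)
declared_buckets = 2  # INTO 2 BUCKETS in the table definition

actual_files = num_files_from_stats(stats_line)
bucketing_ok = actual_files == declared_buckets
print(f"declared={declared_buckets} files={actual_files} ok={bucketing_ok}")
```

A result of ok=False here corresponds to the condition that makes the later 
bucketed map join fail.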



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
