[ https://issues.apache.org/jira/browse/HIVE-10880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14580470#comment-14580470 ]
Yongzhi Chen commented on HIVE-10880:
-------------------------------------

This is a Hadoop issue. Hive uses hadoop-core-1.2.1.jar, which has a bug. As a test, I renamed a 2.6 version hadoop-core jar to hadoop-core-1.2.1.jar and ran Hive 1.2; the data was distributed to all the buckets. However, I cannot use hadoop-core 2.6 to build Hive, because it makes Hive Shims Common (and maybe more modules) fail. So we need the lower version to support hadoop-1 (shims 0.20s).

[~xuefuz], you said that you read something about a Hadoop bug where a local job does not respect the number of reducers. Could you find the JIRA number? Thanks.

For now I think my fix is a good workaround for the issue. The buckets are not balanced, but they are fully functional. What do you think?

> The bucket number is not respected in insert overwrite.
> -------------------------------------------------------
>
>                 Key: HIVE-10880
>                 URL: https://issues.apache.org/jira/browse/HIVE-10880
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 1.2.0
>            Reporter: Yongzhi Chen
>            Assignee: Yongzhi Chen
>            Priority: Blocker
>         Attachments: HIVE-10880.1.patch, HIVE-10880.2.patch,
> HIVE-10880.3.patch
>
>
> When hive.enforce.bucketing is true, the bucket number defined in the table
> is no longer respected in current master and 1.2. This is a regression.
> Reproduce:
> {noformat}
> CREATE TABLE IF NOT EXISTS buckettestinput(
> data string
> )
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
> CREATE TABLE IF NOT EXISTS buckettestoutput1(
> data string
> )CLUSTERED BY(data)
> INTO 2 BUCKETS
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
> CREATE TABLE IF NOT EXISTS buckettestoutput2(
> data string
> )CLUSTERED BY(data)
> INTO 2 BUCKETS
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
> Then I inserted the following data into the "buckettestinput" table
> firstinsert1
> firstinsert2
> firstinsert3
> firstinsert4
> firstinsert5
> firstinsert6
> firstinsert7
> firstinsert8
> secondinsert1
> secondinsert2
> secondinsert3
> secondinsert4
> secondinsert5
> secondinsert6
> secondinsert7
> secondinsert8
> set hive.enforce.bucketing = true;
> set hive.enforce.sorting=true;
> insert overwrite table buckettestoutput1
> select * from buckettestinput where data like 'first%';
> set hive.auto.convert.sortmerge.join=true;
> set hive.optimize.bucketmapjoin = true;
> set hive.optimize.bucketmapjoin.sortedmerge = true;
> select * from buckettestoutput1 a join buckettestoutput2 b on (a.data=b.data);
> Error: Error while compiling statement: FAILED: SemanticException [Error
> 10141]: Bucketed table metadata is not correct. Fix the metadata or don't use
> bucketed mapjoin, by setting hive.enforce.bucketmapjoin to false. The number
> of buckets for table buckettestoutput1 is 2, whereas the number of files is 1
> (state=42000,code=10141)
> {noformat}
> The debug information for the insert overwrite:
> {noformat}
> 0: jdbc:hive2://localhost:10000> insert overwrite table buckettestoutput1
> select * from buckettestinput where data like 'first%'insert overwrite table
> buckettestoutput1
> 0: jdbc:hive2://localhost:10000> ;
> select * from buckettestinput where data like '
> first%';
> INFO : Number of reduce tasks determined at compile time: 2
> INFO : In order to change the average load for a reducer (in bytes):
> INFO : set hive.exec.reducers.bytes.per.reducer=<number>
> INFO : In order to limit the maximum number of reducers:
> INFO : set hive.exec.reducers.max=<number>
> INFO : In order to set a constant number of reducers:
> INFO : set mapred.reduce.tasks=<number>
> INFO : Job running in-process (local Hadoop)
> INFO : 2015-06-01 11:09:29,650 Stage-1 map = 86%, reduce = 100%
> INFO : Ended Job = job_local107155352_0001
> INFO : Loading data to table default.buckettestoutput1 from
> file:/user/hive/warehouse/buckettestoutput1/.hive-staging_hive_2015-06-01_11-09-28_166_3109203968904090801-1/-ext-10000
> INFO : Table default.buckettestoutput1 stats: [numFiles=1, numRows=4,
> totalSize=52, rawDataSize=48]
> No rows affected (1.692 seconds)
> {noformat}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
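
For anyone who wants to check a table before running the bucketed join, here is a rough Python sketch of the condition that Error 10141 reports: the number of data files in the table directory must equal the declared bucket count. This is not Hive's code. The helper names `count_bucket_files` and `bucket_metadata_ok` are made up for illustration, and the sketch assumes the table lives on a local filesystem path (as in the `file:/user/hive/warehouse/...` log line) and that entries starting with `_` or `.` (such as the `.hive-staging_...` directory) do not count as bucket files.

```python
import os


def count_bucket_files(table_dir):
    """Count data files directly under table_dir.

    Assumption: names starting with '_' or '.' are hidden/staging
    entries and are skipped; subdirectories are not counted.
    """
    return sum(
        1
        for name in os.listdir(table_dir)
        if not name.startswith(("_", "."))
        and os.path.isfile(os.path.join(table_dir, name))
    )


def bucket_metadata_ok(table_dir, declared_buckets):
    """Return True when the file count matches the declared bucket count."""
    return count_bucket_files(table_dir) == declared_buckets
```

With the reproduce case above, `bucket_metadata_ok("/user/hive/warehouse/buckettestoutput1", 2)` would be expected to return False after the insert overwrite, because the stats line shows numFiles=1 for a table declared with 2 buckets.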