[ https://issues.apache.org/jira/browse/HIVE-10866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583880#comment-14583880 ]

Xuefu Zhang commented on HIVE-10866:
------------------------------------

[~ychena], I'm not sure if I missed anything, but #1, we need to have a flag 
that allows users to go back to the previous behavior (not hitting this error 
message). #2, we need to make sure ACID continues to work.

I don't think we need to check whether the table is empty. Users are supposed to 
use "insert overwrite" instead. If they know what they are doing, they can use 
the above-mentioned flag to get the old behavior.
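
For illustration only (a sketch, not part of any attached patch), the "insert 
overwrite" route with the tables from the description below would look roughly 
like this, writing the full desired dataset in one statement so the file count 
stays equal to the declared bucket count:

{noformat}
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;

-- one overwrite instead of repeated "insert into": all rows are written in a
-- single job, so buckettestoutput1 ends up with exactly its 2 bucket files
insert overwrite table buckettestoutput1
select code from sample_07;
{noformat}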

> Give an error when the client tries to insert into a bucketed table for SMB
> ---------------------------------------------------------------------------
>
>                 Key: HIVE-10866
>                 URL: https://issues.apache.org/jira/browse/HIVE-10866
>             Project: Hive
>          Issue Type: Improvement
>    Affects Versions: 1.2.0, 1.3.0
>            Reporter: Yongzhi Chen
>            Assignee: Yongzhi Chen
>         Attachments: HIVE-10866.1.patch, HIVE-10866.2.patch, 
> HIVE-10866.3.patch, HIVE-10866.4.patch
>
>
> Currently, Hive does not support appends (insert into) to bucketed tables; see 
> the open JIRA HIVE-3608. When inserting into such a table, the data becomes 
> "corrupted" and is no longer fit for the sort-merge bucket mapjoin. 
> We need to find a way to prevent clients from inserting into such tables, or at 
> least give a warning.
> Reproduce:
> {noformat}
> CREATE TABLE IF NOT EXISTS buckettestoutput1( 
> data string 
> )CLUSTERED BY(data) 
> INTO 2 BUCKETS 
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
> CREATE TABLE IF NOT EXISTS buckettestoutput2( 
> data string 
> )CLUSTERED BY(data) 
> INTO 2 BUCKETS 
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
> set hive.enforce.bucketing = true; 
> set hive.enforce.sorting=true;
> insert into table buckettestoutput1 select code from sample_07 where 
> total_emp < 134354250 limit 10;
> After this first insert, I did:
> set hive.auto.convert.sortmerge.join=true; 
> set hive.optimize.bucketmapjoin = true; 
> set hive.optimize.bucketmapjoin.sortedmerge = true; 
> set hive.auto.convert.sortmerge.join.noconditionaltask=true;
> 0: jdbc:hive2://localhost:10000> select * from buckettestoutput1 a join 
> buckettestoutput2 b on (a.data=b.data);
> +-------+-------+
> | data  | data  |
> +-------+-------+
> +-------+-------+
> So select works fine. 
> Second insert:
> 0: jdbc:hive2://localhost:10000> insert into table buckettestoutput1 select 
> code from sample_07 where total_emp >= 134354250 limit 10;
> No rows affected (61.235 seconds)
> Then select:
> 0: jdbc:hive2://localhost:10000> select * from buckettestoutput1 a join 
> buckettestoutput2 b on (a.data=b.data);
> Error: Error while compiling statement: FAILED: SemanticException [Error 
> 10141]: Bucketed table metadata is not correct. Fix the metadata or don't use 
> bucketed mapjoin, by setting hive.enforce.bucketmapjoin to false. The number 
> of buckets for table buckettestoutput1 is 2, whereas the number of files is 4 
> (state=42000,code=10141)
> 0: jdbc:hive2://localhost:10000>
> {noformat}
> Insert into an empty table or partition works fine, but after inserting into a 
> non-empty one (the second insert in the reproduce above), the SMB mapjoin 
> throws an error. We should not let the second insert succeed when the user 
> explicitly wants to enforce bucketing. 
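> For reference, the mismatch that the SemanticException reports can be checked 
> by comparing the declared bucket count against the number of files in the 
> table directory. An illustrative session (the warehouse path below is an 
> assumption based on a default configuration):
> {noformat}
> -- "Num Buckets" in the output should read 2
> describe formatted buckettestoutput1;
> -- after the second insert, 4 files show up here instead of 2
> -- (path assumes the default Hive warehouse location)
> dfs -ls /user/hive/warehouse/buckettestoutput1;
> {noformat}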



