[jira] [Comment Edited] (HIVE-22098) Data loss occurs when multiple tables are join with different bucket_version

JithendhiraKumar (Jira) Sat, 22 Feb 2020 13:45:45 -0800


    [ 
https://issues.apache.org/jira/browse/HIVE-22098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17042715#comment-17042715
 ]


JithendhiraKumar edited comment on HIVE-22098 at 2/22/20 9:44 PM:
------------------------------------------------------------------

[~luguangming] has already mentioned the steps to reproduce *Scenario 1.*

Here are Steps To Reproduce *Scenario 2*: 

Input (test_data.csv)
{code:java}
0,Kurt,vulnedca...@yahoo.co.uk
1,Rolland,naej...@gmx.com
2,Cortez,blategarfi...@yahoo.com
3,Tyron,tamepro...@gmail.com
4,Matthew,wellezek...@yahoo.co.uk
5,Jeffrey,fabingeb...@comcast.net
6,Gerard,oughtou...@att.net
7,Hal,coursedma...@hotmail.com
8,Virgil,squintpr...@gmail.com
9,Hector,lewddil...@email.com
{code}
{code:java}
CREATE TABLE `join_test_1`(`id` string, `first` string, `email` string) ROW 
FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH 
SERDEPROPERTIES ('field.delim'=',', 'serialization.format'=',') STORED AS 
INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'OUTPUTFORMAT 
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' TBLPROPERTIES 
('bucketing_version'='1');

LOAD DATA LOCAL INPATH '/uploads/test_data.csv' OVERWRITE INTO TABLE 
join_test_1;

CREATE TABLE `join_test_2`(`id` string, `first` string, `email` string) ROW 
FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH 
SERDEPROPERTIES ('field.delim'=',', 'serialization.format'=',') STORED AS 
INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'OUTPUTFORMAT 
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' TBLPROPERTIES 
('bucketing_version'='2');

LOAD DATA LOCAL INPATH '/uploads/test_data.csv' OVERWRITE INTO TABLE 
join_test_2;

Query:
set mapred.reduce.tasks=2;
set hive.auto.convert.join=false;
SELECT * from (SELECT id from join_test_1) as tbl1 LEFT JOIN (SELECT id from 
join_test_2) as tbl2 on tbl1.id = tbl2.id;

OutPut: (Wrong Results/Data Loss)
+----------+----------+
| tbl1.id  | tbl2.id  |
+----------+----------+
| 0        | NULL     |
| 2        | NULL     |
| 4        | NULL     |
| 6        | NULL     |
| 8        | 8        |
| 1        | NULL     |
| 3        | NULL     |
| 5        | 5        |
| 7        | NULL     |
| 9        | NULL     |
+----------+----------+

Expected Result:
+----------+----------+
| tbl1.id  | tbl2.id  |
+----------+----------+
| 1        | 1        |
| 3        | 3        |
| 7        | 7        |
| 8        | 8        |
| 9        | 9        |
| 0        | 0        |
| 2        | 2        |
| 4        | 4        |
| 5        | 5        |
| 6        | 6        |
+----------+----------+
{code}
.


was (Author: jithendhir92):
[~luguangming] has already mentioned the steps to reproduce *Scenario 1.*

Here are Steps To Reproduce *Scenario 2*: 

Input (test_data.csv)

 
{code:java}
0,Kurt,vulnedca...@yahoo.co.uk
1,Rolland,naej...@gmx.com
2,Cortez,blategarfi...@yahoo.com
3,Tyron,tamepro...@gmail.com
4,Matthew,wellezek...@yahoo.co.uk
5,Jeffrey,fabingeb...@comcast.net
6,Gerard,oughtou...@att.net
7,Hal,coursedma...@hotmail.com
8,Virgil,squintpr...@gmail.com
9,Hector,lewddil...@email.com
{code}
 

 
{code:java}
CREATE TABLE `join_test_1`(`id` string, `first` string, `email` string) ROW 
FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH 
SERDEPROPERTIES ('field.delim'=',', 'serialization.format'=',') STORED AS 
INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'OUTPUTFORMAT 
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' TBLPROPERTIES 
('bucketing_version'='1');

LOAD DATA LOCAL INPATH '/uploads/test_data.csv' OVERWRITE INTO TABLE 
join_test_1;

CREATE TABLE `join_test_2`(`id` string, `first` string, `email` string) ROW 
FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH 
SERDEPROPERTIES ('field.delim'=',', 'serialization.format'=',') STORED AS 
INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'OUTPUTFORMAT 
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' TBLPROPERTIES 
('bucketing_version'='2');

LOAD DATA LOCAL INPATH '/uploads/test_data.csv' OVERWRITE INTO TABLE 
join_test_2;

Query:
set mapred.reduce.tasks=2;
set hive.auto.convert.join=false;
SELECT * from (SELECT id from join_test_1) as tbl1 LEFT JOIN (SELECT id from 
join_test_2) as tbl2 on tbl1.id = tbl2.id;

OutPut: (Wrong Results/Data Loss)
+----------+----------+
| tbl1.id  | tbl2.id  |
+----------+----------+
| 0        | NULL     |
| 2        | NULL     |
| 4        | NULL     |
| 6        | NULL     |
| 8        | 8        |
| 1        | NULL     |
| 3        | NULL     |
| 5        | 5        |
| 7        | NULL     |
| 9        | NULL     |
+----------+----------+

Expected Result:
+----------+----------+
| tbl1.id  | tbl2.id  |
+----------+----------+
| 1        | 1        |
| 3        | 3        |
| 7        | 7        |
| 8        | 8        |
| 9        | 9        |
| 0        | 0        |
| 2        | 2        |
| 4        | 4        |
| 5        | 5        |
| 6        | 6        |
+----------+----------+
{code}
.

> Data loss occurs when multiple tables are join with different bucket_version
> ----------------------------------------------------------------------------
>
>                 Key: HIVE-22098
>                 URL: https://issues.apache.org/jira/browse/HIVE-22098
>             Project: Hive
>          Issue Type: Bug
>          Components: Operators
>    Affects Versions: 3.1.0, 3.1.2
>            Reporter: LuGuangMing
>            Assignee: LuGuangMing
>            Priority: Blocker
>              Labels: data-loss, wrongresults
>         Attachments: HIVE-22098.1.patch, image-2019-08-12-18-45-15-771.png, 
> join_test.sql, table_a_data.orc, table_b_data.orc, table_c_data.orc
>
>
> When different bucketVersion of tables do join and no of reducers is greater 
> than 2, the result is incorrect (*data loss*).
>  *Scenario 1*: Three tables join. The temporary result data of table_a in the 
> first table and table_b in the second table joins result is recorded as 
> tmp_a_b, When it joins with the third table, the bucket_version=2 of the 
> table created by default after hive-3.0.0, temporary data tmp_a_b initialized 
> the bucketVerison=-1, and then ReduceSinkOperator Verketison=-1 is joined. In 
> the init method, the hash algorithm of selecting join column is selected 
> according to bucketVersion. If bucketVersion = 2 and is not an acid 
> operation, it will acquired the new algorithm of hash. Otherwise, the old 
> algorithm of hash is acquired. Because of the inconsistency of the algorithm 
> of hash, the partition of data allocation caused are different. At stage of 
> Reducer, Data with the same key can not be paired resulting in data loss.
> *Scenario 2*: create two test tables, create table 
> table_bucketversion_1(col_1 string, col_2 string) TBLPROPERTIES 
> ('bucketing_version'='1'); table_bucketversion_2(col_1 string, col_2 string) 
> TBLPROPERTIES ('bucketing_version'='2');
>  when use table_bucketversion_1 to join table_bucketversion_2, partial result 
> data will be loss due to bucketVerison is different.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Comment Edited] (HIVE-22098) Data loss occurs when multiple tables are join with different bucket_version

Reply via email to