[ https://issues.apache.org/jira/browse/HIVE-22098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17042715#comment-17042715 ]
JithendhiraKumar edited comment on HIVE-22098 at 2/22/20 9:44 PM: ------------------------------------------------------------------ [~luguangming] has already mentioned the steps to reproduce *Scenario 1.* Here are Steps To Reproduce *Scenario 2*: Input (test_data.csv) {code:java} 0,Kurt,vulnedca...@yahoo.co.uk 1,Rolland,naej...@gmx.com 2,Cortez,blategarfi...@yahoo.com 3,Tyron,tamepro...@gmail.com 4,Matthew,wellezek...@yahoo.co.uk 5,Jeffrey,fabingeb...@comcast.net 6,Gerard,oughtou...@att.net 7,Hal,coursedma...@hotmail.com 8,Virgil,squintpr...@gmail.com 9,Hector,lewddil...@email.com {code} {code:java} CREATE TABLE `join_test_1`(`id` string, `first` string, `email` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES ('field.delim'=',', 'serialization.format'=',') STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' TBLPROPERTIES ('bucketing_version'='1'); LOAD DATA LOCAL INPATH '/uploads/test_data.csv' OVERWRITE INTO TABLE join_test_1; CREATE TABLE `join_test_2`(`id` string, `first` string, `email` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES ('field.delim'=',', 'serialization.format'=',') STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' TBLPROPERTIES ('bucketing_version'='2'); LOAD DATA LOCAL INPATH '/uploads/test_data.csv' OVERWRITE INTO TABLE join_test_2; Query: set mapred.reduce.tasks=2; set hive.auto.convert.join=false; SELECT * from (SELECT id from join_test_1) as tbl1 LEFT JOIN (SELECT id from join_test_2) as tbl2 on tbl1.id = tbl2.id; OutPut: (Wrong Results/Data Loss) +----------+----------+ | tbl1.id | tbl2.id | +----------+----------+ | 0 | NULL | | 2 | NULL | | 4 | NULL | | 6 | NULL | | 8 | 8 | | 1 | NULL | | 3 | NULL | | 5 | 5 | | 7 | NULL | | 9 | NULL | +----------+----------+ Expected Result: +----------+----------+ | tbl1.id | tbl2.id | +----------+----------+ | 1 | 1 | | 3 | 3 | | 7 | 7 | | 8 | 8 | | 9 | 9 | | 0 | 0 | | 2 | 2 | | 4 | 4 | | 5 | 5 | | 6 | 6 | +----------+----------+ {code} . was (Author: jithendhir92): [~luguangming] has already mentioned the steps to reproduce *Scenario 1.* Here are Steps To Reproduce *Scenario 2*: Input (test_data.csv) {code:java} 0,Kurt,vulnedca...@yahoo.co.uk 1,Rolland,naej...@gmx.com 2,Cortez,blategarfi...@yahoo.com 3,Tyron,tamepro...@gmail.com 4,Matthew,wellezek...@yahoo.co.uk 5,Jeffrey,fabingeb...@comcast.net 6,Gerard,oughtou...@att.net 7,Hal,coursedma...@hotmail.com 8,Virgil,squintpr...@gmail.com 9,Hector,lewddil...@email.com {code} {code:java} CREATE TABLE `join_test_1`(`id` string, `first` string, `email` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES ('field.delim'=',', 'serialization.format'=',') STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' TBLPROPERTIES ('bucketing_version'='1'); LOAD DATA LOCAL INPATH '/uploads/test_data.csv' OVERWRITE INTO TABLE join_test_1; CREATE TABLE `join_test_2`(`id` string, `first` string, `email` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES ('field.delim'=',', 'serialization.format'=',') STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' TBLPROPERTIES ('bucketing_version'='2'); LOAD DATA LOCAL INPATH '/uploads/test_data.csv' OVERWRITE INTO TABLE join_test_2; Query: set mapred.reduce.tasks=2; set hive.auto.convert.join=false; SELECT * from (SELECT id from join_test_1) as tbl1 LEFT JOIN (SELECT id from join_test_2) as tbl2 on tbl1.id = tbl2.id; OutPut: (Wrong Results/Data Loss) +----------+----------+ | tbl1.id | tbl2.id | +----------+----------+ | 0 | NULL | | 2 | NULL | | 4 | NULL | | 6 | NULL | | 8 | 8 | | 1 | NULL | | 3 | NULL | | 5 | 5 | | 7 | NULL | | 9 | NULL | +----------+----------+ Expected Result: +----------+----------+ | tbl1.id | tbl2.id | +----------+----------+ | 1 | 1 | | 3 | 3 | | 7 | 7 | | 8 | 8 | | 9 | 9 | | 0 | 0 | | 2 | 2 | | 4 | 4 | | 5 | 5 | | 6 | 6 | +----------+----------+ {code} . > Data loss occurs when multiple tables are join with different bucket_version > ---------------------------------------------------------------------------- > > Key: HIVE-22098 > URL: https://issues.apache.org/jira/browse/HIVE-22098 > Project: Hive > Issue Type: Bug > Components: Operators > Affects Versions: 3.1.0, 3.1.2 > Reporter: LuGuangMing > Assignee: LuGuangMing > Priority: Blocker > Labels: data-loss, wrongresults > Attachments: HIVE-22098.1.patch, image-2019-08-12-18-45-15-771.png, > join_test.sql, table_a_data.orc, table_b_data.orc, table_c_data.orc > > > When different bucketVersion of tables do join and no of reducers is greater > than 2, the result is incorrect (*data loss*). > *Scenario 1*: Three tables join. The temporary result data of table_a in the > first table and table_b in the second table joins result is recorded as > tmp_a_b, When it joins with the third table, the bucket_version=2 of the > table created by default after hive-3.0.0, temporary data tmp_a_b initialized > the bucketVerison=-1, and then ReduceSinkOperator Verketison=-1 is joined. In > the init method, the hash algorithm of selecting join column is selected > according to bucketVersion. If bucketVersion = 2 and is not an acid > operation, it will acquired the new algorithm of hash. Otherwise, the old > algorithm of hash is acquired. Because of the inconsistency of the algorithm > of hash, the partition of data allocation caused are different. At stage of > Reducer, Data with the same key can not be paired resulting in data loss. > *Scenario 2*: create two test tables, create table > table_bucketversion_1(col_1 string, col_2 string) TBLPROPERTIES > ('bucketing_version'='1'); table_bucketversion_2(col_1 string, col_2 string) > TBLPROPERTIES ('bucketing_version'='2'); > when use table_bucketversion_1 to join table_bucketversion_2, partial result > data will be loss due to bucketVerison is different. > -- This message was sent by Atlassian Jira (v8.3.4#803005)