[ https://issues.apache.org/jira/browse/HIVE-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Phabricator updated HIVE-2520:
------------------------------
    Attachment: HIVE-2520.D717.1.patch

njain requested code review of "HIVE-2520 [jira] left semi join will duplicate data".
Reviewers: JIRA

HIVE-2520

  CREATE TABLE sales (name STRING, id INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
  CREATE TABLE things (id INT, name STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

The 'sales' table has data in one file, sales.txt, and the data is:
  Joe 2
  Hank 2
The 'things' table has data in two files, things.txt and things2.txt.
The content of things.txt is:
  2 Tie
The content of things2.txt is:
  2 Tie

  SELECT * FROM sales LEFT SEMI JOIN things ON (sales.id = things.id);

will output:
  Joe 2
  Joe 2
  Hank 2
  Hank 2
so the result is wrong: a left semi join should return each matching 'sales' row only once.

In CommonJoinOperator, left semi join should use
" genObject(null, 0, new IntermediateObject(new ArrayList[numAliases], 0), true); "
to generate data, but it currently uses " genUniqueJoinObject(0, 0); ".
This patch fixes the problem.

TEST PLAN
  EMPTY

REVISION DETAIL
  https://reviews.facebook.net/D717

AFFECTED FILES
  data/files/things.txt
  data/files/sales.txt
  data/files/things2.txt
  ql/src/test/results/clientpositive/leftsemijoin.q.out
  ql/src/test/queries/clientpositive/leftsemijoin.q
  ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java
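For readers unfamiliar with the expected semantics, here is a minimal in-memory sketch of the difference between a left semi join and a join that emits one row per matching pair. It is not Hive's code and does not model CommonJoinOperator; the class LeftSemiJoinSketch, the Sale type, and both method names are hypothetical, and the 'things' table is reduced to its id column on the assumption that only the join key matters here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LeftSemiJoinSketch {

    /** One row of the 'sales' table from the report (hypothetical in-memory form). */
    static final class Sale {
        final String name;
        final int id;

        Sale(String name, int id) {
            this.name = name;
            this.id = id;
        }

        @Override
        public String toString() {
            return name + "\t" + id;
        }
    }

    /** Left semi join: emit a left row once if at least one right row shares its id. */
    static List<String> leftSemiJoin(List<Sale> sales, List<Integer> thingIds) {
        // Duplicate keys on the right side collapse in the set.
        Set<Integer> keys = new HashSet<>(thingIds);
        List<String> out = new ArrayList<>();
        for (Sale s : sales) {
            if (keys.contains(s.id)) {
                out.add(s.toString());
            }
        }
        return out;
    }

    /** One output row per matching (left, right) pair, which duplicates left rows. */
    static List<String> perMatchJoin(List<Sale> sales, List<Integer> thingIds) {
        List<String> out = new ArrayList<>();
        for (Sale s : sales) {
            for (int id : thingIds) {
                if (s.id == id) {
                    out.add(s.toString());
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Sale> sales = Arrays.asList(new Sale("Joe", 2), new Sale("Hank", 2));
        // One row with id 2 from things.txt and one from things2.txt.
        List<Integer> thingIds = Arrays.asList(2, 2);

        System.out.println("semi join: " + leftSemiJoin(sales, thingIds));
        System.out.println("per match: " + perMatchJoin(sales, thingIds));
    }
}
```

With the data from the report, leftSemiJoin returns two rows (Joe and Hank, once each), while perMatchJoin returns four, the same shape as the duplicated output described above.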
> left semi join will duplicate data
> ----------------------------------
>
>                 Key: HIVE-2520
>                 URL: https://issues.apache.org/jira/browse/HIVE-2520
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: binlijin
>            Assignee: binlijin
>            Priority: Critical
>              Labels: patch
>         Attachments: HIVE-2520.D717.1.patch, hive-2520.2.patch, hive-2520.patch