Yin Huai created HIVE-4781:
------------------------------

             Summary: LEFT SEMI JOIN generates wrong results when
                 Key: HIVE-4781
                 URL: https://issues.apache.org/jira/browse/HIVE-4781
             Project: Hive
          Issue Type: Bug
    Affects Versions: 0.12.0
            Reporter: Yin Huai
            Assignee: Yin Huai

Suppose that we have the query shown below:
{code:sql}
SELECT key FROM t1 LEFT SEMI JOIN t2 ON (t1.key=t2.key);
{code}
When the number of rows of t2 is larger than hive.join.emit.interval, JoinOperator will emit the matching rows from t1 more than once, which results in redundant output.

Let's say t1 is
{code}
key
----
1
{code}
and t2 is
{code}
key
----
1
1
1
1
{code}
When hive.join.emit.interval=1, the output of the above query will be
{code}
1
1
1
1
{code}
The correct result should be
{code}
1
{code}
This problem is not caught by the unit tests. Because a GBY operator is inserted before JoinOperator and there is only 1 mapper, the output of the map phase contains only distinct keys.

Please apply the patch 'wrong_semi_join.txt' attached below and use
{code}
ant test -Dtestcase=TestMinimrCliDriver -Dqfile="left_semi_join.q" -Dtest.silent=false
{code}
to reproduce the problem. The wrong result can be found in
{code}
<hive_root_dir>/build/ql/test/logs/clientpositive
{code}
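
The behaviour described above can be illustrated with a small stand-alone simulation. This is only a simplified model written for this report, not Hive's JoinOperator code: the function `left_semi_join` and its `emit_once_per_key` flag are hypothetical names, and the model assumes that right-side rows for a key are buffered and that the join result is emitted every time the buffer reaches the emit interval, plus once more for any leftover rows. The flag shows the expected semi-join semantics; it is not the attached patch.

```python
def left_semi_join(left_keys, right_keys, emit_interval, emit_once_per_key=False):
    """Simplified model of a streaming LEFT SEMI JOIN on one key column.

    Hypothetical sketch: right-side rows for a key are buffered, and the
    join result is emitted each time `emit_interval` rows are buffered.
    """
    output = []
    for key in sorted(set(left_keys)):
        left_rows = [k for k in left_keys if k == key]
        buffered = 0             # right-side rows buffered since the last flush
        already_emitted = False  # whether this key's left rows were emitted yet

        for k in right_keys:
            if k != key:
                continue
            buffered += 1
            if buffered >= emit_interval:
                # Flush: without the guard, the left rows are emitted again
                # on every flush, which is the redundant output in the report.
                if not (emit_once_per_key and already_emitted):
                    output.extend(left_rows)
                    already_emitted = True
                buffered = 0

        # Flush whatever is left at the end of the key group.
        if buffered > 0 and not (emit_once_per_key and already_emitted):
            output.extend(left_rows)
    return output


t1 = [1]
t2 = [1, 1, 1, 1]

print(left_semi_join(t1, t2, emit_interval=1))     # redundant rows
print(left_semi_join(t1, t2, emit_interval=1000))  # buffer never fills early
print(left_semi_join(t1, t2, emit_interval=1, emit_once_per_key=True))
```

With the t1 and t2 from this report, the first call reproduces the four redundant rows in this model, while the other two calls return the single expected row.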