[ 
https://issues.apache.org/jira/browse/HIVE-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12922193#action_12922193
 ] 

Liyin Tang commented on HIVE-1723:
----------------------------------

The root cause of the missing join results is that the join value for the 
small table is an empty array list (or several empty array lists). 
Every time jdbm serializes such an entry, it writes nothing to disk for the 
empty array list, so it also reads nothing back from disk. 
The join still needs this empty array list when it does its work, so losing 
it produces the wrong result. 
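
To make the failure mode concrete, here is a minimal sketch. It is not the 
Hive or jdbm code: the class and method names are hypothetical, and the 
"naive" encoding is only an assumption about how an empty list can end up as 
zero bytes. The length-prefixed variant is one possible way to keep the empty 
list distinguishable from a missing entry, not necessarily the fix that will 
be used for this issue.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Hive or jdbm.
class EmptyListRoundTrip {

    // Naive encoding: writes the elements and nothing else, so an empty
    // list becomes zero bytes.
    static byte[] serializeNaive(List<String> values) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            for (String v : values) {
                out.writeUTF(v);
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Zero bytes cannot be told apart from "no entry", so the reader
    // returns null and the join treats the key as absent.
    static List<String> deserializeNaive(byte[] data) {
        if (data == null || data.length == 0) {
            return null;
        }
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            List<String> values = new ArrayList<>();
            while (in.available() > 0) {
                values.add(in.readUTF());
            }
            return values;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Length-prefixed encoding: an empty list is still four bytes (count 0).
    static byte[] serializeWithCount(List<String> values) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(values.size());
            for (String v : values) {
                out.writeUTF(v);
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the count first, so an empty list comes back as an empty list.
    static List<String> deserializeWithCount(byte[] data) {
        if (data == null || data.length == 0) {
            return null;
        }
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            int count = in.readInt();
            List<String> values = new ArrayList<>(count);
            for (int i = 0; i < count; i++) {
                values.add(in.readUTF());
            }
            return values;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<String> empty = new ArrayList<>();
        System.out.println(deserializeNaive(serializeNaive(empty)));         // prints "null"
        System.out.println(deserializeWithCount(serializeWithCount(empty))); // prints "[]"
    }
}
```

In the sketch, the naive round trip turns an empty list into null, which 
matches the symptom described above: the key was stored, but its value can no 
longer be found when the join looks it up.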


> The result of left semi join is not correct
> -------------------------------------------
>
>                 Key: HIVE-1723
>                 URL: https://issues.apache.org/jira/browse/HIVE-1723
>             Project: Hadoop Hive
>          Issue Type: Bug
>            Reporter: Liyin Tang
>            Assignee: Liyin Tang
>
> In the test case semijoin.q, there is a query:
> select /*+ mapjoin(b) */ a.key from t3 a left semi join t1 b on a.key = b.key 
> sort by a.key;
> I think this query will return a wrong result if table t1 has more than 
> 25000 distinct keys.
> To keep it simple, I tried a very similar query:
> select /*+ mapjoin(b) */ a.key from test_semijoin a left semi join 
> test_semijoin b on a.key = b.key sort by a.key;
> The table test_semijoin looks like this:
> 0     0
> 1     1
> 2     2
> 3     3
> 4     4
> 5     5
> ...     ...
> 25000   25000
> 25001   25001
> ...     ...
> 25999   25999
> 26000   26000
> So the correct result of this query should be the same set of keys as in 
> table test_semijoin itself.
> In fact, the result contains only part of them: keys 0 to 24544.
> 0
> 1
> 2
> ..
> ..
> 24543
> 24544

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.