[ 
https://issues.apache.org/jira/browse/HIVE-19671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16517846#comment-16517846
 ] 

Rui Li commented on HIVE-19671:
-------------------------------

[~xuefuz], thanks for your input. I think rand(seed) may not work if the 
mapper's input is not read in a deterministic order. As an example, suppose a 
mapper needs to process keys {{1, 2, 3, 4, 5}}. The partitioning in the 1st 
attempt is as follows:
{noformat}
key   rand(seed)
 1  ->   1
 2  ->   2
 3  ->   3
 4  ->   4
 5  ->   5
{noformat}
So 5 reducers will fetch data from this mapper. Suppose the first 4 reducers 
have finished, and when the 5th reducer starts, the node hosting the mapper's 
output is lost, so the mapper is rerun. The 2nd attempt reads the keys in a 
different order and produces the following partitioning:
{noformat}
key   rand(seed)
 1  ->   1
 3  ->   2
 5  ->   3
 2  ->   4
 4  ->   5
{noformat}
Then the 5th reducer is rerun and fetches key 4 from the 2nd attempt. Since the 
4th reducer already fetched key 4 from the 1st attempt, key 4 is duplicated and 
key 5 is lost.
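
The scenario above can be replayed in a few lines of Python. This is only an illustrative sketch, not Hive code: {{SEEDED_SEQUENCE}}, {{partition}} and {{keys_for_reducer}} are hypothetical names, and the fixed list stands in for the values that rand(seed) returns on successive calls, idealized as in the tables above.

```python
# Stand-in for the values a seeded rand() yields on successive calls.
# The sequence is the same on every attempt because the seed is fixed.
SEEDED_SEQUENCE = [1, 2, 3, 4, 5]


def partition(keys_in_read_order):
    """Assign reducers by position: the i-th record read gets the i-th value."""
    return dict(zip(keys_in_read_order, SEEDED_SEQUENCE))


def keys_for_reducer(assignment, reducer):
    """Return the keys that the given reducer fetches from one mapper attempt."""
    return sorted(k for k, r in assignment.items() if r == reducer)


first_attempt = partition([1, 2, 3, 4, 5])   # read order of the 1st attempt
second_attempt = partition([1, 3, 5, 2, 4])  # read order of the 2nd attempt

# Reducers 1-4 finished against the 1st attempt; reducer 5 fetches from the rerun.
received = []
for reducer in range(1, 5):
    received += keys_for_reducer(first_attempt, reducer)
received += keys_for_reducer(second_attempt, 5)

print(sorted(received))  # [1, 2, 3, 4, 4]
```

The row count still matches, but key 4 appears twice and key 5 never arrives, which is the duplication and loss described above.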

To avoid the issue, we need to make sure the record reader guarantees a 
deterministic order when reading data from HDFS, and that we don't use a 
shuffle that doesn't order the keys, e.g. groupByKey in Spark. What do you think?
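
As a rough illustration of why a fixed input order would be enough, the Python sketch below assigns reducers by position after putting the keys in a fixed order. It is not Hive code. Sorting here is just an assumption standing in for a record reader that always returns records in the same order, and {{SEEDED_SEQUENCE}} and {{partition_ordered}} are hypothetical names.

```python
# Stand-in for the values a seeded rand() yields on successive calls.
SEEDED_SEQUENCE = [1, 2, 3, 4, 5]


def partition_ordered(keys_as_read):
    """Assign reducers by position after fixing the order of the input.

    sorted() is an assumption: it models a reader with a deterministic order.
    """
    return dict(zip(sorted(keys_as_read), SEEDED_SEQUENCE))


first_attempt = partition_ordered([1, 2, 3, 4, 5])
second_attempt = partition_ordered([1, 3, 5, 2, 4])

# Both attempts now send every key to the same reducer.
print(first_attempt == second_attempt)  # True
```

With the order fixed, a rerun mapper reproduces the same key-to-reducer assignment, so a reducer that fetches from the 2nd attempt sees the same keys it would have seen from the 1st.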

> Distribute by rand() can lead to data inconsistency
> ---------------------------------------------------
>
>                 Key: HIVE-19671
>                 URL: https://issues.apache.org/jira/browse/HIVE-19671
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Rui Li
>            Assignee: Rui Li
>            Priority: Major
>
> Noticed the following queries can give different results:
> {code}
> select count(*) from tbl;
> select count(*) from (select * from tbl distribute by rand()) a;
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
