[ 
https://issues.apache.org/jira/browse/HIVE-17010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16075852#comment-16075852
 ] 

Chao Sun edited comment on HIVE-17010 at 7/6/17 3:27 AM:
---------------------------------------------------------

Ah I see. Sometimes the stats estimation could generate negative values, in 
which case Hive will use {{Long.MAX_VALUE}} for both # of rows and data size. 
One case I observed previously:
{code}
not ((P1 or P2) or P3)
{code}
When no column stats are available, Hive simply divides the # of input rows 
by 2 for each predicate evaluation. Suppose the total number of input rows is 
10; then {{P1}}, {{P2}} and {{P3}} will each yield 5. The {{or}} operator adds 
the values from both sides, so the expression {{((P1 or P2) or P3)}} generates 
30 rows. The {{not}} operator, on the other hand, subtracts the value of its 
associated expression from the total input rows. Therefore in the end you get 
{{10 - 30 = -20}}.
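
To make the arithmetic concrete, here is a tiny sketch of the {{not}} rule 
described above (illustrative names only, not Hive's actual stats annotation 
code):
{code}
// Illustrative sketch, not Hive's actual stats annotation code.
// The NOT rule subtracts the child expression's row estimate from the total
// input rows, so it goes negative once the child estimate (30 in the case
// above) exceeds the input row count (10).
public class NotEstimateSketch {
  static long estimateNot(long inputRows, long childEstimate) {
    return inputRows - childEstimate;
  }

  public static void main(String[] args) {
    System.out.println(estimateNot(10, 30)); // prints -20
  }
}
{code}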

For the solution you proposed, I'm inclined to use {{StatsUtils.safeAdd}}, but 
either way should be fine.
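
Roughly, a saturating add along these lines (just a sketch of the idea, not 
the actual {{StatsUtils}} implementation) would keep the accumulated value at 
{{Long.MAX_VALUE}} instead of letting it wrap around to a negative number:
{code}
// Sketch of a saturating add in the spirit of StatsUtils.safeAdd; not the
// actual Hive implementation. On overflow it clamps to Long.MAX_VALUE
// instead of wrapping around to a negative value.
static long safeAdd(long a, long b) {
  try {
    return Math.addExact(a, b);
  } catch (ArithmeticException e) {
    return Long.MAX_VALUE;
  }
}
{code}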



> Fix the overflow problem of Long type in SetSparkReducerParallelism
> -------------------------------------------------------------------
>
>                 Key: HIVE-17010
>                 URL: https://issues.apache.org/jira/browse/HIVE-17010
>             Project: Hive
>          Issue Type: Bug
>            Reporter: liyunzhang_intel
>            Assignee: liyunzhang_intel
>         Attachments: HIVE-17010.1.patch
>
>
> We use 
> [numberOfBytes|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L129]
>  to collect the numberOfBytes of the siblings of a specified RS. We use the 
> Long type, and it overflows when the data is too big. When that happens, the 
> parallelism is decided by 
> [sparkMemoryAndCores.getSecond()|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L184]
>  if spark.dynamicAllocation.enabled is true. sparkMemoryAndCores.getSecond() 
> is a dynamic value decided by the Spark runtime; for example, it may be 5 or 
> 15, seemingly at random, and it may even be 1. The main problem here is the 
> overflow of Long addition. You can reproduce the overflow problem with the 
> following code:
> {code}
>     // requires: import java.math.BigInteger;
>     public static void main(String[] args) {
>       long a1 = 9223372036854775807L;  // Long.MAX_VALUE
>       long a2 = 1022672;
>       long res = a1 + a2;              // long addition wraps around
>       System.out.println(res);         // -9223372036853753137
>       BigInteger b1 = BigInteger.valueOf(a1);
>       BigInteger b2 = BigInteger.valueOf(a2);
>       BigInteger bigRes = b1.add(b2);  // BigInteger does not overflow
>       System.out.println(bigRes);      // 9223372036855798479
>     }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
