[ 
https://issues.apache.org/jira/browse/HIVE-12491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033094#comment-15033094
 ] 

Prasanth Jayachandran commented on HIVE-12491:
----------------------------------------------

The NDV of a UDF's output is currently assumed to be the worst case, which is 
the number of rows. Ideally, for built-in UDFs, we should keep that worst-case 
assumption only when the UDFType is non-deterministic (e.g. UDFRand), since the 
output NDV cannot be estimated. If the UDF is deterministic, like UDFMonth, we 
should instead use the NDV of the column referenced in the UDF. 
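
A minimal sketch of that rule, not the actual Hive code. The class and method 
names (UdfNdvSketch, estimateUdfNdv) are hypothetical, the boolean flag stands 
in for whatever the UDFType annotation reports, only single-column UDFs are 
covered, and the row count and column NDV in main are illustrative values.

```java
public class UdfNdvSketch {
    // Hypothetical helper, not a Hive API.
    // deterministic: assumed to come from the UDF's UDFType annotation.
    // numRows:       row count of the input, the current worst-case NDV.
    // columnNdv:     NDV of the single column referenced by the UDF.
    static long estimateUdfNdv(boolean deterministic, long numRows, long columnNdv) {
        if (!deterministic) {
            // Output cannot be predicted (e.g. a random UDF): keep the worst case.
            return numRows;
        }
        // A deterministic function of one column cannot produce more distinct
        // outputs than the column has distinct inputs, nor more than there are rows.
        return Math.min(columnNdv, numRows);
    }

    public static void main(String[] args) {
        long numRows = 1_000_000L;   // illustrative
        long columnNdv = 365L;       // illustrative NDV of a date column
        System.out.println("non-deterministic: " + estimateUdfNdv(false, numRows, columnNdv));
        System.out.println("deterministic:     " + estimateUdfNdv(true, numRows, columnNdv));
    }
}
```

The column NDV is only an upper bound for a deterministic UDF (a month 
function has at most 12 distinct outputs), but it is already far tighter than 
the row count.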

> Column Statistics: 3 attribute join on a 2-source table is off
> --------------------------------------------------------------
>
>                 Key: HIVE-12491
>                 URL: https://issues.apache.org/jira/browse/HIVE-12491
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 1.3.0, 2.0.0
>            Reporter: Gopal V
>            Assignee: Prasanth Jayachandran
>         Attachments: HIVE-12491.WIP.patch
>
>
> The eased out denominator has to detect duplicate row-stats from different 
> attributes.
> {code}
> select account_id from customers c,  customer_activation ca
>   where c.customer_id = ca.customer_id
>   and year(ca.dt) = year(c.dt) and month(ca.dt) = month(c.dt)
>   and year(ca.dt) between year('2013-12-26') and year('2013-12-26')
> {code}
> {code}
>   private Long getEasedOutDenominator(List<Long> distinctVals) {
>     // Exponential back-off for NDVs.
>     // 1) Descending order sort of NDVs
>     // 2) denominator = NDV1 * (NDV2 ^ (1/2)) * (NDV3 ^ (1/4)) * ....
>     Collections.sort(distinctVals, Collections.reverseOrder());
>     long denom = distinctVals.get(0);
>     for (int i = 1; i < distinctVals.size(); i++) {
>       denom = (long) (denom * Math.pow(distinctVals.get(i), 1.0 / (1 << i)));
>     }
>     return denom;
>   }
> {code}
> This gets {{[8007986, 821974390, 821974390]}}, which is actually 3 columns, 
> 2 of which are derived from the same column.
> {code}
>         Reduce Output Operator (RS_12)
>           key expressions: _col0 (type: bigint), year(_col2) (type: int), month(_col2) (type: int)
>           sort order: +++
>           Map-reduce partition columns: _col0 (type: bigint), year(_col2) (type: int), month(_col2) (type: int)
>           value expressions: _col1 (type: bigint)
>           Join Operator (JOIN_13)
>             condition map:
>                  Inner Join 0 to 1
>             keys:
>               0 _col0 (type: bigint), year(_col1) (type: int), month(_col1) (type: int)
>               1 _col0 (type: bigint), year(_col2) (type: int), month(_col2) (type: int)
>             outputColumnNames: _col3
> {code}
> So the eased out denominator is off by a factor of 30,000 or so, causing OOMs 
> in map-joins.
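
To see how the three reported NDVs combine, the quoted method can be run on 
its own. This is a standalone sketch: the class name 
EasedOutDenominatorSketch is hypothetical, and unlike the quoted method it 
sorts a copy of the input instead of the caller's list. It only evaluates the 
existing formula and does not attempt the duplicate detection the issue asks 
for.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class EasedOutDenominatorSketch {
    // Same exponential back-off as the quoted method:
    // denominator = NDV1 * NDV2^(1/2) * NDV3^(1/4) * ... over NDVs sorted descending.
    static long easedOutDenominator(List<Long> distinctVals) {
        List<Long> sorted = new ArrayList<>(distinctVals); // leave the input untouched
        Collections.sort(sorted, Collections.reverseOrder());
        long denom = sorted.get(0);
        for (int i = 1; i < sorted.size(); i++) {
            // The i-th NDV is damped by the exponent 1 / 2^i; the running
            // product is truncated to a long at every step, as in the original.
            denom = (long) (denom * Math.pow(sorted.get(i), 1.0 / (1 << i)));
        }
        return denom;
    }

    public static void main(String[] args) {
        // The NDV list reported in the issue.
        List<Long> reported = Arrays.asList(8007986L, 821974390L, 821974390L);
        System.out.println("denominator = " + easedOutDenominator(reported));
    }
}
```

Because both UDF-derived keys carry the same large NDV, they end up as NDV1 
and NDV2 after the descending sort and so are damped the least.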



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
