[ https://issues.apache.org/jira/browse/HIVE-12491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ashutosh Chauhan updated HIVE-12491:
------------------------------------
    Summary: Improve ndv heuristic for functions  (was: Column Statistics: 3 attribute join on a 2-source table is off)

> Improve ndv heuristic for functions
> -----------------------------------
>
>                 Key: HIVE-12491
>                 URL: https://issues.apache.org/jira/browse/HIVE-12491
>             Project: Hive
>          Issue Type: Bug
>          Components: Statistics
>    Affects Versions: 1.3.0, 2.0.0
>            Reporter: Gopal V
>            Assignee: Ashutosh Chauhan
>         Attachments: HIVE-12491.2.patch, HIVE-12491.3.patch, HIVE-12491.4.patch, HIVE-12491.5.patch, HIVE-12491.WIP.patch, HIVE-12491.patch
>
>
> The eased out denominator has to detect duplicate row-stats from different attributes.
> {code}
> select account_id from customers c, customer_activation ca
> where c.customer_id = ca.customer_id
> and year(ca.dt) = year(c.dt) and month(ca.dt) = month(c.dt)
> and year(ca.dt) between year('2013-12-26') and year('2013-12-26')
> {code}
> {code}
>   private Long getEasedOutDenominator(List<Long> distinctVals) {
>       // Exponential back-off for NDVs.
>       // 1) Descending order sort of NDVs
>       // 2) denominator = NDV1 * (NDV2 ^ (1/2)) * (NDV3 ^ (1/4))) * ....
>       Collections.sort(distinctVals, Collections.reverseOrder());
>       long denom = distinctVals.get(0);
>       for (int i = 1; i < distinctVals.size(); i++) {
>         denom = (long) (denom * Math.pow(distinctVals.get(i), 1.0 / (1 << i)));
>       }
>       return denom;
>   }
> {code}
> This gets {{[8007986, 821974390, 821974390]}}, which is actually 3 columns 2 of which are derived from the same column.
> {code}
>         Reduce Output Operator (RS_12)
>           key expressions: _col0 (type: bigint), year(_col2) (type: int), month(_col2) (type: int)
>           sort order: +++
>           Map-reduce partition columns: _col0 (type: bigint), year(_col2) (type: int), month(_col2) (type: int)
>           value expressions: _col1 (type: bigint)
>         Join Operator (JOIN_13)
>           condition map:
>                Inner Join 0 to 1
>           keys:
>             0 _col0 (type: bigint), year(_col1) (type: int), month(_col1) (type: int)
>             1 _col0 (type: bigint), year(_col2) (type: int), month(_col2) (type: int)
>           outputColumnNames: _col3
> {code}
> So the eased out denominator is off by a factor of 30,000 or so, causing OOMs in map-joins.
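
To make the size of the error concrete, the sketch below re-runs the back-off from the description on the reported NDV list. It is an illustration only: the class name `EasedOutDenominatorSketch` and the helper `termFactor` are made up for this example and are not part of Hive; the loop is copied from the snippet quoted in the description. After the descending sort, the repeated value 821974390 sits at position 1 and is multiplied in as its square root, roughly 28,670. That is one plausible reading of where the reported "factor of 30,000 or so" comes from.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-alone class for illustration; not Hive code.
class EasedOutDenominatorSketch {

  // Same exponential back-off as the snippet quoted in the description,
  // applied to a copy so the caller's list is left untouched.
  static long easedOutDenominator(List<Long> distinctVals) {
    List<Long> sorted = new ArrayList<>(distinctVals);
    Collections.sort(sorted, Collections.reverseOrder());
    long denom = sorted.get(0);
    for (int i = 1; i < sorted.size(); i++) {
      denom = (long) (denom * Math.pow(sorted.get(i), 1.0 / (1 << i)));
    }
    return denom;
  }

  // Hypothetical helper: the multiplier contributed by an NDV that ends up
  // at the given 0-based position after the descending sort.
  static double termFactor(long ndv, int position) {
    return Math.pow(ndv, 1.0 / (1 << position));
  }

  public static void main(String[] args) {
    // NDV list reported in the issue: two entries come from the same column.
    List<Long> ndvs = Arrays.asList(8007986L, 821974390L, 821974390L);
    System.out.println("denominator = " + easedOutDenominator(ndvs));
    System.out.println("factor from repeated NDV = " + termFactor(821974390L, 1));
  }
}
```

Printing the per-term factor separately makes it easy to see how much of the denominator comes from the duplicated statistic rather than from an independent join attribute.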