Rajesh Balamohan created HIVE-24296:
---------------------------------------
Summary: NDV adjusted twice causing reducer task underestimation
Key: HIVE-24296
URL: https://issues.apache.org/jira/browse/HIVE-24296
Project: Hive
Issue Type: Improvement
Reporter: Rajesh Balamohan
[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java#L2550]
{{StatsRulesProcFactory::updateColStats}}:
{code:java}
if (ratio <= 1.0) {
newDV = (long) Math.ceil(ratio * oldDV);
}
cs.setCountDistint(newDV);
{code}
Though RelHive* already has the latest statistics, the NDV is adjusted again in
{{StatsRulesProcFactory::updateColStats}}, and the adjustment is linear in the
row-count ratio. Because of this, the downstream vertex gets fewer tasks than it
should, causing latency issues.
E.g., TPCDS Q10 at 10 TB scale. Attaching a snippet of "explain analyze" which
shows the stats underestimation.
"Reducer 13" is underestimated by ~10x compared to the runtime details. The NDV
projected by RelHive* was around 65989699.
However, due to the ratio calculation in StatsRulesProcFactory, it gets
readjusted to (948122598 / 14291978461) * 65989699 ~= 4377723.
It would be good to remove this static readjustment in StatsRulesProcFactory.
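To make the double-adjustment concrete, here is a minimal standalone sketch (not Hive code; the class and method names are illustrative) of the linear scaling quoted above, plugging in the row counts and the upstream-projected NDV from the Q10 example:

{code:java}
// Illustrative sketch of the linear NDV readjustment in
// StatsRulesProcFactory::updateColStats. Class/method names here are
// hypothetical; only the ratio formula mirrors the quoted snippet.
public class NdvAdjustmentSketch {

    // When the operator shrinks the row count, the NDV is scaled down
    // by the same ratio, even if it was already estimated accurately.
    static long adjustNdv(long oldNumRows, long newNumRows, long oldDV) {
        double ratio = (double) newNumRows / (double) oldNumRows;
        long newDV = oldDV;
        if (ratio <= 1.0) {
            newDV = (long) Math.ceil(ratio * oldDV);
        }
        return newDV;
    }

    public static void main(String[] args) {
        long oldNumRows = 14291978461L; // rows before the map join (Map 12)
        long newNumRows = 948122598L;   // rows after the map join
        long projectedNdv = 65989699L;  // NDV already projected upstream

        long readjusted = adjustNdv(oldNumRows, newNumRows, projectedNdv);
        // (948122598 / 14291978461) * 65989699 ~= 4.38M, i.e. the
        // already-reasonable NDV is shrunk roughly 15x, which in turn
        // drives the reducer parallelism estimate down.
        System.out.println(readjusted);
    }
}
{code}

The sketch shows why the second adjustment is harmful: the upstream NDV of ~66M already accounted for the join selectivity, so scaling it again by the row-count ratio double-counts the reduction.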
{noformat}
Edges:
Map 10 <- Map 9 (BROADCAST_EDGE)
Map 12 <- Map 9 (BROADCAST_EDGE)
Map 2 <- Map 7 (BROADCAST_EDGE)
Map 8 <- Map 9 (BROADCAST_EDGE), Reducer 6 (BROADCAST_EDGE)
Reducer 11 <- Map 10 (SIMPLE_EDGE)
Reducer 13 <- Map 12 (SIMPLE_EDGE)
Reducer 3 <- Map 1 (BROADCAST_EDGE), Map 2 (CUSTOM_SIMPLE_EDGE), Map 8
(CUSTOM_SIMPLE_EDGE), Reducer 11 (BROADCAST_EDGE), Reducer 13 (BROADCAST_EDGE)
Reducer 4 <- Reducer 3 (SIMPLE_EDGE)
Reducer 5 <- Reducer 4 (SIMPLE_EDGE)
Reducer 6 <- Map 2 (CUSTOM_SIMPLE_EDGE)
Map 12
Map Operator Tree:
TableScan
alias: catalog_sales
filterExpr: cs_ship_customer_sk is not null (type: boolean)
Statistics: Num rows: 14327953968/552509183 Data size:
228959459440 Basic stats: COMPLETE Column stats: COMPLETE
Filter Operator
predicate: cs_ship_customer_sk is not null (type: boolean)
Statistics: Num rows: 14291978461/551122492 Data size:
228384573968 Basic stats: COMPLETE Column stats: COMPLETE
Select Operator
expressions: cs_ship_customer_sk (type: bigint),
cs_sold_date_sk (type: bigint)
outputColumnNames: _col0, _col1
Statistics: Num rows: 14291978461/551122492 Data size:
228384573968 Basic stats: COMPLETE Column stats: COMPLETE
Map Join Operator
condition map:
Inner Join 0 to 1
keys:
0 _col1 (type: bigint)
1 _col0 (type: bigint)
outputColumnNames: _col0
input vertices:
1 Map 9
Statistics: Num rows: 948122598/551122492 Data size:
7297899376 Basic stats: COMPLETE Column stats: COMPLETE
Group By Operator
keys: _col0 (type: bigint)
minReductionHashAggr: 0.99
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 126954025/61576194 Data size:
977191880 Basic stats: COMPLETE Column stats: COMPLETE
Reduce Output Operator
key expressions: _col0 (type: bigint)
null sort order: a
sort order: +
Map-reduce partition columns: _col0 (type: bigint)
Statistics: Num rows: 126954025/61576194 Data size:
977191880 Basic stats: COMPLETE Column stats: COMPLETE
...
...
Reducer 13
Execution mode: vectorized, llap
Reduce Operator Tree:
Group By Operator
keys: KEY._col0 (type: bigint)
mode: mergepartial
outputColumnNames: _col0
Statistics: Num rows: 4377725/40166690 Data size: 33696280
Basic stats: COMPLETE Column stats: COMPLETE
Select Operator
expressions: true (type: boolean), _col0 (type: bigint)
outputColumnNames: _col0, _col1
Statistics: Num rows: 4377725/40166690 Data size: 51207180
Basic stats: COMPLETE Column stats: COMPLETE
Reduce Output Operator
key expressions: _col1 (type: bigint)
null sort order: a
sort order: +
Map-reduce partition columns: _col1 (type: bigint)
Statistics: Num rows: 4377725/40166690 Data size: 51207180
Basic stats: COMPLETE Column stats: COMPLETE
value expressions: _col0 (type: boolean)
{noformat}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)