Rajesh Balamohan created HIVE-24296:
---------------------------------------

             Summary: NDV adjusted twice causing reducer task underestimation
                 Key: HIVE-24296
                 URL: https://issues.apache.org/jira/browse/HIVE-24296
             Project: Hive
          Issue Type: Improvement
            Reporter: Rajesh Balamohan


[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java#L2550]

 

{{StatsRuleProcFactory::updateColStats}}::
{code:java}
        if (ratio <= 1.0) {
          newDV = (long) Math.ceil(ratio * oldDV);
        }

        cs.setCountDistint(newDV);
{code}
Though RelHive* has the latest statistics, it is adjusted again 
{{StatsRuleProcFactory::updateColStats}} and it is done at linear scale.

 

Because of this, downstream vertex gets lesser number of tasks causing latency 
issues.

E.g Q10 + TPCDS @10 TB scale. Attaching a snippet of "explain analyze" which 
shows stats underestimation.

"Reducer 13" is underestimated 10x, when compared to runtime details. Projected 
NDV from RelHive* was around 65989699.

However, due to the ratio calculation in StatsRuleProcFactory, it gets 
readjusted to ((948122598/14291978461) * 65989699)) ~= 4377723.

It would be good to remove static readjustment in StatsRuleProcFactory.
{noformat}
Edges:
        Map 10 <- Map 9 (BROADCAST_EDGE)
        Map 12 <- Map 9 (BROADCAST_EDGE)
        Map 2 <- Map 7 (BROADCAST_EDGE)
        Map 8 <- Map 9 (BROADCAST_EDGE), Reducer 6 (BROADCAST_EDGE)
        Reducer 11 <- Map 10 (SIMPLE_EDGE)
        Reducer 13 <- Map 12 (SIMPLE_EDGE)
        Reducer 3 <- Map 1 (BROADCAST_EDGE), Map 2 (CUSTOM_SIMPLE_EDGE), Map 8 
(CUSTOM_SIMPLE_EDGE), Reducer 11 (BROADCAST_EDGE), Reducer 13 (BROADCAST_EDGE)
        Reducer 4 <- Reducer 3 (SIMPLE_EDGE)
        Reducer 5 <- Reducer 4 (SIMPLE_EDGE)
        Reducer 6 <- Map 2 (CUSTOM_SIMPLE_EDGE)


Map 12
            Map Operator Tree:
                TableScan
                  alias: catalog_sales
                  filterExpr: cs_ship_customer_sk is not null (type: boolean)
                  Statistics: Num rows: 14327953968/552509183 Data size: 
228959459440 Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: cs_ship_customer_sk is not null (type: boolean)
                    Statistics: Num rows: 14291978461/551122492 Data size: 
228384573968 Basic stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: cs_ship_customer_sk (type: bigint), 
cs_sold_date_sk (type: bigint)
                      outputColumnNames: _col0, _col1
                      Statistics: Num rows: 14291978461/551122492 Data size: 
228384573968 Basic stats: COMPLETE Column stats: COMPLETE
                      Map Join Operator
                        condition map:
                             Inner Join 0 to 1
                        keys:
                          0 _col1 (type: bigint)
                          1 _col0 (type: bigint)
                        outputColumnNames: _col0
                        input vertices:
                          1 Map 9
                        Statistics: Num rows: 948122598/551122492 Data size: 
7297899376 Basic stats: COMPLETE Column stats: COMPLETE
                        Group By Operator
                          keys: _col0 (type: bigint)
                          minReductionHashAggr: 0.99
                          mode: hash
                          outputColumnNames: _col0
                          Statistics: Num rows: 126954025/61576194 Data size: 
977191880 Basic stats: COMPLETE Column stats: COMPLETE
                          Reduce Output Operator
                            key expressions: _col0 (type: bigint)
                            null sort order: a
                            sort order: +
                            Map-reduce partition columns: _col0 (type: bigint)
                            Statistics: Num rows: 126954025/61576194 Data size: 
977191880 Basic stats: COMPLETE Column stats: COMPLETE

...
...
Reducer 13
            Execution mode: vectorized, llap
            Reduce Operator Tree:
              Group By Operator
                keys: KEY._col0 (type: bigint)
                mode: mergepartial
                outputColumnNames: _col0
                Statistics: Num rows: 4377725/40166690 Data size: 33696280 
Basic stats: COMPLETE Column stats: COMPLETE
                Select Operator
                  expressions: true (type: boolean), _col0 (type: bigint)
                  outputColumnNames: _col0, _col1
                  Statistics: Num rows: 4377725/40166690 Data size: 51207180 
Basic stats: COMPLETE Column stats: COMPLETE
                  Reduce Output Operator
                    key expressions: _col1 (type: bigint)
                    null sort order: a
                    sort order: +
                    Map-reduce partition columns: _col1 (type: bigint)
                    Statistics: Num rows: 4377725/40166690 Data size: 51207180 
Basic stats: COMPLETE Column stats: COMPLETE
                    value expressions: _col0 (type: boolean)

{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to