[ 
https://issues.apache.org/jira/browse/HIVE-8196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mostafa Mokhtar updated HIVE-8196:
----------------------------------
    Description: 
To make the best out of dynamic partition pruning joins should be on the 
partitioning columns which results in dynamically pruning the partitions from 
the fact table based on the qualifying column keys from the dimension table, 
this type of joins negatively effects on cardinality estimates with fetch 
column stats enabled.
Currently we don't have statistics for partition columns and as a result NDV is 
set to row count, doing that negatively affects the estimated join selectivity 
from the join.

Workaround is to capture statistics for partition columns or use number of 
partitions incase dynamic partitioning is used.

In StatsUtils.getColStatisticsFromExpression is where count distincts gets set 
to row count 
{code}
  if (encd.getIsPartitionColOrVirtualCol()) {

        // vitual columns
        colType = encd.getTypeInfo().getTypeName();
        countDistincts = numRows;
        oi = encd.getWritableObjectInspector();
{code}


Query used to repro the issue :
{code}
set hive.stats.fetch.column.stats=ture;
set hive.tez.dynamic.partition.pruning=true;
explain select d_date 
from store_sales, date_dim 
where 
store_sales.ss_sold_date_sk = date_dim.d_date_sk and 
date_dim.d_year = 1998;
{code}

Plan 

  was:
When calculating estimate row counts & data size during physical plan 
generation in StatsRulesProcFactory doesn't know that there will be dynamic 
partition pruning and it is hard to know how many partitions will qualify at 
runtime, as a result with Dynamic partition pruning enabled a query 32 can run 
with 570 compared to 70 tasks with dynamic partition pruning disabled and 
actual partition filters on the fact table.

The long term solution for this issue is to use the cardinality estimates from 
CBO as it takes into account join selectivity and such, estimate from CBO won't 
address the number of the tasks used for the partitioned table but they will 
address the incorrect number of tasks used for the concequent reducers where 
the majority of the slowdown is coming from.

Plan dynamic partition pruning on 
{code}
   Map 5 
            Map Operator Tree:
                TableScan
                  alias: ss
                  filterExpr: ss_store_sk is not null (type: boolean)
                  Statistics: Num rows: 550076554 Data size: 47370018896 Basic 
stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: ss_store_sk is not null (type: boolean)
                    Statistics: Num rows: 275038277 Data size: 23685009448 
Basic stats: COMPLETE Column stats: NONE
                    Map Join Operator
                      condition map:
                           Inner Join 0 to 1
                      condition expressions:
                        0 {ss_store_sk} {ss_net_profit}
                        1 
                      keys:
                        0 ss_sold_date_sk (type: int)
                        1 d_date_sk (type: int)
                      outputColumnNames: _col6, _col21
                      input vertices:
                        1 Map 1
                      Statistics: Num rows: 302542112 Data size: 26053511168 
Basic stats: COMPLETE Column stats: NONE
                      Map Join Operator
                        condition map:
                             Inner Join 0 to 1
                        condition expressions:
                          0 {_col21}
                          1 {s_county} {s_state}
                        keys:
                          0 _col6 (type: int)
                          1 s_store_sk (type: int)
                        outputColumnNames: _col21, _col80, _col81
                        input vertices:
                          1 Map 2
                        Statistics: Num rows: 332796320 Data size: 28658862080 
Basic stats: COMPLETE Column stats: NONE
                        Map Join Operator
                          condition map:
                               Left Semi Join 0 to 1
                          condition expressions:
                            0 {_col21} {_col80} {_col81}
                            1 
                          keys:
                            0 _col81 (type: string)
                            1 _col0 (type: string)
                          outputColumnNames: _col21, _col80, _col81
                          input vertices:
                            1 Reducer 11
                          Statistics: Num rows: 366075968 Data size: 
31524749312 Basic stats: COMPLETE Column stats: NONE
                          Select Operator
                            expressions: _col81 (type: string), _col80 (type: 
string), _col21 (type: float)
                            outputColumnNames: _col81, _col80, _col21
                            Statistics: Num rows: 366075968 Data size: 
31524749312 Basic stats: COMPLETE Column stats: NONE
                            Group By Operator
                              aggregations: sum(_col21)
                              keys: _col81 (type: string), _col80 (type: 
string), '0' (type: string)
                              mode: hash
                              outputColumnNames: _col0, _col1, _col2, _col3
                              Statistics: Num rows: 1098227904 Data size: 
94574247936 Basic stats: COMPLETE Column stats: NONE
                              Reduce Output Operator
                                key expressions: _col0 (type: string), _col1 
(type: string), _col2 (type: string)
                                sort order: +++
                                Map-reduce partition columns: _col0 (type: 
string), _col1 (type: string), _col2 (type: string)
                                Statistics: Num rows: 1098227904 Data size: 
94574247936 Basic stats: COMPLETE Column stats: NONE
                                value expressions: _col3 (type: double)
{code}

Plan snippet with partition pruning off and explicit partition filters
{code}
    Map 5 
            Map Operator Tree:
                TableScan
                  alias: ss
                  filterExpr: ((ss_sold_date_sk is not null and ss_store_sk is 
not null) and ss_sold_date BETWEEN '1999-06-01' AND '2000-05-31') (type: 
boolean)
                  Statistics: Num rows: 110339135 Data size: 4817453454 Basic 
stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: (ss_sold_date_sk is not null and ss_store_sk is 
not null) (type: boolean)
                    Statistics: Num rows: 27584784 Data size: 1204363374 Basic 
stats: COMPLETE Column stats: NONE
                    Map Join Operator
                      condition map:
                           Inner Join 0 to 1
                      condition expressions:
                        0 {ss_store_sk} {ss_net_profit}
                        1 
                      keys:
                        0 ss_sold_date_sk (type: int)
                        1 d_date_sk (type: int)
                      outputColumnNames: _col7, _col22
                      input vertices:
                        1 Map 1
                      Statistics: Num rows: 30343264 Data size: 1324799744 
Basic stats: COMPLETE Column stats: NONE
                      Map Join Operator
                        condition map:
                             Inner Join 0 to 1
                        condition expressions:
                          0 {_col22}
                          1 {s_county} {s_state}
                        keys:
                          0 _col7 (type: int)
                          1 s_store_sk (type: int)
                        outputColumnNames: _col22, _col81, _col82
                        input vertices:
                          1 Map 2
                        Statistics: Num rows: 33377592 Data size: 1457279744 
Basic stats: COMPLETE Column stats: NONE
                        Map Join Operator
                          condition map:
                               Left Semi Join 0 to 1
                          condition expressions:
                            0 {_col22} {_col81} {_col82}
                            1 
                          keys:
                            0 _col82 (type: string)
                            1 _col0 (type: string)
                          outputColumnNames: _col22, _col81, _col82
                          input vertices:
                            1 Reducer 11
                          Statistics: Num rows: 36715352 Data size: 1603007744 
Basic stats: COMPLETE Column stats: NONE
                          Select Operator
                            expressions: _col82 (type: string), _col81 (type: 
string), _col22 (type: float)
                            outputColumnNames: _col82, _col81, _col22
                            Statistics: Num rows: 36715352 Data size: 
1603007744 Basic stats: COMPLETE Column stats: NONE
                            Group By Operator
                              aggregations: sum(_col22)
                              keys: _col82 (type: string), _col81 (type: 
string), '0' (type: string)
                              mode: hash
                              outputColumnNames: _col0, _col1, _col2, _col3
                              Statistics: Num rows: 36715352 Data size: 
1603007744 Basic stats: COMPLETE Column stats: NONE
                              Reduce Output Operator
                                key expressions: _col0 (type: string), _col1 
(type: string), _col2 (type: string)
                                sort order: +++
                                Map-reduce partition columns: _col0 (type: 
string), _col1 (type: string), _col2 (type: string)
                                Statistics: Num rows: 36715352 Data size: 
1603007744 Basic stats: COMPLETE Column stats: NONE
                                value expressions: _col3 (type: double)
{code}

Plan snippet with partition pruning on and fetch column stats also on , in this 
case row estimate a
{code}
      Map 5 
            Map Operator Tree:
                TableScan
                  alias: ss
                  filterExpr: ss_store_sk is not null (type: boolean)
                  Statistics: Num rows: 550076554 Data size: 47370018896 Basic 
stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: ss_store_sk is not null (type: boolean)
                    Statistics: Num rows: 537120379 Data size: 4195767284 Basic 
stats: COMPLETE Column stats: COMPLETE
                    Map Join Operator
                      condition map:
                           Inner Join 0 to 1
                      condition expressions:
                        0 {ss_store_sk} {ss_net_profit}
                        1 
                      keys:
                        0 ss_sold_date_sk (type: int)
                        1 d_date_sk (type: int)
                      outputColumnNames: _col6, _col21
                      input vertices:
                        1 Map 1
                      Statistics: Num rows: 36524 Data size: 292192 Basic 
stats: COMPLETE Column stats: COMPLETE
                      Map Join Operator
                        condition map:
                             Inner Join 0 to 1
                        condition expressions:
                          0 {_col21}
                          1 {s_county} {s_state}
                        keys:
                          0 _col6 (type: int)
                          1 s_store_sk (type: int)
                        outputColumnNames: _col21, _col80, _col81
                        input vertices:
                          1 Map 2
                        Statistics: Num rows: 45017 Data size: 8508213 Basic 
stats: COMPLETE Column stats: COMPLETE
                        Map Join Operator
                          condition map:
                               Left Semi Join 0 to 1
                          condition expressions:
                            0 {_col21} {_col80} {_col81}
                            1 
                          keys:
                            0 _col81 (type: string)
                            1 _col0 (type: string)
                          outputColumnNames: _col21, _col80, _col81
                          input vertices:
                            1 Reducer 11
                          Statistics: Num rows: 49518 Data size: 9359035 Basic 
stats: COMPLETE Column stats: NONE
                          Select Operator
                            expressions: _col81 (type: string), _col80 (type: 
string), _col21 (type: float)
                            outputColumnNames: _col81, _col80, _col21
                            Statistics: Num rows: 49518 Data size: 9359035 
Basic stats: COMPLETE Column stats: NONE
                            Group By Operator
                              aggregations: sum(_col21)
                              keys: _col81 (type: string), _col80 (type: 
string), '0' (type: string)
                              mode: hash
                              outputColumnNames: _col0, _col1, _col2, _col3
                              Statistics: Num rows: 148554 Data size: 28077105 
Basic stats: COMPLETE Column stats: NONE
                              Reduce Output Operator
                                key expressions: _col0 (type: string), _col1 
(type: string), _col2 (type: string)
                                sort order: +++
                                Map-reduce partition columns: _col0 (type: 
string), _col1 (type: string), _col2 (type: string)
                                Statistics: Num rows: 148554 Data size: 
28077105 Basic stats: COMPLETE Column stats: NONE
                                value expressions: _col3 (type: double)
{code}


> Joining on partition columns with fetch column stats enabled results it very 
> small CE which negatively affects query performance 
> ---------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-8196
>                 URL: https://issues.apache.org/jira/browse/HIVE-8196
>             Project: Hive
>          Issue Type: Bug
>          Components: Physical Optimizer
>    Affects Versions: 0.14.0
>            Reporter: Mostafa Mokhtar
>            Assignee: Prasanth J
>            Priority: Critical
>              Labels: performance
>             Fix For: 0.14.0
>
>
> To make the best out of dynamic partition pruning joins should be on the 
> partitioning columns which results in dynamically pruning the partitions from 
> the fact table based on the qualifying column keys from the dimension table, 
> this type of joins negatively effects on cardinality estimates with fetch 
> column stats enabled.
> Currently we don't have statistics for partition columns and as a result NDV 
> is set to row count, doing that negatively affects the estimated join 
> selectivity from the join.
> Workaround is to capture statistics for partition columns or use number of 
> partitions incase dynamic partitioning is used.
> In StatsUtils.getColStatisticsFromExpression is where count distincts gets 
> set to row count 
> {code}
>   if (encd.getIsPartitionColOrVirtualCol()) {
>         // vitual columns
>         colType = encd.getTypeInfo().getTypeName();
>         countDistincts = numRows;
>         oi = encd.getWritableObjectInspector();
> {code}
> Query used to repro the issue :
> {code}
> set hive.stats.fetch.column.stats=ture;
> set hive.tez.dynamic.partition.pruning=true;
> explain select d_date 
> from store_sales, date_dim 
> where 
> store_sales.ss_sold_date_sk = date_dim.d_date_sk and 
> date_dim.d_year = 1998;
> {code}
> Plan 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to