[ 
https://issues.apache.org/jira/browse/HIVE-24764?focusedWorklogId=552861&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-552861
 ]

ASF GitHub Bot logged work on HIVE-24764:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 16/Feb/21 08:42
            Start Date: 16/Feb/21 08:42
    Worklog Time Spent: 10m 
      Work Description: rbalamohan closed pull request #1967:
URL: https://github.com/apache/hive/pull/1967


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 552861)
    Time Spent: 20m  (was: 10m)

> insert overwrite on a partition resets row count stats in other partitions
> --------------------------------------------------------------------------
>
>                 Key: HIVE-24764
>                 URL: https://issues.apache.org/jira/browse/HIVE-24764
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> After insert overwrite on a partition, stats on other partitions are messed 
> up. Subsequent queries end up with plans with PARTIAL stats. In certain 
> cases, this leads to suboptimal query plans.
> {noformat}
> drop table if exists test_stats;
> drop table if exists test_stats_2;
> create table test_stats(i int, j bigint);
> create table test_stats_2(i int) partitioned by (j bigint);
> insert into test_stats values (1, 1), (2, 2), (3, 3), (4, 4), (5, NULL);
> -- select * from test_stats;
> 1   1
> 2   2
> 3   3
> 4   4
> 5   <null>
> insert overwrite table test_stats_2 partition(j)  select i, j from test_stats 
> where j is not null;
> -- After executing this statement, stat gets messed up.
> insert overwrite table test_stats_2 partition(j)  select i, j from test_stats 
> where j is null;
> -- select * from test_stats_2;
> 1   1
> 2   2
> 3   3
> 4   4
> 5   <null>
> -- This would return "PARTIAL" stats instead of "COMPLETE"
> explain select i, count(*) as c from test_stats_2 group by i order by c desc 
> limit 10;
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-1
>     Tez
>       DagId: hive_20210208093110_62ced99e-f068-42d4-9ba8-d45fccd6c0a2:68
>       Edges:
>         Reducer 2 <- Map 1 (SIMPLE_EDGE)
>         Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
>       DagName: hive_20210208093110_62ced99e-f068-42d4-9ba8-d45fccd6c0a2:68
>       Vertices:
>         Map 1 
>             Map Operator Tree:
>                 TableScan
>                   alias: test_stats_2
>                   Statistics: Num rows: 125 Data size: 500 Basic stats: 
> PARTIAL Column stats: COMPLETE
>                   Select Operator
>                     expressions: i (type: int)
>                     outputColumnNames: i
>                     Statistics: Num rows: 125 Data size: 500 Basic stats: 
> PARTIAL Column stats: COMPLETE
>                     Group By Operator
>                       aggregations: count()
>                       keys: i (type: int)
>                       minReductionHashAggr: 0.99
>                       mode: hash
>                       outputColumnNames: _col0, _col1
>                       Statistics: Num rows: 125 Data size: 1500 Basic stats: 
> PARTIAL Column stats: COMPLETE
>                       Reduce Output Operator
>                         key expressions: _col0 (type: int)
>                         null sort order: a
>                         sort order: +
>                         Map-reduce partition columns: _col0 (type: int)
>                         Statistics: Num rows: 125 Data size: 1500 Basic 
> stats: PARTIAL Column stats: COMPLETE
>                         value expressions: _col1 (type: bigint)
>             Execution mode: vectorized, llap
>             LLAP IO: may be used (ACID table)
>         Reducer 2 
>             Execution mode: vectorized, llap
>             Reduce Operator Tree:
>               Group By Operator
>                 aggregations: count(VALUE._col0)
>                 keys: KEY._col0 (type: int)
>                 mode: mergepartial
>                 outputColumnNames: _col0, _col1
>                 Statistics: Num rows: 62 Data size: 744 Basic stats: PARTIAL 
> Column stats: COMPLETE
>                 Top N Key Operator
>                   sort order: -
>                   keys: _col1 (type: bigint)
>                   null sort order: a
>                   Statistics: Num rows: 62 Data size: 744 Basic stats: 
> PARTIAL Column stats: COMPLETE
>                   top n: 10
>                   Reduce Output Operator
>                     key expressions: _col1 (type: bigint)
>                     null sort order: a
>                     sort order: -
>                     Statistics: Num rows: 62 Data size: 744 Basic stats: 
> PARTIAL Column stats: COMPLETE
>                     TopN Hash Memory Usage: 0.04
>                     value expressions: _col0 (type: int)
>         Reducer 3 
>             Execution mode: vectorized, llap
>             Reduce Operator Tree:
>               Select Operator
>                 expressions: VALUE._col0 (type: int), KEY.reducesinkkey0 
> (type: bigint)
>                 outputColumnNames: _col0, _col1
>                 Statistics: Num rows: 62 Data size: 744 Basic stats: PARTIAL 
> Column stats: COMPLETE
>                 Limit
>                   Number of rows: 10
>                   Statistics: Num rows: 10 Data size: 120 Basic stats: 
> PARTIAL Column stats: COMPLETE
>                   File Output Operator
>                     compressed: false
>                     Statistics: Num rows: 10 Data size: 120 Basic stats: 
> PARTIAL Column stats: COMPLETE
>                     table:
>                         input format: 
> org.apache.hadoop.mapred.SequenceFileInputFormat
>                         output format: 
> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>                         serde: 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>   Stage: Stage-0
>     Fetch Operator
>       limit: 10
>       Processor Tree:
>         ListSink
> {noformat}
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/stats/fs/FSStatsAggregator.java#L138
> FSStats should return "null" or "" if partitions are not present in its list.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to