[ 
https://issues.apache.org/jira/browse/HIVE-27163?focusedWorklogId=861724&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-861724
 ]

ASF GitHub Bot logged work on HIVE-27163:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 12/May/23 10:47
            Start Date: 12/May/23 10:47
    Worklog Time Spent: 10m 
      Work Description: dengzhhu653 commented on code in PR #4228:
URL: https://github.com/apache/hive/pull/4228#discussion_r1192192469


##########
iceberg/iceberg-handler/src/test/results/positive/col_stats.q.out:
##########
@@ -339,17 +339,16 @@ POSTHOOK: type: DESCTABLE
 POSTHOOK: Input: default@tbl_ice_puffin
 col_name               a                   
 data_type              int                 
-min                    1                   
-max                    333                 
-num_nulls              0                   
-distinct_count         7                   
+min                                        
+max                                        
+num_nulls                                  
+distinct_count                             

Review Comment:
   `desc formatted tbl_ice_puffin a` doesn't fetch the stats from the puffin 
files, even with `hive.iceberg.stats.source=iceberg`; instead it goes to the 
metastore for the stats.
   
   `tbl_ice_puffin` is an external table that is recreated (inserted) multiple 
times before the desc. When the table is created this time, the legacy data 
files left behind make HMS treat the column stats as stale (e.g., it can't 
assume that the row count is 0, or know the min/max of column a).
   As a result, the stats of the subsequent insertion ("values (1, 'one', 50), 
(2, 'two', 51),(2, 'two', 51),(2, 'two', 51), (3, 'three', 52), (4, 'four', 53)") 
can't be merged in HMS.
   
   There is an `explain select * from tbl_ice_puffin order by a, b, c;` before 
the desc; as it shows, the stats stored in the puffin files are not removed.
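   
   For reference, the sequence described above can be sketched roughly as 
follows. This is only an illustration, not the actual contents of the .q file: 
the column types and the `stored by iceberg` clause are assumptions inferred 
from the inserted values, and the earlier create/insert rounds are omitted.
   
   ```sql
   -- Ask Hive to read column stats from Iceberg (puffin files).
   set hive.iceberg.stats.source=iceberg;

   -- Assumed schema: a and c look like ints, b like a string.
   create external table tbl_ice_puffin(a int, b string, c int) stored by iceberg;

   -- The insertion quoted in the comment above.
   insert into tbl_ice_puffin values
     (1, 'one', 50), (2, 'two', 51), (2, 'two', 51), (2, 'two', 51),
     (3, 'three', 52), (4, 'four', 53);

   -- The plan still reflects the stats stored in the puffin files.
   explain select * from tbl_ice_puffin order by a, b, c;

   -- This goes to the metastore, where the column stats are treated as stale.
   desc formatted tbl_ice_puffin a;
   ```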





Issue Time Tracking
-------------------

    Worklog Id:     (was: 861724)
    Time Spent: 5h 40m  (was: 5.5h)

> Column stats are not getting published after an insert query into an external 
> table with custom location
> --------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-27163
>                 URL: https://issues.apache.org/jira/browse/HIVE-27163
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>            Reporter: Taraka Rama Rao Lethavadla
>            Assignee: Zhihua Deng
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>          Time Spent: 5h 40m
>  Remaining Estimate: 0h
>
> Test case details are below
> *test.q*
> {noformat}
> set hive.stats.column.autogather=true;
> set hive.stats.autogather=true;
> dfs ${system:test.dfs.mkdir} ${system:test.tmp.dir}/test;
> create external table test_custom(age int, name string) stored as orc 
> location '/tmp/test';
> insert into test_custom select 1, 'test';
> desc formatted test_custom age;{noformat}
> *test.q.out*
>  
>  
> {noformat}
> #### A masked pattern was here ####
> PREHOOK: type: CREATETABLE
> #### A masked pattern was here ####
> PREHOOK: Output: database:default
> PREHOOK: Output: default@test_custom
> #### A masked pattern was here ####
> POSTHOOK: type: CREATETABLE
> #### A masked pattern was here ####
> POSTHOOK: Output: database:default
> POSTHOOK: Output: default@test_custom
> PREHOOK: query: insert into test_custom select 1, 'test'
> PREHOOK: type: QUERY
> PREHOOK: Input: _dummy_database@_dummy_table
> PREHOOK: Output: default@test_custom
> POSTHOOK: query: insert into test_custom select 1, 'test'
> POSTHOOK: type: QUERY
> POSTHOOK: Input: _dummy_database@_dummy_table
> POSTHOOK: Output: default@test_custom
> POSTHOOK: Lineage: test_custom.age SIMPLE []
> POSTHOOK: Lineage: test_custom.name SIMPLE []
> PREHOOK: query: desc formatted test_custom age
> PREHOOK: type: DESCTABLE
> PREHOOK: Input: default@test_custom
> POSTHOOK: query: desc formatted test_custom age
> POSTHOOK: type: DESCTABLE
> POSTHOOK: Input: default@test_custom
> col_name                age
> data_type               int
> min
> max
> num_nulls
> distinct_count
> avg_col_len
> max_col_len
> num_trues
> num_falses
> bit_vector
> comment                 from deserializer{noformat}
> As the desc formatted output shows, the column stats were not populated.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
