[ https://issues.apache.org/jira/browse/HIVE-27163?focusedWorklogId=861725&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-861725 ]
ASF GitHub Bot logged work on HIVE-27163:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 12/May/23 10:55
            Start Date: 12/May/23 10:55
    Worklog Time Spent: 10m
      Work Description: simhadri-g commented on code in PR #4228:
URL: https://github.com/apache/hive/pull/4228#discussion_r1192213156


##########
iceberg/iceberg-handler/src/test/results/positive/col_stats.q.out:
##########
@@ -339,17 +339,16 @@ POSTHOOK: type: DESCTABLE
 POSTHOOK: Input: default@tbl_ice_puffin
 col_name            a
 data_type           int
-min                 1
-max                 333
-num_nulls           0
-distinct_count      7
+min
+max
+num_nulls
+distinct_count

Review Comment:
This part of the output corresponds to the following code snippet.

```
set hive.iceberg.stats.source=iceberg;
drop table if exists tbl_ice_puffin;
create external table tbl_ice_puffin(a int, b string, c int) stored by iceberg tblproperties ('format-version'='2');
insert into tbl_ice_puffin values (1, 'one', 50), (2, 'two', 51),(2, 'two', 51),(2, 'two', 51), (3, 'three', 52), (4, 'four', 53), (5, 'five', 54), (111, 'one', 55), (333, 'two', 56);
explain select * from tbl_ice_puffin order by a, b, c;
select * from tbl_ice_puffin order by a, b, c;
select count(*) from tbl_ice_puffin ;
desc formatted tbl_ice_puffin a;
```

In this case, the output of `desc formatted tbl_ice_puffin a;` is accurate and not stale (min = 1, max = 333).

I think we should either:
1. Source the stats for desc table from Puffin files for Iceberg tables, or
2. Add additional logic in HMS to address this.

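The claim that the removed stats were accurate can be cross-checked by hand. The following is only a minimal sketch in plain Python, not Hive code: the helper `column_stats` is a hypothetical name introduced for illustration, and the values are copied from the `insert` statement in the review comment. Hive's `distinct_count` is in general an estimate, so an exact match is not guaranteed on larger data; for these nine rows the exact count agrees with the removed output.

```python
# Values inserted into column `a` of tbl_ice_puffin, copied from the snippet above.
a_values = [1, 2, 2, 2, 3, 4, 5, 111, 333]


def column_stats(values):
    """Compute exact min, max, null count and distinct count (hypothetical helper)."""
    non_null = [v for v in values if v is not None]
    return {
        "min": min(non_null),
        "max": max(non_null),
        "num_nulls": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
    }


stats = column_stats(a_values)
print(stats)
```

The result matches the four lines removed from `col_stats.q.out` (min 1, max 333, num_nulls 0, distinct_count 7), which supports the point that those stats were not stale.
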
Issue Time Tracking
-------------------

    Worklog Id:     (was: 861725)
    Time Spent: 5h 50m  (was: 5h 40m)

> Column stats are not getting published after an insert query into an external
> table with custom location
> --------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-27163
>                 URL: https://issues.apache.org/jira/browse/HIVE-27163
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>            Reporter: Taraka Rama Rao Lethavadla
>            Assignee: Zhihua Deng
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>          Time Spent: 5h 50m
>  Remaining Estimate: 0h
>
> Test case details are below
> *test.q*
> {noformat}
> set hive.stats.column.autogather=true;
> set hive.stats.autogather=true;
> dfs ${system:test.dfs.mkdir} ${system:test.tmp.dir}/test;
> create external table test_custom(age int, name string) stored as orc
> location '/tmp/test';
> insert into test_custom select 1, 'test';
> desc formatted test_custom age;{noformat}
> *test.q.out*
>
>
> {noformat}
> #### A masked pattern was here ####
> PREHOOK: type: CREATETABLE
> #### A masked pattern was here ####
> PREHOOK: Output: database:default
> PREHOOK: Output: default@test_custom
> #### A masked pattern was here ####
> POSTHOOK: type: CREATETABLE
> #### A masked pattern was here ####
> POSTHOOK: Output: database:default
> POSTHOOK: Output: default@test_custom
> PREHOOK: query: insert into test_custom select 1, 'test'
> PREHOOK: type: QUERY
> PREHOOK: Input: _dummy_database@_dummy_table
> PREHOOK: Output: default@test_custom
> POSTHOOK: query: insert into test_custom select 1, 'test'
> POSTHOOK: type: QUERY
> POSTHOOK: Input: _dummy_database@_dummy_table
> POSTHOOK: Output: default@test_custom
> POSTHOOK: Lineage: test_custom.age SIMPLE []
> POSTHOOK: Lineage: test_custom.name SIMPLE []
> PREHOOK: query: desc formatted test_custom age
> PREHOOK: type: DESCTABLE
> PREHOOK: Input: default@test_custom
> POSTHOOK: query: desc formatted test_custom age
> POSTHOOK: type: DESCTABLE
> POSTHOOK: Input: default@test_custom
> col_name            age
> data_type           int
> min
> max
> num_nulls
> distinct_count
> avg_col_len
> max_col_len
> num_trues
> num_falses
> bit_vector
> comment             from deserializer{noformat}
> As we can see from desc formatted output, column stats were not populated



--
This message was sent by Atlassian Jira
(v8.20.10#820010)