Naresh P R created HIVE-28746:
---------------------------------

             Summary: Provide an optional config to autogather column stats only for columns mentioned in CREATE TABLE STATEMENT
                 Key: HIVE-28746
                 URL: https://issues.apache.org/jira/browse/HIVE-28746
             Project: Hive
          Issue Type: New Feature
            Reporter: Naresh P R

Hive autogathers column stats by default (hive.stats.column.autogather=true) on all ETL jobs. This increases the size of the PART_COL_STATS table; my cluster has 350 GB of PART_COL_STATS data in the backend DB.

As part of the CREATE TABLE statement, we could have an optional config to enable or disable autogathering of column stats for specific columns, rather than collecting them automatically for the whole table.

The syntax could be as follows:
{code:java}
CREATE TABLE [TABLE_NAME] (
  COL1 [DATATYPE] 'COMMENT' [NO_STATS|NEED_STATS],
  ...
);

ALTER TABLE [TABLE_NAME] SET AUTOGATHER STATISTICS FOR COLUMNS [COMMA_SEPARATED_COL_NAMES] [NO_STATS|NEED_STATS];{code}
In an ETL flow, users could disable stats collection for the complete table by default and enable it only for the required columns. Users can identify the columns that take part in join conditions, GROUP BY, DPP (dynamic partition pruning), filter conditions, etc., and enable stats only for those columns.

This would let ETL jobs collect stats for only the few required columns on wide tables with many partitions.
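
For comparison, something close to this can be approximated today with existing Hive features: turn off autogather for the session and compute column stats explicitly for the chosen columns. The following is a minimal sketch, assuming a hypothetical table sales partitioned by ds with columns customer_id and region; the table, partition, and column names are illustrative only and are not taken from this issue.
{code:sql}
-- Hypothetical names: sales, ds, customer_id, region.
-- Disable automatic column stats gathering for this session's ETL writes.
SET hive.stats.column.autogather=false;

-- (ETL INSERT statements into sales run here.)

-- Gather column stats only for the columns used in joins and filters.
ANALYZE TABLE sales PARTITION (ds='2025-01-01')
  COMPUTE STATISTICS FOR COLUMNS customer_id, region;{code}
Unlike the per-column flag proposed above, this approach needs a separate ANALYZE pass over the data after each write, which is part of what the proposal would avoid.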