Naresh P R created HIVE-28746:
---------------------------------

             Summary: Provide an optional config to autogather column stats 
only for columns mentioned in CREATE TABLE STATEMENT
                 Key: HIVE-28746
                 URL: https://issues.apache.org/jira/browse/HIVE-28746
             Project: Hive
          Issue Type: New Feature
            Reporter: Naresh P R


By default, Hive autogathers column stats (hive.stats.column.autogather=true) on 
all ETL jobs. This inflates the PART_COL_STATS table in the backend DB; on my 
cluster it already holds 350 GB of data.

As part of the CREATE TABLE statement, we could have an optional setting to 
enable or disable autogathering of column stats for a few specific columns, 
rather than collecting them automatically for the whole table.

The syntax could look like this:

{code:sql}
CREATE TABLE [TABLE_NAME] (
  COL1 [DATATYPE] COMMENT 'comment' [NO_STATS|NEED_STATS],
  ...
);

ALTER TABLE [TABLE_NAME] SET AUTOGATHER STATISTICS FOR COLUMNS
  [COMMA_SEPARATED_COL_NAMES] [NO_STATS|NEED_STATS];
{code}
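
As an illustration only, here is how the proposed syntax might look on a concrete table. The table sales_events and its columns are hypothetical, and NO_STATS, NEED_STATS and SET AUTOGATHER STATISTICS are the keywords proposed in this ticket, not something Hive accepts today:

{code:sql}
-- Hypothetical table; only the join/filter columns keep autogathered stats.
CREATE TABLE sales_events (
  customer_id BIGINT COMMENT 'join key' NEED_STATS,
  event_type  STRING COMMENT 'filter column' NEED_STATS,
  payload     STRING COMMENT 'raw event body' NO_STATS
)
PARTITIONED BY (dt STRING);

-- Later, stop autogathering stats for a column that is no longer filtered on.
ALTER TABLE sales_events SET AUTOGATHER STATISTICS FOR COLUMNS
  event_type NO_STATS;
{code}
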
In an ETL flow, users could then disable stats collection for the complete 
table by default and enable it only for the required columns.

Users can identify the columns that take part in join conditions, GROUP BY, DPP 
(dynamic partition pruning), filter conditions, etc., and enable stats only for 
those. This would let ETL jobs collect stats for just the few required columns 
on wide tables with many partitions.
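
For comparison, something close to this can be approximated by hand today, assuming the ETL job controls its own session settings: turn autogather off, then compute stats explicitly for the chosen columns. The table, staging table and column names below are hypothetical:

{code:sql}
-- Skip automatic column stats for this session.
SET hive.stats.column.autogather=false;

INSERT OVERWRITE TABLE sales_events PARTITION (dt='2025-01-01')
SELECT customer_id, event_type, payload
FROM staging_events
WHERE dt = '2025-01-01';

-- Gather column stats only for the join/filter columns of the new partition.
ANALYZE TABLE sales_events PARTITION (dt='2025-01-01')
COMPUTE STATISTICS FOR COLUMNS customer_id, event_type;
{code}

This needs an extra ANALYZE step in every job, which is what a per-column table setting would avoid.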
