[ 
https://issues.apache.org/jira/browse/HIVE-23054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated HIVE-23054:
----------------------------------
    Description: 
Store a counter in HMS column statistics for the total number of raw bytes in 
each column.

Right now, there is no good way to merge the average column length when 
performing an INSERT statement into a table.  The code currently just selects 
the maximum value.  However, if a single record with a long length (128 bytes) 
is inserted into a table that has millions of strings with an average length 
of 4, the average length for the entire data set gets boosted to 128.

{code:java}
aggregateData.setAvgColLen(Math.max(aggregateData.getAvgColLen(), 
newData.getAvgColLen()));
{code}

https://github.com/apache/hive/blob/e182d9ce6c09136d13ee889ef069b202f60052ec/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/columnstats/merge/StringColumnStatsMerger.java#L34

Store the total raw size of all the data in each column.  Given the total raw 
size and the average length on each side, the number of values can be derived 
(total size divided by average length), so one can compute the real average 
length when merging the existing data and the newly inserted data.
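
As a rough illustration of the merge described above, here is a minimal sketch 
in plain Java.  It does not use the real HMS classes: the ColStats holder, the 
totalRawBytes field name, and the merge method are all hypothetical, and it 
assumes the average length is computed over the same values that the byte 
counter covers.

```java
// Sketch only: ColStats, totalRawBytes, and merge() are hypothetical names,
// not part of the Hive metastore API.
final class AvgColLenMergeSketch {

    /** Hypothetical per-column statistics holder. */
    static final class ColStats {
        final long totalRawBytes; // the proposed new counter
        final double avgColLen;   // the average length that is tracked today

        ColStats(long totalRawBytes, double avgColLen) {
            this.totalRawBytes = totalRawBytes;
            this.avgColLen = avgColLen;
        }

        /** Number of values implied by the two fields (0 when the average is 0). */
        double impliedCount() {
            return avgColLen > 0.0 ? totalRawBytes / avgColLen : 0.0;
        }
    }

    /** Merges two sides by weighting each average by its implied value count. */
    static ColStats merge(ColStats a, ColStats b) {
        long totalBytes = Math.addExact(a.totalRawBytes, b.totalRawBytes);
        double count = a.impliedCount() + b.impliedCount();
        double avg = count > 0.0 ? totalBytes / count : 0.0;
        return new ColStats(totalBytes, avg);
    }

    public static void main(String[] args) {
        ColStats existing = new ColStats(4_000_000L, 4.0); // one million 4-byte values
        ColStats inserted = new ColStats(128L, 128.0);     // a single 128-byte value

        // Weighted merge: stays close to 4 (about 4.000124).
        System.out.println(merge(existing, inserted).avgColLen);

        // Current max-based merge for comparison: 128.0.
        System.out.println(Math.max(existing.avgColLen, inserted.avgColLen));
    }
}
```

With one million existing values averaging 4 bytes and one inserted 128-byte 
value, the weighted result is 4,000,128 / 1,000,001, so the single long record 
barely moves the average.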

  was:
Store a counter in HMS column statics for the total number of bytes (raw) in 
each column.

Right now, there is no good way to merge the average column length when 
performing an INSERT statement into a table.  Right now, the code just selects 
the maximum value, however, if inserting a single records with a long length 
(128 bytes) into a table that has millions of strings with an average length of 
4, the average length for the entire data set gets boosted to 128.

{code:java}
aggregateData.setAvgColLen(Math.max(aggregateData.getAvgColLen(), 
newData.getAvgColLen()));
{code}

https://github.com/apache/hive/blob/e182d9ce6c09136d13ee889ef069b202f60052ec/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/columnstats/merge/StringColumnStatsMerger.java#L34

Store the total raw size of all the data in each column.  Between the total raw 
size, and the average length, one can compute the real average length when 
merging the exiting data and the newly inserted data.


> Capture Total Byte Size in Column Statistics
> --------------------------------------------
>
>                 Key: HIVE-23054
>                 URL: https://issues.apache.org/jira/browse/HIVE-23054
>             Project: Hive
>          Issue Type: Improvement
>          Components: CBO, Statistics
>            Reporter: David Mollitor
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
