George Pachitariu created HIVE-20523:
----------------------------------------

             Summary: Improve table statistics when the table contains arrays
                 Key: HIVE-20523
                 URL: https://issues.apache.org/jira/browse/HIVE-20523
             Project: Hive
          Issue Type: Improvement
          Components: Physical Optimizer
            Reporter: George Pachitariu
            Assignee: George Pachitariu


By default, when a table has table-level statistics, the value of 
*rawDataSize* is used to estimate the table's data size in the execution plan.

The problem is that *rawDataSize* does not include the data size of arrays. 
As a result, the table size is underestimated when arrays account for most of 
the data.

In those specific cases, the value of *totalSize* is much closer to the true 
size.

In this task I propose to take the maximum of *rawDataSize* and 
*totalSize* * *deserializationFactor*.
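
As a rough illustration of the proposed rule, here is a minimal sketch. The 
class and method names are hypothetical and are not taken from Hive's code 
base; the three inputs are assumed to be already read from the table 
statistics and the configuration.

```java
/** Hypothetical helper illustrating the proposal; not Hive's actual code. */
final class TableSizeEstimate {
    private TableSizeEstimate() {}

    /**
     * Returns the larger of rawDataSize and totalSize scaled by the
     * deserialization factor.
     *
     * @param rawDataSize           raw data size from the table statistics, in bytes
     * @param totalSize             on-disk size from the table statistics, in bytes
     * @param deserializationFactor assumed ratio of in-memory size to on-disk size
     */
    static long estimate(long rawDataSize, long totalSize, double deserializationFactor) {
        // Scale the on-disk size to approximate the deserialized size.
        long scaledTotalSize = (long) (totalSize * deserializationFactor);
        // Use whichever estimate is larger, so tables dominated by arrays
        // are not underestimated through rawDataSize alone.
        return Math.max(rawDataSize, scaledTotalSize);
    }
}
```

With this rule, a table whose rawDataSize is small but whose totalSize is 
large would be estimated from totalSize, while a table whose rawDataSize is 
already the larger value keeps its current estimate.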

I don't know whether this proposal could backfire in some cases by 
overestimating the size of tables.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
