George Pachitariu created HIVE-20523:
----------------------------------------

             Summary: Improve table statistics when the table contains arrays
                 Key: HIVE-20523
                 URL: https://issues.apache.org/jira/browse/HIVE-20523
             Project: Hive
          Issue Type: Improvement
          Components: Physical Optimizer
            Reporter: George Pachitariu
            Assignee: George Pachitariu


By default, when table stats are available, the value of *rawDataSize* is used to estimate the table data size in the execution plan. The problem is that *rawDataSize* does not include the data size of arrays, so the table size is underestimated when arrays make up most of the table. In those cases, the value of *totalSize* is much closer to the truth.

In this task I propose to take the maximum of *rawDataSize* and (*totalSize* * deserializationFactor).

I don't know whether this proposal will backfire in specific cases by overestimating the size of tables.
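To make the proposal concrete, here is a minimal sketch of the calculation in Java. It is not Hive's actual code: the class name `TableSizeEstimate`, the method `estimateDataSize`, and the sample numbers are hypothetical, and the three inputs are assumed to be available as plain values.

```java
// Hypothetical illustration of the proposal; not part of Hive's code base.
final class TableSizeEstimate {

    private TableSizeEstimate() {
    }

    /**
     * Returns the larger of rawDataSize and totalSize scaled by the
     * deserialization factor, so that tables dominated by arrays are not
     * underestimated when rawDataSize leaves the array data out.
     */
    static long estimateDataSize(long rawDataSize, long totalSize, double deserializationFactor) {
        // Scale the on-disk size to approximate the in-memory size.
        long scaledTotalSize = (long) (totalSize * deserializationFactor);
        return Math.max(rawDataSize, scaledTotalSize);
    }

    public static void main(String[] args) {
        // Made-up numbers: rawDataSize misses most of the data (array-heavy table).
        System.out.println(estimateDataSize(1_000L, 50_000L, 1.0));
        // Made-up numbers: rawDataSize already exceeds the scaled totalSize.
        System.out.println(estimateDataSize(80_000L, 50_000L, 1.0));
    }
}
```

Because the result is a maximum, the estimate can never fall below *rawDataSize*. The only way it can differ from the current behavior is by being larger, which is exactly the overestimation risk raised above.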