[ https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16773046#comment-16773046 ]
Sahil Takiar commented on HIVE-20079:
-------------------------------------

FYI I don't think {{block.getTotalByteSize}} provides the size of the data when loaded into memory. From talking to a few Parquet folks, no method to get the raw data size exists. If we want to implement this patch we will have to do something similar to what ORC does: https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/impl/ReaderImpl.java#L601

> Populate more accurate rawDataSize for parquet format
> -----------------------------------------------------
>
>                 Key: HIVE-20079
>                 URL: https://issues.apache.org/jira/browse/HIVE-20079
>             Project: Hive
>          Issue Type: Improvement
>          Components: File Formats
>    Affects Versions: 2.0.0
>            Reporter: Aihua Xu
>            Assignee: Antal Sinkovits
>            Priority: Major
>         Attachments: HIVE-20079.1.patch, HIVE-20079.2.patch,
>                      HIVE-20079.3.patch
>
>
> Run the following queries and you will see that rawDataSize for the table is
> incorrectly reported as 4 (that is, the number of fields). We need to populate
> the correct data size so that data can be split properly.
> {noformat}
> SET hive.stats.autogather=true;
> CREATE TABLE parquet_stats (id int,str string) STORED AS PARQUET;
> INSERT INTO parquet_stats values(0, 'this is string 0'), (1, 'string 1');
> DESC FORMATTED parquet_stats;
> {noformat}
> {noformat}
> Table Parameters:
>   COLUMN_STATS_ACCURATE   true
>   numFiles                1
>   numRows                 2
>   rawDataSize             4
>   totalSize               373
>   transient_lastDdlTime   1530660523
> {noformat}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
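
For illustration only, here is a minimal sketch of what a type-based raw data size estimate could look like for the example table in the issue. It assumes that "something similar to what ORC does" means summing a per-column estimate derived from the column type and value counts; the ORC code linked in the comment may account for sizes differently. The names {{RawDataSizeSketch}}, {{Kind}} and {{ColumnSummary}}, and the fixed widths (4 bytes for an int, 8 for a long or double, UTF-8 byte length for strings), are hypothetical and are not part of Hive, Parquet or ORC.

```java
import java.util.Arrays;
import java.util.List;

public class RawDataSizeSketch {
    /** Hypothetical column kinds; not the Parquet or ORC type enums. */
    public enum Kind { INT, LONG, DOUBLE, STRING }

    /** Hypothetical per-column summary gathered while writing or scanning a file. */
    public static final class ColumnSummary {
        final Kind kind;
        final long valueCount;       // number of non-null values
        final long totalStringBytes; // sum of string lengths; only used for STRING

        public ColumnSummary(Kind kind, long valueCount, long totalStringBytes) {
            this.kind = kind;
            this.valueCount = valueCount;
            this.totalStringBytes = totalStringBytes;
        }
    }

    /** Estimate one column: fixed width per value, or summed lengths for strings. */
    public static long estimateColumn(ColumnSummary c) {
        switch (c.kind) {
            case INT:
                return c.valueCount * 4L; // assumed width
            case LONG:
            case DOUBLE:
                return c.valueCount * 8L; // assumed width
            case STRING:
                return c.totalStringBytes;
            default:
                throw new IllegalArgumentException("Unsupported kind: " + c.kind);
        }
    }

    /** Raw data size estimate for a table or file: the sum over its columns. */
    public static long estimate(List<ColumnSummary> columns) {
        long total = 0L;
        for (ColumnSummary c : columns) {
            total += estimateColumn(c);
        }
        return total;
    }

    public static void main(String[] args) {
        // The two rows from the issue: (0, 'this is string 0'), (1, 'string 1').
        long stringBytes = "this is string 0".length() + "string 1".length();
        List<ColumnSummary> columns = Arrays.asList(
            new ColumnSummary(Kind.INT, 2, 0),
            new ColumnSummary(Kind.STRING, 2, stringBytes));
        System.out.println("estimated rawDataSize = " + estimate(columns));
    }
}
```

Under these assumed widths the estimate grows with the row count and string lengths, unlike the reported value of 4, which only reflects the number of fields.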