[ https://issues.apache.org/jira/browse/HIVE-17108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16094128#comment-16094128 ]
liyunzhang_intel commented on HIVE-17108:
-----------------------------------------

The detailed reason why Parquet files do not gather statistics such as "RAW DATA SIZE" automatically:

When executing "INSERT OVERWRITE TABLE xxx SELECT * xxx", Hive with ORC updates the statistics from the ORC footer in [FileSinkOperator#closeOp|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L1060], while Hive with Parquet does not. OrcRecordWriter implements StatsProvidingRecordWriter; ParquetRecordWriterWrapper does not.

But I guess that even if ParquetRecordWriterWrapper implemented [StatsProvidingRecordWriter|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/StatsProvidingRecordWriter.java], statistics like "RAW DATA SIZE" could not be updated, because org.apache.parquet.hadoop.ParquetWriter does not provide an interface like getRawDataSize() or getRawCount().

> Parquet file does not gather statistic such as "RAW DATA SIZE" automatically
> -----------------------------------------------------------------------------
>
>                 Key: HIVE-17108
>                 URL: https://issues.apache.org/jira/browse/HIVE-17108
>             Project: Hive
>          Issue Type: Bug
>            Reporter: liyunzhang_intel
>
> In [parquet_analyze.q|https://github.com/apache/hive/blob/master/ql/src/test/queries/clientpositive/parquet_analyze.q#L27], we need to run "ANALYZE TABLE parquet_create_people COMPUTE STATISTICS noscan" to update the statistics.
> In [orc_analyze.q|https://github.com/apache/hive/blob/master/ql/src/test/queries/clientpositive/orc_analyze.q#L45], we do not need to do that if we set hive.stats.autogather to true.
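
To illustrate the comment above, here is a minimal, self-contained sketch of the kind of check that happens at close time. It is not Hive's actual code: Stats, RecordWriter, OrcLikeWriter, ParquetLikeWriter and this closeOp are simplified stand-ins invented for illustration, and "raw data size" is approximated by summing string lengths. Only the idea (statistics are collected only when the writer implements a stats-providing interface) comes from the comment.

```java
import java.util.HashMap;
import java.util.Map;

public class StatsGatherSketch {
    // Hypothetical stand-in for Hive's SerDeStats; only two counters are modelled.
    static final class Stats {
        final long rawDataSize;
        final long rowCount;

        Stats(long rawDataSize, long rowCount) {
            this.rawDataSize = rawDataSize;
            this.rowCount = rowCount;
        }
    }

    // Hypothetical stand-in for a plain record writer.
    interface RecordWriter {
        void write(String row);
        void close();
    }

    // Simplified stand-in for the role StatsProvidingRecordWriter plays.
    interface StatsProvidingRecordWriter extends RecordWriter {
        Stats getStats();
    }

    // Plays the role of OrcRecordWriter: it tracks statistics while writing.
    static final class OrcLikeWriter implements StatsProvidingRecordWriter {
        private long rawDataSize;
        private long rowCount;

        public void write(String row) {
            rawDataSize += row.length(); // assumption: string length stands in for raw size
            rowCount++;
        }

        public void close() { }

        public Stats getStats() {
            return new Stats(rawDataSize, rowCount);
        }
    }

    // Plays the role of ParquetRecordWriterWrapper: no statistics interface.
    static final class ParquetLikeWriter implements RecordWriter {
        public void write(String row) { }

        public void close() { }
    }

    // Sketch of the close-time check: stats are published only if the writer provides them.
    static Map<String, Long> closeOp(RecordWriter writer) {
        writer.close();
        Map<String, Long> tableStats = new HashMap<>();
        if (writer instanceof StatsProvidingRecordWriter) {
            Stats s = ((StatsProvidingRecordWriter) writer).getStats();
            tableStats.put("rawDataSize", s.rawDataSize);
            tableStats.put("numRows", s.rowCount);
        }
        return tableStats;
    }

    public static void main(String[] args) {
        RecordWriter orc = new OrcLikeWriter();
        orc.write("alice,30");
        orc.write("bob,25");

        RecordWriter parquet = new ParquetLikeWriter();
        parquet.write("alice,30");

        System.out.println("orc-like stats: " + closeOp(orc));
        System.out.println("parquet-like stats: " + closeOp(parquet));
    }
}
```

In this sketch the Parquet-like writer yields an empty statistics map, which corresponds to the situation where a separate "ANALYZE TABLE ... COMPUTE STATISTICS" run is needed. As the comment notes, making the real wrapper implement the interface would not be enough on its own, because the underlying writer would still have to expose the counters.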