[ https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771971#comment-16771971 ]
Antal Sinkovits commented on HIVE-20523:
----------------------------------------

Hi [~george.pachitariu]

Thanks for the answer. I understood your code; this was the initial approach I was planning to take as well. :)

The issue I see is that you only implemented the write (serialize) path, while the read (deserialize) path remains as is. Let me give an example, which might shed some light on what I mean. For the setup, I applied your patch on top of master and nothing else.

create table case1 (col string) stored as parquet;
insert into case1 values("This is a test string");       // -> rawDataSize: 105
analyze table case1 compute statistics;                  // -> rawDataSize: 144
analyze table case1 compute statistics for columns;      // -> rawDataSize: 1

Now if I start to mix these, things get more interesting, because your change only calculates statistics for the data it writes. For example, if I run these commands:

create table case2 (col string) stored as parquet;
insert into case2 values("This is a test string");       // -> rawDataSize: 105
analyze table case2 compute statistics for columns;      // -> rawDataSize: 1
insert into case2 values("This is a test string");       // -> rawDataSize: 106 (1+105)

That's why I think there should be a single source of truth. I checked with the Parquet team, and unfortunately Parquet (unlike ORC) doesn't provide any API on the writer side to get the total size. It is available only on the reader side, because the value is internal to Parquet and is written out only when the file is closed. So it makes sense to use this as our single source of truth.

HIVE-20079 was done by [~aihuaxu]; I don't want to take credit for it. That change moves the stats calculation from the SerDe to the writer: when the writer closes the file and Parquet writes the footer, it reads the value back from the closed file and updates the stats. This fixes the write path.

HIVE-21284 was done by me; it fixes the read portion so that "analyze ... compute statistics for columns" uses the same footer value. This way the calculated value stays consistent, no matter which path you take.

Let me know if this makes sense or not. Thanks.

> Improve table statistics for Parquet format
> -------------------------------------------
>
>                 Key: HIVE-20523
>                 URL: https://issues.apache.org/jira/browse/HIVE-20523
>             Project: Hive
>          Issue Type: Improvement
>          Components: Physical Optimizer
>            Reporter: George Pachitariu
>            Assignee: George Pachitariu
>            Priority: Minor
>         Attachments: HIVE-20523.1.patch, HIVE-20523.10.patch, HIVE-20523.11.patch, HIVE-20523.12.patch, HIVE-20523.2.patch, HIVE-20523.3.patch, HIVE-20523.4.patch, HIVE-20523.5.patch, HIVE-20523.6.patch, HIVE-20523.7.patch, HIVE-20523.8.patch, HIVE-20523.9.patch, HIVE-20523.patch
>
>
> Right now, in the table basic statistics, the *raw data size* for a row with any data type in the Parquet format is 1. This is an underestimate when columns are complex data structures, like arrays.
> Having tables with an underestimated raw data size makes Hive assign fewer containers (mappers/reducers) to them, making the overall query slower.
> Heavy underestimation also makes Hive choose a MapJoin instead of a ShuffleJoin, and the MapJoin can then fail with OOM errors.
> In this patch, I compute the column data sizes more accurately, taking complex structures into account. I followed the Writer implementation for the ORC format.
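For reference, a minimal sketch of the reader-side point made in the comment above: the total data size is only available from the Parquet footer after the file has been closed, and parquet-hadoop exposes it through the footer metadata. This is not the code of any of the patches discussed here; the class name FooterSizeExample and the command-line argument handling are purely illustrative.

// Sketch: read the total (uncompressed) byte size from a Parquet footer.
// Illustrates that the size is a reader-side value, present only once the
// file is closed and the footer is written.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class FooterSizeExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path file = new Path(args[0]); // path to a closed Parquet file

    try (ParquetFileReader reader =
             ParquetFileReader.open(HadoopInputFile.fromPath(file, conf))) {
      ParquetMetadata footer = reader.getFooter();
      long totalByteSize = 0;
      long rowCount = 0;
      // The footer contains one BlockMetaData per row group; summing the
      // per-row-group totals gives the uncompressed data size of the file.
      for (BlockMetaData block : footer.getBlocks()) {
        totalByteSize += block.getTotalByteSize();
        rowCount += block.getRowCount();
      }
      System.out.println("rows=" + rowCount + ", totalByteSize=" + totalByteSize);
    }
  }
}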