Please find answers inline. On Jun 23, 2016, at 3:49 PM, Lalitha MV <lalitham...@gmail.com> wrote:
> Hi,
>
> I am using Hadoop 2.7.2, Tez 0.8.3 and Hive 2.0.1.
>
> I created a Hive table from a text file of ~141 MB. show tblproperties of this table (textfile):
>
>     numFiles      1
>     numRows       1000000
>     rawDataSize   141869803
>     totalSize     142869803
>
> I then created a Hive table with ORC compression from the above table. The compressed file size is ~50 MB. show tblproperties for the new table (orc):
>
>     numFiles      1
>     numRows       1000000
>     rawDataSize   471000000
>     totalSize     50444668
>
> I had two questions regarding this:
>
> 1. Why is the rawDataSize so high for the ORC table (3.3 times the text file size)? How is the rawDataSize calculated in this case? Is it the sum of each column's datatype size, multiplied by numRows?

Yes, that is correct: rawDataSize = sum of the column datatype sizes * numRows. (A quick check against your numbers is at the end of this mail.)

> 2. In Hive query plans, the estimated data size of the tables in each phase (map and reduce) is equal to the rawDataSize, and the number of reducers gets calculated from this size (at least in Tez; not in the MR case). Isn't this wrong; shouldn't it pick the totalSize instead? Is there a way to force Hive/Tez to pick the totalSize in query plans, or at least while calculating the number of reducers?

Unlike some lazy text formats, the row/column vectors returned by ORC are eagerly deserialized, and ORC compresses the data by default. The on-disk representation (totalSize) is therefore not a direct reflection of how the data is processed in memory: because of encoding and compression, the on-disk representation is far smaller than the in-memory representation used by the operator pipeline. That is why raw data size is a better metric for reducer estimation than on-disk file size. If the raw data size does not exist, the optimizer falls back to totalSize.

Using totalSize for reducer estimation can badly underestimate the number of reducers required for compressed tables. Using raw data size may instead overestimate the number of reducers, but Tez offsets this with its auto reducer parallelism feature (hive.tez.auto.reducer.parallelism), which can downsize the number of reducers based on the bytes actually emitted by the previous map stage (the relevant settings are sketched at the end of this mail).

> Thanks in advance.
>
> Cheers,
> Lalitha
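Working through the numbers quoted above as a quick sanity check (plain arithmetic on the figures in this thread):

    rawDataSize / numRows     = 471000000 / 1000000 = 471 bytes per row (estimated in-memory)
    ORC totalSize / numRows   = 50444668  / 1000000 = ~50 bytes per row (on disk, compressed)
    text totalSize / numRows  = 142869803 / 1000000 = ~143 bytes per row (on disk, uncompressed)

So the estimated in-memory row width is ~3.3x the on-disk text width and ~9x the on-disk ORC width, which is exactly the gap the reducer estimate has to account for.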
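The settings involved, as a sketch of where to look rather than a recommendation (these are existing Hive configuration properties, but the values shown are only illustrative):

    -- let Tez downsize the reducer count at runtime from the bytes actually emitted
    SET hive.tez.auto.reducer.parallelism=true;
    -- target data size per reducer used for the initial estimate
    SET hive.exec.reducers.bytes.per.reducer=268435456;
    -- scaling factors that auto reducer parallelism applies to the initial estimate
    SET hive.tez.min.partition.factor=0.25;
    SET hive.tez.max.partition.factor=2.0;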
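And for anyone who wants to reproduce the comparison, a minimal HiveQL sketch; the table and column names are hypothetical and the schema is only for illustration:

    -- hypothetical text-format source table
    CREATE TABLE t_text (id BIGINT, payload STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      STORED AS TEXTFILE;

    -- ORC copy of the same data (picks up ORC's default compression)
    CREATE TABLE t_orc STORED AS ORC AS SELECT * FROM t_text;

    -- recompute basic stats, then inspect numRows/rawDataSize/totalSize
    ANALYZE TABLE t_orc COMPUTE STATISTICS;
    SHOW TBLPROPERTIES t_orc;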