Hi, I am using Hadoop 2.7.2, Tez 0.8.3 and Hive 2.0.1.
I created a Hive table from a text file of ~141 MB. show tblproperties for this table (textfile):

    numFiles     1
    numRows      1000000
    rawDataSize  141869803
    totalSize    142869803

I then created a second Hive table, with ORC compression, from the table above. The compressed file size is ~50 MB. show tblproperties for the new table (orc):

    numFiles     1
    numRows      1000000
    rawDataSize  471000000
    totalSize    50444668

I have two sets of questions about this:

1. Why is the rawDataSize so high for the ORC table (3.3 times the text file size)? How is rawDataSize calculated in this case? Is it the sum of the in-memory sizes of each column's datatype, multiplied by numRows?

2. In Hive query plans, the estimated data size of the tables in each phase (map and reduce) is equal to the rawDataSize, and the number of reducers is calculated from this size (at least in Tez, though not in MR). Isn't this wrong? Shouldn't it pick totalSize instead? Is there a way to force Hive/Tez to use totalSize in query plans, or at least when calculating the number of reducers?

Thanks in advance.

Cheers,
Lalitha