Thanks for the responses, Prasanth and Mich. They were helpful. @Mich:
Output of desc formatted:

1. ORC table:

Table Parameters:
        COLUMN_STATS_ACCURATE   {"BASIC_STATS":"true"}
        numFiles                1
        numRows                 1000000
        rawDataSize             471000000
        totalSize               50444668

# Storage Information
SerDe Library:          org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat:            org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
        serialization.format    1

2. Textfile table:

Table Parameters:
        COLUMN_STATS_ACCURATE   {"BASIC_STATS":"true"}
        last_modified_by        hadoop
        last_modified_time      1466631967
        numFiles                1
        numRows                 1000000
        rawDataSize             141869803
        totalSize               142869803

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
        field.delim             |
        serialization.format    |

--
Lalitha

On Thu, Jun 23, 2016 at 4:14 PM, Prasanth Jayachandran <
pjayachand...@hortonworks.com> wrote:

> Please find answers inline.
>
> On Jun 23, 2016, at 3:49 PM, Lalitha MV <lalitham...@gmail.com> wrote:
>
> Hi,
>
> I am using Hadoop 2.7.2, Tez 0.8.3 and Hive 2.0.1.
>
> I created a Hive table from a text file of size ~141 MB.
> show tblproperties of this table (textfile):
>     numFiles      1
>     numRows       1000000
>     rawDataSize   141869803
>     totalSize     142869803
>
> I then created a Hive table with ORC compression from the table above.
> The compressed file size is ~50 MB.
>
> show tblproperties for the new table (orc):
>     numFiles      1
>     numRows       1000000
>     rawDataSize   471000000
>     totalSize     50444668
>
> I had two sets of questions regarding this:
>
> 1. Why is the rawDataSize so high for the ORC table (3.3 times the
> text file size)? How is rawDataSize calculated in this case? (Is it
> the sum of the datatype sizes of the columns, multiplied by numRows?)
>
> Yes, that is correct. Raw data size = datatype size * numRows.
>
> 2. In Hive query plans, the estimated data size of the tables in each
> phase (map and reduce) equals the rawDataSize, and the number of
> reducers is calculated from this size (at least in Tez, though not in
> the MR case). Isn't this wrong; shouldn't it use the totalSize
> instead? Is there a way to force Hive/Tez to use totalSize in query
> plans, or at least when calculating the number of reducers?
>
> Unlike some lazy text formats, the row/column vectors returned by ORC
> are eagerly deserialized. ORC also compresses the data by default, so
> the on-disk representation (totalSize) is not a direct reflection of
> how the data is processed in memory. Because of encoding and
> compression, the on-disk representation is far smaller than the
> in-memory representation used by the operator pipeline. That is why
> raw data size is a better metric for reducer estimation than on-disk
> file size. If the raw data size does not exist, the optimizer falls
> back to totalSize. Using totalSize for reducer estimation can badly
> underestimate the number of reducers required for compressed tables.
> On the other hand, using raw data size may overestimate the number of
> reducers, but Tez offsets this with the auto reducer parallelism
> feature (hive.tez.auto.reducer.parallelism), which can downsize the
> number of reducers based on the bytes emitted by the previous map
> stage.
>
> Thanks in advance.
>
> Cheers,
> Lalitha
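
P.S. To make the arithmetic concrete: rawDataSize reflects the deserialized,
in-memory width of the data, so for this table it works out to roughly 471
bytes per row, versus ~143 bytes per row as delimited text on disk and ~50
bytes per row as compressed ORC. A quick way to eyeball it (the table name
sample_orc is just a placeholder for your own table):

SHOW TBLPROPERTIES sample_orc;
-- rawDataSize 471000000, numRows 1000000
-- 471000000 / 1000000  = 471 bytes/row in memory (deserialized)
-- 142869803 / 1000000  ≈ 143 bytes/row as delimited text on disk
-- 50444668  / 1000000  ≈  50 bytes/row as compressed ORC on disk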
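
For the reducer-count question itself, here is a minimal sketch of the knobs
Prasanth mentions, to be verified against your own Hive version (sample_orc
and some_col are placeholders, and the bytes-per-reducer value is only an
example):

-- Keep basic stats current so the optimizer has rawDataSize/numRows to use:
ANALYZE TABLE sample_orc COMPUTE STATISTICS;

-- Let Tez shrink an over-estimated reducer count at runtime, based on the
-- bytes actually emitted by the map stage:
SET hive.tez.auto.reducer.parallelism=true;

-- The per-reducer data-size target also bounds the estimate; tune it if the
-- count still looks wrong (value below is just an example, in bytes):
SET hive.exec.reducers.bytes.per.reducer=268435456;

-- Check the data size the planner assumed for each vertex:
EXPLAIN SELECT some_col, count(*) FROM sample_orc GROUP BY some_col;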