Please find answers inline. On Jun 23, 2016, at 3:49 PM, Lalitha MV <lalitham...@gmail.com> wrote:
> Hi,
>
> I am using Hadoop 2.7.2, Tez 0.8.3 and Hive 2.0.1.
>
> I created a Hive table from a text file of ~141 MB. show tblproperties of this table (textfile):
>
>     numFiles      1
>     numRows       1000000
>     rawDataSize   141869803
>     totalSize     142869803
>
> I then created a Hive table with ORC compression from the above table. The compressed file size is ~50 MB. show tblproperties for the new table (orc):
>
>     numFiles      1
>     numRows       1000000
>     rawDataSize   471000000
>     totalSize     50444668
>
> I had two questions regarding this:
>
> 1. Why is the rawDataSize so high for the ORC table (3.3 times the text file size)? How is the rawDataSize calculated in this case? Is it the sum of each column's datatype size, multiplied by numRows?

Yes, that is correct: rawDataSize = sum of the column datatype sizes * numRows. (A quick check against your numbers is at the end of this mail.)

> 2. In Hive query plans, the estimated data size of the tables in each phase (map and reduce) is equal to the rawDataSize, and the number of reducers gets calculated from this size (at least in Tez; not in the MR case). Isn't this wrong; shouldn't it pick the totalSize instead? Is there a way to force Hive/Tez to pick the totalSize in query plans, or at least while calculating the number of reducers?

Unlike some lazy text formats, the row/column vectors returned by ORC are eagerly deserialized, and ORC compresses the data by default. The on-disk representation (totalSize) is therefore not a direct reflection of how the data is processed in memory: because of encoding and compression, the on-disk representation is far smaller than the in-memory representation used by the operator pipeline. That is why raw data size is a better metric for reducer estimation than on-disk file size. If the raw data size does not exist, the optimizer falls back to totalSize.

Using totalSize for reducer estimation can badly underestimate the number of reducers required for compressed tables. Using raw data size may instead overestimate the number of reducers, but Tez offsets this with its auto reducer parallelism feature (hive.tez.auto.reducer.parallelism), which can downsize the number of reducers based on the bytes actually emitted by the previous map stage (the relevant settings are sketched at the end of this mail).

> Thanks in advance.
>
> Cheers,
> Lalitha
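Working through the numbers quoted above as a quick sanity check (plain arithmetic on the figures in this thread):

    rawDataSize / numRows     = 471000000 / 1000000 = 471 bytes per row (estimated in-memory)
    ORC totalSize / numRows   = 50444668  / 1000000 = ~50 bytes per row (on disk, compressed)
    text totalSize / numRows  = 142869803 / 1000000 = ~143 bytes per row (on disk, uncompressed)

So the estimated in-memory row width is ~3.3x the on-disk text width and ~9x the on-disk ORC width, which is exactly the gap the reducer estimate has to account for.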
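The settings involved, as a sketch of where to look rather than a recommendation (these are existing Hive configuration properties, but the values shown are only illustrative):

    -- let Tez downsize the reducer count at runtime from the bytes actually emitted
    SET hive.tez.auto.reducer.parallelism=true;
    -- target data size per reducer used for the initial estimate
    SET hive.exec.reducers.bytes.per.reducer=268435456;
    -- scaling factors that auto reducer parallelism applies to the initial estimate
    SET hive.tez.min.partition.factor=0.25;
    SET hive.tez.max.partition.factor=2.0;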
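And for anyone who wants to reproduce the comparison, a minimal HiveQL sketch; the table and column names are hypothetical and the schema is only for illustration:

    -- hypothetical text-format source table
    CREATE TABLE t_text (id BIGINT, payload STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      STORED AS TEXTFILE;

    -- ORC copy of the same data (picks up ORC's default compression)
    CREATE TABLE t_orc STORED AS ORC AS SELECT * FROM t_text;

    -- recompute basic stats, then inspect numRows/rawDataSize/totalSize
    ANALYZE TABLE t_orc COMPUTE STATISTICS;
    SHOW TBLPROPERTIES t_orc;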