Hi, I am using Hadoop 2.7.2, Tez 0.8.3 and Hive 2.0.1.
I created a Hive table from a text file of ~141 MB. show tblproperties for this table (textfile):

    numFiles     1
    numRows      1000000
    rawDataSize  141869803
    totalSize    142869803

I then created a second Hive table, with ORC compression, from the table above. The compressed file size is ~50 MB. show tblproperties for the new table (orc):

    numFiles     1
    numRows      1000000
    rawDataSize  471000000
    totalSize    50444668

I have two sets of questions about this:

1. Why is the rawDataSize so high for the ORC table (3.3 times the text file size)? How is rawDataSize calculated in this case? Is it the sum of the in-memory sizes of each column's datatype, multiplied by numRows?

2. In Hive query plans, the estimated data size of the tables in each phase (map and reduce) is equal to the rawDataSize, and the number of reducers is calculated from this size (at least in Tez, though not in MR). Isn't this wrong? Shouldn't it pick totalSize instead? Is there a way to force Hive/Tez to use totalSize in query plans, or at least when calculating the number of reducers?

Thanks in advance.

Cheers,
Lalitha