Thanks for the responses, Prasanth and Mich. They were helpful. @Mich:
Output of desc formatted:

1. ORC table:

Table Parameters:
        COLUMN_STATS_ACCURATE   {"BASIC_STATS":"true"}
        numFiles                1
        numRows                 1000000
        rawDataSize             471000000
        totalSize               50444668

# Storage Information
SerDe Library:          org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat:            org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
        serialization.format    1

2. Textfile table:

Table Parameters:
        COLUMN_STATS_ACCURATE   {"BASIC_STATS":"true"}
        last_modified_by        hadoop
        last_modified_time      1466631967
        numFiles                1
        numRows                 1000000
        rawDataSize             141869803
        totalSize               142869803

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
        field.delim             |
        serialization.format    |

--
Lalitha

On Thu, Jun 23, 2016 at 4:14 PM, Prasanth Jayachandran <
pjayachand...@hortonworks.com> wrote:

> Please find answers inline.
>
> On Jun 23, 2016, at 3:49 PM, Lalitha MV <lalitham...@gmail.com> wrote:
>
> Hi,
>
> I am using Hadoop 2.7.2, Tez 0.8.3 and Hive 2.0.1.
>
> I created a Hive table from a text file of size ~141 MB.
> show tblproperties of this table (textfile):
>     numFiles      1
>     numRows       1000000
>     rawDataSize   141869803
>     totalSize     142869803
>
> I then created a Hive table with ORC compression from the table above.
> The compressed file size is ~50 MB.
>
> show tblproperties for the new table (orc):
>     numFiles      1
>     numRows       1000000
>     rawDataSize   471000000
>     totalSize     50444668
>
> I had two sets of questions regarding this:
>
> 1. Why is the rawDataSize so high for the ORC table (3.3 times the
> text file size)? How is rawDataSize calculated in this case? (Is it
> the sum of the datatype sizes of the columns, multiplied by numRows?)
>
> Yes, that is correct. Raw data size = datatype size * numRows.
>
> 2. In Hive query plans, the estimated data size of the tables in each
> phase (map and reduce) equals the rawDataSize, and the number of
> reducers is calculated from this size (at least in Tez, though not in
> the MR case). Isn't this wrong; shouldn't it use the totalSize
> instead? Is there a way to force Hive/Tez to use totalSize in query
> plans, or at least when calculating the number of reducers?
>
> Unlike some lazy text formats, the row/column vectors returned by ORC
> are eagerly deserialized. ORC also compresses the data by default, so
> the on-disk representation (totalSize) is not a direct reflection of
> how the data is processed in memory. Because of encoding and
> compression, the on-disk representation is far smaller than the
> in-memory representation used by the operator pipeline. That is why
> raw data size is a better metric for reducer estimation than on-disk
> file size. If the raw data size does not exist, the optimizer falls
> back to totalSize. Using totalSize for reducer estimation can badly
> underestimate the number of reducers required for compressed tables.
> On the other hand, using raw data size may overestimate the number of
> reducers, but Tez offsets this with the auto reducer parallelism
> feature (hive.tez.auto.reducer.parallelism), which can downsize the
> number of reducers based on the bytes emitted by the previous map
> stage.
>
> Thanks in advance.
>
> Cheers,
> Lalitha
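
P.S. To make the arithmetic concrete: rawDataSize reflects the deserialized,
in-memory width of the data, so for this table it works out to roughly 471
bytes per row, versus ~143 bytes per row as delimited text on disk and ~50
bytes per row as compressed ORC. A quick way to eyeball it (the table name
sample_orc is just a placeholder for your own table):

SHOW TBLPROPERTIES sample_orc;
-- rawDataSize 471000000, numRows 1000000
-- 471000000 / 1000000  = 471 bytes/row in memory (deserialized)
-- 142869803 / 1000000  ≈ 143 bytes/row as delimited text on disk
-- 50444668  / 1000000  ≈  50 bytes/row as compressed ORC on disk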
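
For the reducer-count question itself, here is a minimal sketch of the knobs
Prasanth mentions, to be verified against your own Hive version (sample_orc
and some_col are placeholders, and the bytes-per-reducer value is only an
example):

-- Keep basic stats current so the optimizer has rawDataSize/numRows to use:
ANALYZE TABLE sample_orc COMPUTE STATISTICS;

-- Let Tez shrink an over-estimated reducer count at runtime, based on the
-- bytes actually emitted by the map stage:
SET hive.tez.auto.reducer.parallelism=true;

-- The per-reducer data-size target also bounds the estimate; tune it if the
-- count still looks wrong (value below is just an example, in bytes):
SET hive.exec.reducers.bytes.per.reducer=268435456;

-- Check the data size the planner assumed for each vertex:
EXPLAIN SELECT some_col, count(*) FROM sample_orc GROUP BY some_col;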