Hi Slava,

We would be interested in reviewing your patch.  Can you please provide more 
details?

Is there any other way to disable the partition creation step?

Thanks,
Chris

On 4/13/15 10:59 PM, Slava Markeyev wrote:
This is something I've encountered when doing ETL with Hive and having it 
create tens of thousands of partitions. The issue
is that each partition needs to be added to the metastore, and that is an expensive 
operation to perform. My workaround was
adding a flag to Hive that optionally disables the metastore partition creation 
step. This may not be a solution for
everyone, as the table then has no partitions and you would have to run MSCK 
REPAIR, but depending on your use case you
may just want the data in HDFS.
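
For what it's worth, the follow-up once the files are in HDFS is just the usual repair 
command; the table name below is only a placeholder:

  -- register the partitions that already exist on HDFS (placeholder table name)
  MSCK REPAIR TABLE my_partitioned_table;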

If there is interest in having this be an option I'll make a ticket and submit 
the patch.

-Slava

On Mon, Apr 13, 2015 at 10:40 PM, Xu, Cheng A <cheng.a...@intel.com> wrote:

    Hi Tianqi,

    Can you attach hive.log for more detailed information?

    +Sergio

    Yours,

    Ferdinand Xu

    From: Tianqi Tong [mailto:tt...@brightedge.com]
    Sent: Friday, April 10, 2015 1:34 AM
    To: user@hive.apache.org
    Subject: [Hive] Slow Loading Data Process with Parquet over 30k Partitions

    Hello Hive,

    I'm a developer using Hive to process TB-level data, and I'm having some 
difficulty loading the data into a table.

    I have 2 tables now:

    -- table_1:
    CREATE EXTERNAL TABLE `table_1`(
       `keyword` string,
       `domain` string,
       `url` string
    )
    PARTITIONED BY (yearmonth INT, partition1 STRING)
    STORED AS RCFILE;

    -- table_2:
    CREATE EXTERNAL TABLE `table_2`(
       `keyword` string,
       `domain` string,
       `url` string
    )
    PARTITIONED BY (yearmonth INT, partition2 STRING)
    STORED AS PARQUET;

    I'm doing an INSERT OVERWRITE into table_2 from a SELECT on table_1 with 
dynamic partitioning, and the number of
    partitions grows dramatically from 1500 to 40k (because I want to partition 
the data by something else).
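
    For reference, the statement is roughly of this shape (a minimal sketch; 
substr(domain, 1, 1) below is only a stand-in for however partition2 is really derived):

    -- sketch only: the last SELECT column feeds the dynamic partition2 value
    INSERT OVERWRITE TABLE table_2 PARTITION (yearmonth, partition2)
    SELECT keyword, domain, url, yearmonth, substr(domain, 1, 1) AS partition2
    FROM table_1;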

    The MapReduce job itself was fine.

    Somehow the process got stuck at "Loading data to table default.table_2 
(yearmonth=null, domain_prefix=null)", and
    I've been waiting for hours.

    Is this expected when we have 40k partitions?

    --------------------------------------------------------------

    For reference, here are the parameters that I used:

    export HADOOP_HEAPSIZE=16384
    set PARQUET_FILE_SIZE=268435456;
    set parquet.block.size=268435456;
    set dfs.blocksize=268435456;
    set parquet.compression=SNAPPY;
    SET hive.exec.dynamic.partition.mode=nonstrict;
    SET hive.exec.max.dynamic.partitions=500000;
    SET hive.exec.max.dynamic.partitions.pernode=50000;
    SET hive.exec.max.created.files=1000000;

    Thank you very much!

    Tianqi Tong




--

Slava Markeyev | Engineering | Upsight

