I've created HIVE-10385 and attached a patch. Unit tests to come.

-Slava

On Fri, Apr 17, 2015 at 1:34 PM, Chris Roblee <chr...@unity3d.com> wrote:

> Hi Slava,
>
> We would be interested in reviewing your patch.  Can you please provide
> more details?
>
> Is there any other way to disable the partition creation step?
>
> Thanks,
> Chris
>
> On 4/13/15 10:59 PM, Slava Markeyev wrote:
>
>> This is something I've encountered when doing ETL with Hive and having it
>> create tens of thousands of partitions. The issue is that each partition
>> needs to be added to the metastore, which is an expensive operation. My
>> workaround was adding a flag to Hive that optionally disables the metastore
>> partition creation step. This may not be a solution for everyone, since the
>> table then has no partitions and you would have to run MSCK REPAIR, but
>> depending on your use case you may just want the data in HDFS.
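>>
>> As an aside, a table populated this way can have its partitions
>> registered afterwards from the directories already sitting in HDFS (a
>> sketch, using table_2 from the thread below as the example):
>>
>>     MSCK REPAIR TABLE table_2;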
>>
>> If there is interest in having this be an option I'll make a ticket and
>> submit the patch.
>>
>> -Slava
>>
>> On Mon, Apr 13, 2015 at 10:40 PM, Xu, Cheng A <cheng.a...@intel.com
>> <mailto:cheng.a...@intel.com>> wrote:
>>
>>     Hi Tianqi,
>>
>>     Can you attach hive.log as more detailed information?
>>
>>     +Sergio
>>
>>     Yours,
>>
>>     Ferdinand Xu
>>
>>     *From:* Tianqi Tong [mailto:tt...@brightedge.com]
>>     *Sent:* Friday, April 10, 2015 1:34 AM
>>     *To:* user@hive.apache.org
>>     *Subject:* [Hive] Slow Loading Data Process with Parquet over 30k
>>     Partitions
>>
>>     Hello Hive,
>>
>>     I'm a developer using Hive to process TB-level data, and I'm having
>>     some difficulty loading data into a table.
>>
>>     I have 2 tables now:
>>
>>
>>     -- table_1:
>>     CREATE EXTERNAL TABLE `table_1` (
>>        `keyword` string,
>>        `domain` string,
>>        `url` string
>>        )
>>     PARTITIONED BY (yearmonth INT, partition1 STRING)
>>     STORED AS RCFILE;
>>
>>     -- table_2:
>>     CREATE EXTERNAL TABLE `table_2` (
>>        `keyword` string,
>>        `domain` string,
>>        `url` string
>>        )
>>     PARTITIONED BY (yearmonth INT, partition2 STRING)
>>     STORED AS PARQUET;
>>
>>     I'm doing an INSERT OVERWRITE into table_2 from a SELECT on table_1
>>     with dynamic partitioning, and the number of partitions grows
>>     dramatically from 1,500 to 40k (because I want to partition on
>>     something else).
>>
>>     The MapReduce job itself was fine. Somehow the process got stuck at
>>     "Loading data to table default.table_2 (yearmonth=null,
>>     domain_prefix=null)", and I've been waiting for hours.
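>>
>>     (A sketch of the statement shape I mean; the exact column list is
>>     just illustrative:)
>>
>>     INSERT OVERWRITE TABLE table_2 PARTITION (yearmonth, partition2)
>>     SELECT keyword, domain, url, yearmonth, partition2
>>     FROM table_1;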
>>
>>
>>     Is this expected when we have 40k partitions?
>>
>>     --------------------------------------------------------------
>>
>>     Refs - Here are the parameters that I used:
>>
>>     export HADOOP_HEAPSIZE=16384
>>     set PARQUET_FILE_SIZE=268435456;
>>     set parquet.block.size=268435456;
>>     set dfs.blocksize=268435456;
>>     set parquet.compression=SNAPPY;
>>     SET hive.exec.dynamic.partition.mode=nonstrict;
>>     SET hive.exec.max.dynamic.partitions=500000;
>>     SET hive.exec.max.dynamic.partitions.pernode=50000;
>>     SET hive.exec.max.created.files=1000000;
>>
>>     Thank you very much!
>>
>>     Tianqi Tong
>>
>>
>>
>>
>> --
>>
>> Slava Markeyev | Engineering | Upsight
>>
>>
>


-- 

Slava Markeyev | Engineering | Upsight
<http://www.linkedin.com/in/slavamarkeyev>
