Hi Slava,

We would be interested in reviewing your patch. Can you please provide more details? Is there any other way to disable the partition creation step?

Thanks,
Chris

On 4/13/15 10:59 PM, Slava Markeyev wrote:
This is something I've encountered when doing ETL with Hive and having it create tens of thousands of partitions. The issue is that each partition needs to be added to the metastore, and this is an expensive operation to perform. My workaround was adding a flag to Hive that optionally disables the metastore partition creation step. This may not be a solution for everyone, since the table then has no partitions and you would have to run MSCK REPAIR, but depending on your use case you may just want the data in HDFS. If there is interest in having this be an option, I'll make a ticket and submit the patch.

-Slava

On Mon, Apr 13, 2015 at 10:40 PM, Xu, Cheng A <cheng.a...@intel.com> wrote:

Hi Tianqi,

Can you attach hive.log for more detailed information?

+Sergio

Yours,
Ferdinand Xu

From: Tianqi Tong [mailto:tt...@brightedge.com]
Sent: Friday, April 10, 2015 1:34 AM
To: user@hive.apache.org
Subject: [Hive] Slow Loading Data Process with Parquet over 30k Partitions

Hello Hive,

I'm a developer using Hive to process TB-level data, and I'm having some difficulty loading the data into a table.

I have 2 tables now:

-- table_1:
CREATE EXTERNAL TABLE `table_1`(
  `keyword` string,
  `domain` string,
  `url` string
)
PARTITIONED BY (yearmonth INT, partition1 STRING)
STORED AS RCfile;

-- table_2:
CREATE EXTERNAL TABLE `table_2`(
  `keyword` string,
  `domain` string,
  `url` string
)
PARTITIONED BY (yearmonth INT, partition2 STRING)
STORED AS Parquet;

I'm doing an INSERT OVERWRITE into table_2 from a SELECT on table_1 with dynamic partitioning, and the number of partitions grows dramatically from 1,500 to 40k (because I want to partition on something else). The MapReduce job was fine. Somehow the process got stuck at "Loading data to table default.table_2 (yearmonth=null, domain_prefix=null)", and I've been waiting for hours.

Is this expected when we have 40k partitions?

--------------------------------------------------------------
Refs - Here are the parameters that I used:

export HADOOP_HEAPSIZE=16384
set PARQUET_FILE_SIZE=268435456;
set parquet.block.size=268435456;
set dfs.blocksize=268435456;
set parquet.compression=SNAPPY;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions=500000;
SET hive.exec.max.dynamic.partitions.pernode=50000;
SET hive.exec.max.created.files=1000000;

Thank you very much!
Tianqi Tong

--
Slava Markeyev | Engineering | Upsight
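
Tianqi's message above describes the dynamic-partition insert but does not include the statement itself. Below is a minimal sketch of what such a load might look like, given the two table definitions in the thread; the expression used to derive partition2 is a made-up placeholder, since the actual derivation is not shown.

  -- Sketch only: hypothetical dynamic-partition load from table_1 into table_2.
  SET hive.exec.dynamic.partition=true;
  SET hive.exec.dynamic.partition.mode=nonstrict;

  INSERT OVERWRITE TABLE table_2 PARTITION (yearmonth, partition2)
  SELECT
    keyword,
    domain,
    url,
    yearmonth,
    substr(domain, 1, 3) AS partition2   -- placeholder derivation, not from the thread
  FROM table_1;

Each distinct (yearmonth, partition2) value produced by the SELECT becomes its own partition, which is why the job fans out to roughly 40k partitions and why the final "Loading data to table" step has to register each of them with the metastore one by one.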
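
Slava's workaround above skips those per-partition metastore calls, so the partitions exist only as directories in HDFS until they are registered in bulk. A minimal sketch of that follow-up step, assuming the table_2 definition from the thread (the patch and its flag are not part of stock Hive, so no flag name is shown):

  -- Scan the table location and add any missing partitions to the metastore.
  MSCK REPAIR TABLE table_2;

  -- Confirm the partitions are now visible.
  SHOW PARTITIONS table_2;

This trades many individual add-partition calls during the load for one bulk repair at the end.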