That is too many partitions. Way too much overhead in anything that has that many partitions.
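If it helps, here is a rough sketch of one way to keep the partition count down: partition only on the coarse column and bucket on the fine-grained one. The bucket column and bucket count below are guesses for illustration, not something taken from this thread:

-- illustrative only: partition on yearmonth, bucket instead of partitioning on the fine column
CREATE EXTERNAL TABLE `table_2_bucketed`(
  `keyword` string,
  `domain` string,
  `url` string
)
PARTITIONED BY (yearmonth INT)
CLUSTERED BY (domain) INTO 64 BUCKETS   -- bucket column and count are assumptions
STORED AS Parquet;

-- on older Hive versions, bucketed inserts also need this
SET hive.enforce.bucketing=true;

Bucketing keeps one metastore entry per yearmonth while still splitting the data into a bounded number of files, so the metastore is never asked to track tens of thousands of partitions.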
On Tue, Apr 14, 2015 at 12:53 PM, Tianqi Tong <tt...@brightedge.com> wrote:

> Hi Slava and Ferdinand,
>
> Thanks for the reply! Later, when I was looking at hive.log, I found that Hive was indeed calculating the partition stats, and the log looks like:
>
> ….
> 2015-04-14 09:38:21,146 WARN [main]: hive.log (MetaStoreUtils.java:updatePartitionStatsFast(296)) - Updating partition stats fast for: parquet_table
> 2015-04-14 09:38:21,147 WARN [main]: hive.log (MetaStoreUtils.java:updatePartitionStatsFast(299)) - Updated size to 5533480
> 2015-04-14 09:38:44,511 WARN [main]: hive.log (MetaStoreUtils.java:updatePartitionStatsFast(296)) - Updating partition stats fast for: parquet_table
> 2015-04-14 09:38:44,512 WARN [main]: hive.log (MetaStoreUtils.java:updatePartitionStatsFast(299)) - Updated size to 66246
> 2015-04-14 09:39:07,554 WARN [main]: hive.log (MetaStoreUtils.java:updatePartitionStatsFast(296)) - Updating partition stats fast for: parquet_table
> 2015-04-14 09:39:07,555 WARN [main]: hive.log (MetaStoreUtils.java:updatePartitionStatsFast(299)) - Updated size to 418925
> ….
>
> One interesting thing is that it's getting slower and slower. Right after I launched the job, it took less than 1s to calculate the stats for one partition; now it's taking 20+ seconds for each one.
>
> I tried hive.stats.autogather=false, but somehow it didn't seem to work. I also ended up hard-coding a small change into the Hive source code.
>
> In my case, I have around 40000 partitions with one file (varying from 1M to 1G) in each of them. It's now been 4 days and the first job I launched is still not done, stuck on the partition stats.
>
> Thanks,
> Tianqi Tong
>
> *From:* Slava Markeyev [mailto:slava.marke...@upsight.com]
> *Sent:* Monday, April 13, 2015 11:00 PM
> *To:* user@hive.apache.org
> *Cc:* Sergio Pena
> *Subject:* Re: [Hive] Slow Loading Data Process with Parquet over 30k Partitions
>
> This is something I've encountered when doing ETL with Hive and having it create tens of thousands of partitions. The issue is that each partition needs to be added to the metastore, and that is an expensive operation to perform. My workaround was adding a flag to Hive that optionally disables the metastore partition creation step. This may not be a solution for everyone, since the table then has no partitions and you would have to run msck repair, but depending on your use case you may just want the data in HDFS.
>
> If there is interest in having this be an option, I'll make a ticket and submit the patch.
>
> -Slava
>
> On Mon, Apr 13, 2015 at 10:40 PM, Xu, Cheng A <cheng.a...@intel.com> wrote:
>
> Hi Tianqi,
>
> Can you attach hive.log for more detailed information?
>
> +Sergio
>
> Yours,
> Ferdinand Xu
>
> *From:* Tianqi Tong [mailto:tt...@brightedge.com]
> *Sent:* Friday, April 10, 2015 1:34 AM
> *To:* user@hive.apache.org
> *Subject:* [Hive] Slow Loading Data Process with Parquet over 30k Partitions
>
> Hello Hive,
>
> I'm a developer using Hive to process TB-level data, and I'm having some difficulty loading the data into the table.
>
> I have 2 tables now:
>
> -- table_1:
> CREATE EXTERNAL TABLE `table_1`(
>   `keyword` string,
>   `domain` string,
>   `url` string
> )
> PARTITIONED BY (yearmonth INT, partition1 STRING)
> STORED AS RCfile
>
> -- table_2:
> CREATE EXTERNAL TABLE `table_2`(
>   `keyword` string,
>   `domain` string,
>   `url` string
> )
> PARTITIONED BY (yearmonth INT, partition2 STRING)
> STORED AS Parquet
>
> I'm doing an INSERT OVERWRITE into table_2 from a SELECT on table_1 with dynamic partitioning, and the number of partitions grows dramatically from 1500 to 40k (because I want to partition on something else).
>
> The MapReduce job itself was fine.
>
> Somehow the process got stuck at "Loading data to table default.table_2 (yearmonth=null, domain_prefix=null)", and I've been waiting for hours.
>
> Is this expected when we have 40k partitions?
>
> --------------------------------------------------------------
> Refs - Here are the parameters that I used:
>
> export HADOOP_HEAPSIZE=16384
>
> set PARQUET_FILE_SIZE=268435456;
> set parquet.block.size=268435456;
> set dfs.blocksize=268435456;
> set parquet.compression=SNAPPY;
> SET hive.exec.dynamic.partition.mode=nonstrict;
> SET hive.exec.max.dynamic.partitions=500000;
> SET hive.exec.max.dynamic.partitions.pernode=50000;
> SET hive.exec.max.created.files=1000000;
>
> Thank you very much!
> Tianqi Tong
>
> --
> Slava Markeyev | Engineering | Upsight
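The INSERT OVERWRITE statement itself is not shown in the thread; a minimal sketch of what a dynamic-partition insert from table_1 into table_2 might look like, with the partition2 expression left as a placeholder:

-- sketch only; the real query and the partition2 expression are not given in the thread
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE table_2 PARTITION (yearmonth, partition2)
SELECT
  keyword,
  domain,
  url,
  yearmonth,
  domain AS partition2   -- placeholder: whatever the new partitioning column actually is
FROM table_1;

The dynamic partition columns (yearmonth, partition2) have to be the last columns of the SELECT, in the same order as in the PARTITION clause.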
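And a sketch of the workaround discussed above: turn off automatic stats gathering for the load, and if the data ends up in HDFS without the partitions being registered in the metastore, recover them afterwards. These are standard Hive settings and commands, though whether they help here depends on the Hive version in use:

-- disable automatic stats collection during the insert
SET hive.stats.autogather=false;

-- if partition directories exist in HDFS but are missing from the metastore,
-- register them in one pass instead of adding them one by one:
MSCK REPAIR TABLE table_2;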