That is too many partitions. Way too much overhead in anything that has that many partitions.
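If it helps, here is a rough sketch of one way to keep the partition count down: partition only on the coarse column and bucket on the fine-grained one. The bucket column and bucket count below are guesses for illustration, not something taken from this thread:

-- illustrative only: partition on yearmonth, bucket instead of partitioning on the fine column
CREATE EXTERNAL TABLE `table_2_bucketed`(
  `keyword` string,
  `domain` string,
  `url` string
)
PARTITIONED BY (yearmonth INT)
CLUSTERED BY (domain) INTO 64 BUCKETS   -- bucket column and count are assumptions
STORED AS Parquet;

-- on older Hive versions, bucketed inserts also need this
SET hive.enforce.bucketing=true;

Bucketing keeps one metastore entry per yearmonth while still splitting the data into a bounded number of files, so the metastore is never asked to track tens of thousands of partitions.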
On Tue, Apr 14, 2015 at 12:53 PM, Tianqi Tong <tt...@brightedge.com> wrote:

> Hi Slava and Ferdinand,
>
> Thanks for the reply! Later, when I was looking at hive.log, I found that Hive was indeed calculating the partition stats, and the log looks like:
>
> ….
> 2015-04-14 09:38:21,146 WARN [main]: hive.log (MetaStoreUtils.java:updatePartitionStatsFast(296)) - Updating partition stats fast for: parquet_table
> 2015-04-14 09:38:21,147 WARN [main]: hive.log (MetaStoreUtils.java:updatePartitionStatsFast(299)) - Updated size to 5533480
> 2015-04-14 09:38:44,511 WARN [main]: hive.log (MetaStoreUtils.java:updatePartitionStatsFast(296)) - Updating partition stats fast for: parquet_table
> 2015-04-14 09:38:44,512 WARN [main]: hive.log (MetaStoreUtils.java:updatePartitionStatsFast(299)) - Updated size to 66246
> 2015-04-14 09:39:07,554 WARN [main]: hive.log (MetaStoreUtils.java:updatePartitionStatsFast(296)) - Updating partition stats fast for: parquet_table
> 2015-04-14 09:39:07,555 WARN [main]: hive.log (MetaStoreUtils.java:updatePartitionStatsFast(299)) - Updated size to 418925
> ….
>
> One interesting thing is that it's getting slower and slower. Right after I launched the job, it took less than 1s to calculate the stats for one partition; now it's taking 20+ seconds for each one.
>
> I tried hive.stats.autogather=false, but somehow it didn't seem to work. I also ended up hard-coding a small change into the Hive source code.
>
> In my case, I have around 40000 partitions with one file (varying from 1M to 1G) in each of them. It's now been 4 days and the first job I launched is still not done, stuck on the partition stats.
>
> Thanks,
> Tianqi Tong
>
> *From:* Slava Markeyev [mailto:slava.marke...@upsight.com]
> *Sent:* Monday, April 13, 2015 11:00 PM
> *To:* user@hive.apache.org
> *Cc:* Sergio Pena
> *Subject:* Re: [Hive] Slow Loading Data Process with Parquet over 30k Partitions
>
> This is something I've encountered when doing ETL with Hive and having it create tens of thousands of partitions. The issue is that each partition needs to be added to the metastore, and that is an expensive operation to perform. My workaround was adding a flag to Hive that optionally disables the metastore partition creation step. This may not be a solution for everyone, since the table then has no partitions and you would have to run msck repair, but depending on your use case you may just want the data in HDFS.
>
> If there is interest in having this be an option, I'll make a ticket and submit the patch.
>
> -Slava
>
> On Mon, Apr 13, 2015 at 10:40 PM, Xu, Cheng A <cheng.a...@intel.com> wrote:
>
> Hi Tianqi,
>
> Can you attach hive.log for more detailed information?
>
> +Sergio
>
> Yours,
> Ferdinand Xu
>
> *From:* Tianqi Tong [mailto:tt...@brightedge.com]
> *Sent:* Friday, April 10, 2015 1:34 AM
> *To:* user@hive.apache.org
> *Subject:* [Hive] Slow Loading Data Process with Parquet over 30k Partitions
>
> Hello Hive,
>
> I'm a developer using Hive to process TB-level data, and I'm having some difficulty loading the data into the table.
>
> I have 2 tables now:
>
> -- table_1:
> CREATE EXTERNAL TABLE `table_1`(
>   `keyword` string,
>   `domain` string,
>   `url` string
> )
> PARTITIONED BY (yearmonth INT, partition1 STRING)
> STORED AS RCfile
>
> -- table_2:
> CREATE EXTERNAL TABLE `table_2`(
>   `keyword` string,
>   `domain` string,
>   `url` string
> )
> PARTITIONED BY (yearmonth INT, partition2 STRING)
> STORED AS Parquet
>
> I'm doing an INSERT OVERWRITE into table_2 from a SELECT on table_1 with dynamic partitioning, and the number of partitions grows dramatically from 1500 to 40k (because I want to partition on something else).
>
> The MapReduce job itself was fine.
>
> Somehow the process got stuck at "Loading data to table default.table_2 (yearmonth=null, domain_prefix=null)", and I've been waiting for hours.
>
> Is this expected when we have 40k partitions?
>
> --------------------------------------------------------------
> Refs - Here are the parameters that I used:
>
> export HADOOP_HEAPSIZE=16384
>
> set PARQUET_FILE_SIZE=268435456;
> set parquet.block.size=268435456;
> set dfs.blocksize=268435456;
> set parquet.compression=SNAPPY;
> SET hive.exec.dynamic.partition.mode=nonstrict;
> SET hive.exec.max.dynamic.partitions=500000;
> SET hive.exec.max.dynamic.partitions.pernode=50000;
> SET hive.exec.max.created.files=1000000;
>
> Thank you very much!
> Tianqi Tong
>
> --
> Slava Markeyev | Engineering | Upsight
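The INSERT OVERWRITE statement itself is not shown in the thread; a minimal sketch of what a dynamic-partition insert from table_1 into table_2 might look like, with the partition2 expression left as a placeholder:

-- sketch only; the real query and the partition2 expression are not given in the thread
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE table_2 PARTITION (yearmonth, partition2)
SELECT
  keyword,
  domain,
  url,
  yearmonth,
  domain AS partition2   -- placeholder: whatever the new partitioning column actually is
FROM table_1;

The dynamic partition columns (yearmonth, partition2) have to be the last columns of the SELECT, in the same order as in the PARTITION clause.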
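And a sketch of the workaround discussed above: turn off automatic stats gathering for the load, and if the data ends up in HDFS without the partitions being registered in the metastore, recover them afterwards. These are standard Hive settings and commands, though whether they help here depends on the Hive version in use:

-- disable automatic stats collection during the insert
SET hive.stats.autogather=false;

-- if partition directories exist in HDFS but are missing from the metastore,
-- register them in one pass instead of adding them one by one:
MSCK REPAIR TABLE table_2;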