I've created HIVE-10385 and attached a patch. Unit tests to come. -Slava
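[Editor's note: the workaround discussed in the thread below — writing partition data straight to HDFS without registering it in the metastore, then repairing the table afterwards — can be sketched in HiveQL roughly as follows. The table name is illustrative, taken from the example later in the thread.]

```sql
-- After an ETL job has written partition directories directly under the
-- table's HDFS location (e.g. .../table_2/yearmonth=201504/partition2=xyz/),
-- the metastore has no record of them and queries will see no data.
-- MSCK REPAIR TABLE scans the table's location and registers all missing
-- partitions in bulk, instead of one metastore call per partition:
MSCK REPAIR TABLE table_2;

-- Confirm the partitions are now visible to the metastore:
SHOW PARTITIONS table_2;
```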
On Fri, Apr 17, 2015 at 1:34 PM, Chris Roblee <chr...@unity3d.com> wrote:

> Hi Slava,
>
> We would be interested in reviewing your patch. Can you please provide
> more details?
>
> Is there any other way to disable the partition creation step?
>
> Thanks,
> Chris
>
> On 4/13/15 10:59 PM, Slava Markeyev wrote:
>
>> This is something I've encountered when doing ETL with Hive and having
>> it create tens of thousands of partitions. The issue is that each
>> partition needs to be added to the metastore, and this is an expensive
>> operation to perform. My workaround was adding a flag to Hive that
>> optionally disables the metastore partition-creation step. This may not
>> be a solution for everyone, since the table then has no partitions and
>> you would have to run MSCK REPAIR, but depending on your use case you
>> may just want the data in HDFS.
>>
>> If there is interest in having this be an option, I'll make a ticket
>> and submit the patch.
>>
>> -Slava
>>
>> On Mon, Apr 13, 2015 at 10:40 PM, Xu, Cheng A <cheng.a...@intel.com> wrote:
>>
>> Hi Tianqi,
>>
>> Can you attach hive.log for more detailed information?
>>
>> +Sergio
>>
>> Yours,
>> Ferdinand Xu
>>
>> *From:* Tianqi Tong [mailto:tt...@brightedge.com]
>> *Sent:* Friday, April 10, 2015 1:34 AM
>> *To:* user@hive.apache.org
>> *Subject:* [Hive] Slow Loading Data Process with Parquet over 30k Partitions
>>
>> Hello Hive,
>>
>> I'm a developer using Hive to process TB-level data, and I'm having
>> some difficulty loading the data into a table.
>>
>> I have 2 tables now:
>>
>> -- table_1:
>> CREATE EXTERNAL TABLE `table_1` (
>>   `keyword` string,
>>   `domain` string,
>>   `url` string
>> )
>> PARTITIONED BY (yearmonth INT, partition1 STRING)
>> STORED AS RCFILE;
>>
>> -- table_2:
>> CREATE EXTERNAL TABLE `table_2` (
>>   `keyword` string,
>>   `domain` string,
>>   `url` string
>> )
>> PARTITIONED BY (yearmonth INT, partition2 STRING)
>> STORED AS PARQUET;
>>
>> I'm doing an INSERT OVERWRITE into table_2 from a SELECT on table_1
>> with dynamic partitioning, and the number of partitions grows
>> dramatically from 1,500 to 40k (because I want to use something else as
>> the partitioning column).
>>
>> The MapReduce job was fine. Somehow the process got stuck at
>> "Loading data to table default.table_2 (yearmonth=null, domain_prefix=null)",
>> and I've been waiting for hours.
>>
>> Is this expected when we have 40k partitions?
>>
>> --------------------------------------------------------------
>> Refs - here are the parameters that I used:
>>
>> export HADOOP_HEAPSIZE=16384
>> set PARQUET_FILE_SIZE=268435456;
>> set parquet.block.size=268435456;
>> set dfs.blocksize=268435456;
>> set parquet.compression=SNAPPY;
>> SET hive.exec.dynamic.partition.mode=nonstrict;
>> SET hive.exec.max.dynamic.partitions=500000;
>> SET hive.exec.max.dynamic.partitions.pernode=50000;
>> SET hive.exec.max.created.files=1000000;
>>
>> Thank you very much!
>> Tianqi Tong
>>
>> --
>> Slava Markeyev | Engineering | Upsight

--
Slava Markeyev | Engineering | Upsight
<http://www.linkedin.com/in/slavamarkeyev>
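[Editor's note: a minimal sketch of the dynamic-partition INSERT the thread describes, using the column and partition names from the DDL above. The exact SELECT list is an assumption; Tianqi's actual query repartitions on a different column, hence the jump from 1,500 to 40k partitions.]

```sql
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- With dynamic partitioning, the trailing columns of the SELECT determine
-- the target partition of each row. With ~40k distinct
-- (yearmonth, partition2) values, the MapReduce phase finishes quickly,
-- but the final "Loading data to table" phase must then add each new
-- partition to the metastore individually -- the slow step this thread
-- is about.
INSERT OVERWRITE TABLE table_2 PARTITION (yearmonth, partition2)
SELECT keyword, domain, url, yearmonth, partition1 AS partition2
FROM table_1;
```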