Hello Hive,
I'm a developer using Hive to process TB-scale data, and I'm having some
difficulty loading the data into a table.
I currently have two tables:

-- table_1:
CREATE EXTERNAL TABLE `table_1`(
  `keyword` string,
  `domain` string,
  `url` string
  )
PARTITIONED BY (yearmonth INT, partition1 STRING)
STORED AS RCFILE;

-- table_2:
CREATE EXTERNAL TABLE `table_2`(
  `keyword` string,
  `domain` string,
  `url` string
  )
PARTITIONED BY (yearmonth INT, partition2 STRING)
STORED AS PARQUET;

I'm doing an INSERT OVERWRITE into table_2 from a SELECT on table_1 with dynamic
partitioning, and the number of partitions grows dramatically from about 1,500 to 40k
(because I want to partition by a different column).
The MapReduce job itself completed fine, but then the process got stuck at
"Loading data to table default.table_2 (yearmonth=null, domain_prefix=null)",
and I've been waiting for hours.
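
For reference, the statement looks roughly like this (a simplified sketch: the column
list matches the DDL above, and deriving the new partition key from domain is just an
example, not my exact expression):

INSERT OVERWRITE TABLE table_2 PARTITION (yearmonth, partition2)
SELECT
  keyword,
  domain,
  url,
  yearmonth,                          -- keep the existing yearmonth partition value
  substr(domain, 1, 2) AS partition2  -- example derivation of the new partition key
FROM table_1;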

Is this expected when we have 40k partitions?

--------------------------------------------------------------
For reference, here are the parameters I used:
export HADOOP_HEAPSIZE=16384
set PARQUET_FILE_SIZE=268435456;
set parquet.block.size=268435456;
set dfs.blocksize=268435456;
set parquet.compression=SNAPPY;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions=500000;
SET hive.exec.max.dynamic.partitions.pernode=50000;
SET hive.exec.max.created.files=1000000;


Thank you very much!
Tianqi Tong
