[ https://issues.apache.org/jira/browse/HIVE-10149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Aihua Xu reassigned HIVE-10149:
-------------------------------

    Assignee: Aihua Xu

> Shuffle Hive data before storing in Parquet
> -------------------------------------------
>
>                 Key: HIVE-10149
>                 URL: https://issues.apache.org/jira/browse/HIVE-10149
>             Project: Hive
>          Issue Type: Improvement
>    Affects Versions: 1.1.0
>            Reporter: Sergio Peña
>            Assignee: Aihua Xu
>         Attachments: data.txt
>
>
> Hive can run into OOM (Out Of Memory) exceptions when writing many dynamic
> partitions to parquet because it creates too many open files at once and
> Parquet buffers an entire row group of data in memory for each open file. To
> avoid this in ORC, HIVE-6455 shuffles data for each partition so only one
> file is open at a time. We need to extend this support to Parquet and
> possibly the MR and Spark planners.
> Steps to reproduce:
> 1. Create a table and load some data that contains many many partitions (file
> {{data.txt}} attached on this ticket).
> {code}
> hive> create table t1_stage(id bigint, rdate string) row format delimited
> fields terminated by ' ';
> hive> load data local inpath 'data.txt' into table t1_stage;
> {code}
> 2. Create a Parquet table, and insert partitioned data from the t1_stage
> table.
> {noformat}
> hive> set hive.exec.dynamic.partition.mode=nonstrict;
> hive> create table t1_part(id bigint) partitioned by (rdate string) stored as
> parquet;
> hive> insert overwrite table t1_part partition(rdate) select * from t1_stage;
> Query ID = sergio_20150330163713_db3afe74-d1c7-4f0d-a8f1-f2137ddb64a4
> Total jobs = 3
> Launching Job 1 out of 3
> Number of reduce tasks is set to 0 since there's no reduce operator
> Starting Job = job_1427748520315_0006, Tracking URL =
> http://victory:8088/proxy/application_1427748520315_0006/
> Kill Command = /opt/local/hadoop/bin/hadoop job -kill job_1427748520315_0006
> Hadoop job information for Stage-1: number of mappers: 1; number of reducers:
> 0
> 2015-03-30 16:37:19,065 Stage-1 map = 0%, reduce = 0%
> 2015-03-30 16:37:43,947 Stage-1 map = 100%, reduce = 0%
> Ended Job = job_1427748520315_0006 with errors
> Error during job, obtaining debugging information...
> Examining task ID: task_1427748520315_0006_m_000000 (and more) from job
> job_1427748520315_0006
> Task with the most failures(4):
> -----
> Task ID:
> task_1427748520315_0006_m_000000
> URL:
>
>
> -----
> Diagnostic Messages for this Task:
> Error: Java heap space
> FAILED: Execution Error, return code 2 from
> org.apache.hadoop.hive.ql.exec.mr.MapRedTask
> MapReduce Jobs Launched:
> Stage-Stage-1: Map: 1 HDFS Read: 0 HDFS Write: 0 FAIL
> Total MapReduce CPU Time Spent: 0 msec
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
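
The memory argument in the description (one buffered row group per open file, versus one open file at a time once rows are shuffled by partition) can be illustrated with a small simulation. This is only a sketch, not Hive or Parquet code: the function `peak_open_writers`, the synthetic rows, and the 128 MB row-group size are all assumptions made for illustration.

```python
# Sketch only: models how many partition writers are open at once.
# ASSUMED_ROW_GROUP_BYTES is an illustrative figure, not taken from the ticket.
ASSUMED_ROW_GROUP_BYTES = 128 * 1024 * 1024


def peak_open_writers(rows, close_on_change):
    """Return the maximum number of partition writers open at the same time.

    rows            -- iterable of (id, partition_value) tuples
    close_on_change -- if True, close the previous writer when the partition
                       value changes (only safe when rows are grouped by
                       partition, as after a shuffle/sort on the partition key)
    """
    open_writers = set()
    peak = 0
    previous = None
    for _id, partition in rows:
        if close_on_change and previous is not None and partition != previous:
            open_writers.discard(previous)  # grouped input: partition is done
        open_writers.add(partition)
        peak = max(peak, len(open_writers))
        previous = partition
    return peak


# Hypothetical input: 1000 rows spread round-robin over 100 partitions.
rows = [(i, "part-%03d" % (i % 100)) for i in range(1000)]

# Without a shuffle, every partition's writer stays open until the task ends.
unsorted_peak = peak_open_writers(rows, close_on_change=False)

# With rows grouped by partition, each writer can be closed before the next opens.
shuffled_peak = peak_open_writers(sorted(rows, key=lambda r: r[1]),
                                  close_on_change=True)

print("unsorted:", unsorted_peak, "writers,",
      unsorted_peak * ASSUMED_ROW_GROUP_BYTES // (1024 * 1024), "MB buffered")
print("shuffled:", shuffled_peak, "writers,",
      shuffled_peak * ASSUMED_ROW_GROUP_BYTES // (1024 * 1024), "MB buffered")
```

Under these assumptions the buffered memory grows linearly with the number of distinct partitions a single task sees, which is consistent with the single mapper in the log above failing with "Java heap space".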