[ https://issues.apache.org/jira/browse/HIVE-10149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergio Peña updated HIVE-10149:
-------------------------------
    Description: 
Hive can run into OOM (Out Of Memory) errors when writing many dynamic 
partitions to Parquet, because it keeps too many files open at once and 
Parquet buffers an entire row group of data in memory for each open file. To 
avoid this for ORC, HIVE-6455 shuffles the data by partition so that only one 
file is open at a time. We need to extend this support to Parquet, and 
possibly to the MR and Spark planners.
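
For scale: each open Parquet writer buffers up to one row group in memory 
(controlled by {{parquet.block.size}}, roughly 128 MB by default), so a single 
map task that keeps N partition files open can need on the order of N x 128 MB 
of heap. One blunt mitigation, shown below only as an illustration with an 
assumed value and not as the fix this ticket asks for, is to shrink the row 
group size:

{code}
-- Illustrative only: a smaller row group means each open writer buffers less
-- data, at the cost of more, smaller row groups in the output files.
set parquet.block.size=33554432;   -- 32 MB instead of the ~128 MB default
{code}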

Steps to reproduce:

1. Create a table and load some data that contains a very large number of 
partition values (the file {{data.txt}} attached to this ticket).

{code}
hive> create table t1_stage(id bigint, rdate string) row format delimited fields terminated by ' ';

hive> load data local inpath 'data.txt' into table t1_stage;
{code}
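
Optionally, before running the insert, confirm how many partitions the staging 
data will fan out into (a hypothetical sanity check, not part of the original 
report):

{code}
-- Every distinct rdate becomes one dynamic partition, i.e. one Parquet
-- writer that a task writing that partition must keep open.
hive> select count(distinct rdate) from t1_stage;
{code}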

2. Create a Parquet table and insert the partitioned data from the t1_stage table.

{noformat}
hive> set hive.exec.dynamic.partition.mode=nonstrict;

hive> create table t1_part(id bigint) partitioned by (rdate string) stored as parquet;

hive> insert overwrite table t1_part partition(rdate) select * from t1_stage;
Query ID = sergio_20150330163713_db3afe74-d1c7-4f0d-a8f1-f2137ddb64a4
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1427748520315_0006, Tracking URL = http://victory:8088/proxy/application_1427748520315_0006/
Kill Command = /opt/local/hadoop/bin/hadoop job  -kill job_1427748520315_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2015-03-30 16:37:19,065 Stage-1 map = 0%,  reduce = 0%
2015-03-30 16:37:43,947 Stage-1 map = 100%,  reduce = 0%
Ended Job = job_1427748520315_0006 with errors
Error during job, obtaining debugging information...
Examining task ID: task_1427748520315_0006_m_000000 (and more) from job job_1427748520315_0006

Task with the most failures(4): 
-----
Task ID:
  task_1427748520315_0006_m_000000

URL:
  http://0.0.0.0:8088/taskdetails.jsp?jobid=job_1427748520315_0006&tipid=task_1427748520315_0006_m_000000
-----
Diagnostic Messages for this Task:
Error: Java heap space

FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
{noformat}
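
The run above uses a map-only plan ("Number of reduce tasks is set to 0"), so 
the single map task opens a Parquet writer, and therefore a row-group buffer, 
for every partition it encounters until the heap is exhausted. As an interim 
workaround (a sketch of the idea only, not the fix this ticket asks for), the 
insert can be forced through a shuffle on the partition column so that each 
reducer handles only a fraction of the partitions; this is the same idea that 
the {{hive.optimize.sort.dynamic.partition}} optimization from HIVE-6455 
automates for ORC:

{code}
set hive.exec.dynamic.partition.mode=nonstrict;

-- DISTRIBUTE BY sends all rows of a given rdate to the same reducer, so each
-- reducer keeps far fewer Parquet writers (and row-group buffers) open than a
-- single map task writing every partition would.
insert overwrite table t1_part partition(rdate)
select id, rdate from t1_stage
distribute by rdate;
{code}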


> Shuffle Hive data before storing in Parquet
> -------------------------------------------
>
>                 Key: HIVE-10149
>                 URL: https://issues.apache.org/jira/browse/HIVE-10149
>             Project: Hive
>          Issue Type: Improvement
>    Affects Versions: 1.1.0
>            Reporter: Sergio Peña
>         Attachments: data.txt
>
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
