[jira] [Updated] (HIVE-16295) Add support for using Hadoop's OutputCommitter

Sahil Takiar (JIRA) Fri, 24 Mar 2017 13:23:51 -0700

     [ 
https://issues.apache.org/jira/browse/HIVE-16295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Sahil Takiar updated HIVE-16295:
--------------------------------
    Description: 
Hive doesn't integrate with Hadoop's {{OutputCommitter}}; it uses a
{{NullOutputCommitter}} and implements its own commit logic, spread across
{{FileSinkOperator}}, {{MoveTask}}, and {{Hive}}.

The Hadoop community is building an {{OutputCommitter}} that integrates with
S3Guard and does a safe, coordinated commit of data on S3 inside individual
tasks (HADOOP-13786). If Hive can integrate with this new {{OutputCommitter}},
there would be several benefits for Hive-on-S3:

* Data is only written once; directly committing data at a task level means no 
renames are necessary
* The commit is done safely, in a coordinated manner; duplicate tasks (from 
task retries or speculative execution) should not step on each other

  was:
Hive doesn't integrate with Hadoop's {{OutputCommitter}}; it uses a
{{NullOutputCommitter}} and implements its own commit logic, spread across
{{FileSinkOperator}}, {{MoveTask}}, and {{Hive}}.

The Hadoop community is building a {{OutputCommitter}} that integrates with
S3Guard and does a safe, coordinated commit of data on S3 inside individual
tasks. If Hive can integrate with this new {{OutputCommitter}}, there would be
several benefits for Hive-on-S3:

* Data is only written once; directly committing data at a task level means no 
renames are necessary
* The commit is done safely, in a coordinated manner; duplicate tasks (from 
task retries or speculative execution) should not step on each other
* Data is written within each task, so everything is done in parallel


> Add support for using Hadoop's OutputCommitter
> ----------------------------------------------
>
>                 Key: HIVE-16295
>                 URL: https://issues.apache.org/jira/browse/HIVE-16295
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>
> Hive doesn't integrate with Hadoop's {{OutputCommitter}}; it uses a
> {{NullOutputCommitter}} and implements its own commit logic, spread across
> {{FileSinkOperator}}, {{MoveTask}}, and {{Hive}}.
> The Hadoop community is building an {{OutputCommitter}} that integrates with
> S3Guard and does a safe, coordinated commit of data on S3 inside individual
> tasks (HADOOP-13786). If Hive can integrate with this new {{OutputCommitter}},
> there would be several benefits for Hive-on-S3:
> * Data is only written once; directly committing data at a task level means 
> no renames are necessary
> * The commit is done safely, in a coordinated manner; duplicate tasks (from 
> task retries or speculative execution) should not step on each other
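
For illustration only, the task-level commit described above can be modelled
with a small in-memory Java sketch. It is not Hadoop or Hive code: the class
TaskCommitSketch and its methods (write, commitTask, abortTask, visibleRows)
are hypothetical names chosen for this example, and the real committer is the
one being developed under HADOOP-13786. The sketch only models the two
properties listed above: output is staged per task attempt and published once,
with no rename step, and a duplicate attempt of the same task cannot publish a
second copy.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory model of a task-level commit; not Hadoop or Hive code. */
final class TaskCommitSketch {
    // Output staged per task attempt; invisible to readers until committed.
    private final Map<String, List<String>> pending = new HashMap<>();
    // Task id -> id of the attempt whose commit won.
    private final Map<String, String> committedAttempt = new HashMap<>();
    // Rows visible at the destination.
    private final List<String> visible = new ArrayList<>();

    private static String key(String taskId, String attemptId) {
        return taskId + "/" + attemptId;
    }

    /** Stages a row for one attempt; nothing becomes visible yet. */
    synchronized void write(String taskId, String attemptId, String row) {
        pending.computeIfAbsent(key(taskId, attemptId), k -> new ArrayList<>()).add(row);
    }

    /** Publishes the attempt's staged rows; returns false if it has none or another attempt already won. */
    synchronized boolean commitTask(String taskId, String attemptId) {
        List<String> rows = pending.remove(key(taskId, attemptId));
        if (rows == null || committedAttempt.containsKey(taskId)) {
            return false; // staged rows of a losing attempt are simply dropped
        }
        committedAttempt.put(taskId, attemptId);
        visible.addAll(rows); // published in place, no rename step
        return true;
    }

    /** Discards whatever the attempt has staged. */
    synchronized void abortTask(String taskId, String attemptId) {
        pending.remove(key(taskId, attemptId));
    }

    /** Returns a copy of the rows currently visible at the destination. */
    synchronized List<String> visibleRows() {
        return new ArrayList<>(visible);
    }
}
```

In this model, if a speculative attempt and the original attempt of the same
task both stage the same row, only the first commitTask call publishes it; the
second returns false and its staged output is discarded.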



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
