[ https://issues.apache.org/jira/browse/HIVE-16295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sahil Takiar updated HIVE-16295:
--------------------------------
    Description:
Hive doesn't integrate with Hadoop's {{OutputCommitter}}; it uses a {{NullOutputCommitter}} and its own commit logic, which is spread across {{FileSinkOperator}}, {{MoveTask}}, and {{Hive}}.

The Hadoop community is building an {{OutputCommitter}} that integrates with S3Guard and does a safe, coordinated commit of data on S3 inside individual tasks (HADOOP-13786). If Hive can integrate with this new {{OutputCommitter}}, there would be a lot of benefits to Hive-on-S3:
* Data is only written once; directly committing data at a task level means no renames are necessary
* The commit is done safely, in a coordinated manner; duplicate tasks (from task retries or speculative execution) should not step on each other

  was:
Hive doesn't have integration with Hadoop's {{OutputCommitter}}, it uses a {{NullOutputCommitter}} and uses its own commit logic spread across {{FileSinkOperator}}, {{MoveTask}}, and {{Hive}}.

The Hadoop community is building a {{OutputCommitter}} that integrates with S3Guard and does a safe, coordinate commit of data on S3 inside individual tasks.
If Hive can integrate with this new {{OutputCommitter}} there would be a lot of benefits to Hive-on-S3:
* Data is only written once; directly committing data at a task level means no renames are necessary
* The commit is done safely, in a coordinated manner; duplicate tasks (from task retries or speculative execution) should not step on each other
* Data is written within each task, so everything is done in parallel

> Add support for using Hadoop's OutputCommitter
> ----------------------------------------------
>
>                 Key: HIVE-16295
>                 URL: https://issues.apache.org/jira/browse/HIVE-16295
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
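
To make the task-level commit idea in the description concrete, here is a minimal sketch in plain Java. It is a toy in-memory model, not Hadoop's {{OutputCommitter}} API and not the design from HADOOP-13786: the class {{TaskCommitSketch}} and its methods ({{write}}, {{commitTask}}, {{abortTask}}, {{visibleRows}}) are hypothetical names chosen for illustration. It assumes each task attempt stages its output privately, and that the first attempt to commit for a given task wins while duplicates are discarded.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy in-memory model of a task-level commit; NOT Hadoop's OutputCommitter API. */
public class TaskCommitSketch {

    /** Visible output: task id -> rows published by the winning attempt. */
    private final Map<Integer, List<String>> committed = new TreeMap<>();

    /** Staged output keyed by "taskId/attemptId"; invisible until committed. */
    private final Map<String, List<String>> pending = new HashMap<>();

    private static String key(int taskId, int attemptId) {
        return taskId + "/" + attemptId;
    }

    /** A task attempt stages a row privately; nothing becomes visible yet. */
    public synchronized void write(int taskId, int attemptId, String row) {
        pending.computeIfAbsent(key(taskId, attemptId), k -> new ArrayList<>()).add(row);
    }

    /**
     * Publishes an attempt's staged rows. The first attempt to commit for a
     * task wins; a later duplicate attempt is discarded.
     * Returns true only if this attempt's output became visible.
     */
    public synchronized boolean commitTask(int taskId, int attemptId) {
        List<String> rows = pending.remove(key(taskId, attemptId));
        if (rows == null || committed.containsKey(taskId)) {
            return false; // nothing staged, or another attempt already won
        }
        committed.put(taskId, rows); // publish the staged list as-is, no copy or rename
        return true;
    }

    /** Discards the staged output of a failed or losing attempt. */
    public synchronized void abortTask(int taskId, int attemptId) {
        pending.remove(key(taskId, attemptId));
    }

    /** Returns all visible rows, ordered by task id. */
    public synchronized List<String> visibleRows() {
        List<String> out = new ArrayList<>();
        for (List<String> rows : committed.values()) {
            out.addAll(rows);
        }
        return out;
    }

    public static void main(String[] args) {
        TaskCommitSketch c = new TaskCommitSketch();
        c.write(0, 0, "a");
        c.write(0, 1, "a"); // speculative duplicate of task 0
        c.write(1, 0, "b");
        System.out.println(c.commitTask(0, 0)); // true
        System.out.println(c.commitTask(0, 1)); // false
        System.out.println(c.commitTask(1, 0)); // true
        System.out.println(c.visibleRows());    // [a, b]
    }
}
```

In this model the speculative duplicate of task 0 never reaches the visible output, which mirrors the "should not step on each other" property, and publishing is a single hand-off of the staged rows rather than a second write. How a real committer achieves this on S3 is outside what the sketch shows.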