[ https://issues.apache.org/jira/browse/HIVE-1620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920788#action_12920788 ]

Richard Cole commented on HIVE-1620:
------------------------------------

Hi Joydeep,

The patch as proposed makes no attempt to clean up the output directory in the 
case that the jobflow fails. 

If I write to a table or directory in Amazon S3 and the Hadoop job ultimately 
fails, then the directory contents will have been modified: previous results 
will have been removed. Consequently, it is not easy to preserve the property 
that a failed Hive statement implies no change to the destination. Given that 
the destination may change even though the statement fails, how important do 
you consider it that the result be no records, rather than a partial set of 
records?

I know that with HDFS you attempt to achieve atomicity by doing a directory 
move at the end of the job. However, Amazon S3 doesn't have an atomic directory 
move, so this isn't possible. Writing directly to S3 gives a large efficiency 
gain without making the situation worse than it is today. It is important to 
recognise, however, the different semantics of the two file stores.
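To make the distinction concrete, here is a minimal sketch (hypothetical names, not Hive's actual committer code) of the rename-based commit pattern that HDFS-backed jobs rely on: write everything to a staging directory, then publish it with a single atomic directory rename. On S3 a "directory move" is a copy-and-delete of every key, so no equivalent atomic publish step exists.

```python
import os
import tempfile

def commit_by_rename(staging_dir, final_dir, filename, data):
    """Write output into a staging directory, then publish it with one
    atomic rename. Readers see either no output or complete output,
    never a partially written directory. S3 has no atomic rename, so
    this guarantee is unavailable when writing there directly."""
    tmp_path = os.path.join(staging_dir, filename)
    with open(tmp_path, "w") as f:
        f.write(data)
    # Atomic on a local/POSIX file system (and on HDFS): the staging
    # directory becomes the final directory in a single metadata operation.
    os.rename(staging_dir, final_dir)

base = tempfile.mkdtemp()
staging = os.path.join(base, "_staging")
final = os.path.join(base, "output")
os.makedirs(staging)
commit_by_rename(staging, final, "part-00000", "rows...\n")
```

After the rename, the staging directory is gone and the final directory contains the complete result; writing directly to S3 skips this step, which is exactly why a failed job can leave partial output behind.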

If you think it is important to output empty results rather than partial 
results, we can look into that. Where do you think is the best place in Hive to 
react to the failure of a job, for example by cleaning up spurious output from 
successful task attempts?

regards,

Richard.

> Patch to write directly to S3 from Hive
> ---------------------------------------
>
>                 Key: HIVE-1620
>                 URL: https://issues.apache.org/jira/browse/HIVE-1620
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: Vaibhav Aggarwal
>            Assignee: Vaibhav Aggarwal
>         Attachments: HIVE-1620.patch
>
>
> We want to submit a patch to Hive which allows users to write files directly 
> to S3.
> This patch allows users to specify an S3 location as the table output 
> location, and hence eliminates the need to copy data from HDFS to S3.
> Users can run Hive queries directly over the data stored in S3.
> This patch helps integrate Hive with S3 better and more quickly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.