[ https://issues.apache.org/jira/browse/HIVE-1620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920808#action_12920808 ]
Joydeep Sen Sarma commented on HIVE-1620:
-----------------------------------------

I agree that the speed efficiency may be worth the tradeoff in consistency. As you say, the messaging is critical. Can we gate this feature on a new Hive option that makes the user conscious of this tradeoff?

Regarding the cleanup: please look at the jobClose method in FileSinkOperator (I think). If the Hive client is still functioning at the time the job fails, we can make an attempt to clean things up there, assuming that the file names are unique, which I am not sure about right now because we made some changes to shorten file names (changes that might have to be undone for this feature).

One thing we have experienced in the past is that Hadoop tasks continue to do work even after the job is technically 'complete'. So while the cleanup can help the 99% use case, there will be marginal cases where the output directory gets written to when it shouldn't. So having this gated on an option would still be worthwhile IMHO (for users who cannot afford the speed-accuracy tradeoff).

> Patch to write directly to S3 from Hive
> ---------------------------------------
>
>                 Key: HIVE-1620
>                 URL: https://issues.apache.org/jira/browse/HIVE-1620
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: Vaibhav Aggarwal
>            Assignee: Vaibhav Aggarwal
>         Attachments: HIVE-1620.patch
>
>
> We want to submit a patch to Hive which allows users to write files directly
> to S3.
> This patch allows users to specify an S3 location as the table output location
> and hence eliminates the need to copy data from HDFS to S3.
> Users can run Hive queries directly over the data stored in S3.
> This patch helps integrate Hive with S3 better and more quickly.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
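For readers following along: with a patch like this applied, pointing a table's output location at S3 would presumably look something like the sketch below. This is a hypothetical illustration, not code from the attachment; the table name, bucket, path, and the `s3n://` scheme are all assumptions for the example.

```sql
-- Hypothetical sketch: table, columns, and bucket/path are illustrative only.
CREATE EXTERNAL TABLE page_views (
  view_time STRING,
  user_id   BIGINT,
  url       STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3n://my-bucket/warehouse/page_views/';

-- With direct S3 writes, the INSERT's output lands in the S3 location
-- itself, eliminating the separate HDFS-to-S3 copy step.
INSERT OVERWRITE TABLE page_views
SELECT view_time, user_id, url FROM staging_views;
```

This is exactly the path where the consistency tradeoff discussed above bites: if the job fails partway through, partially written files may remain in the S3 output location unless the jobClose-style cleanup succeeds.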