[ 
https://issues.apache.org/jira/browse/HIVE-14271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15651284#comment-15651284
 ] 

Steve Loughran commented on HIVE-14271:
---------------------------------------

Strategy 2 will eliminate one rename, which is good given that rename costs 
are O(data). However, there's still one rename to go.

There's still the overhead of copying the data from scratch to final. This 
shouldn't be done in the client-side code, as object store COPY operations 
happen server side; they're what rename() uses. If the renames of the files in 
a directory are issued in parallel, the overall rename can be sped up 
significantly; this works precisely because you can hold open the HTTP 
connections for the copy calls without much cost in network traffic.
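
A minimal sketch of the parallel rename, to make the idea concrete. Everything 
named here is hypothetical: ParallelRename, renameAll() and the Renamer 
interface are illustrative only, with Renamer standing in for whatever 
filesystem rename call is actually used (for example Hadoop's 
FileSystem.rename(Path, Path)). It is not Hive's code, just the general shape 
of submitting one rename per file to a thread pool and waiting for all of them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelRename {

    /** Hypothetical stand-in for the real filesystem rename call. */
    public interface Renamer {
        boolean rename(String src, String dst) throws Exception;
    }

    /**
     * Renames every source file into destDir, issuing the renames in parallel.
     * Returns the number of renames that reported success.
     */
    public static int renameAll(List<String> sources, String destDir,
                                Renamer renamer, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Boolean>> pending = new ArrayList<>();
            for (String src : sources) {
                // Keep the file name, swap the parent directory.
                String name = src.substring(src.lastIndexOf('/') + 1);
                String dst = destDir + "/" + name;
                Callable<Boolean> task = () -> renamer.rename(src, dst);
                pending.add(pool.submit(task));
            }
            int renamed = 0;
            for (Future<Boolean> result : pending) {
                // get() blocks until that rename has completed.
                if (result.get()) {
                    renamed++;
                }
            }
            return renamed;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException("interrupted while renaming", e);
        } catch (ExecutionException e) {
            throw new RuntimeException("rename failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Against an object store each task spends nearly all its time waiting on the 
server-side copy, so the pool size can reasonably be larger than the number 
of cores; the right value is something to measure, not something this sketch 
claims to know.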

> FileSinkOperator should not rename files to final paths when S3 is the 
> default destination
> ------------------------------------------------------------------------------------------
>
>                 Key: HIVE-14271
>                 URL: https://issues.apache.org/jira/browse/HIVE-14271
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Sergio Peña
>            Assignee: Sergio Peña
>
> FileSinkOperator does a rename of {{outPaths -> finalPaths}} when it finishes 
> writing all rows to a temporary path. The problem is that S3 does not support 
> renaming.
> Two options can be considered:
> a. Use a copy operation instead. After FileSinkOperator writes all rows to 
> outPaths, the commit method will do a copy() call instead of move().
> b. Write row by row directly to the S3 path (see HIVE-1620). This may 
> perform better, but we should take care of the cleanup part in case 
> of writing errors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
