[ 
https://issues.apache.org/jira/browse/HIVE-14776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15501920#comment-15501920
 ] 

Rajesh Balamohan edited comment on HIVE-14776 at 9/19/16 12:51 AM:
-------------------------------------------------------------------

Have you tried the "--hiveconf fs.trash.interval=0" setting to avoid sending 
data to the trash folder? In HDFS the move to trash is a cheap rename, but in 
S3 it can be expensive depending on the amount of data being moved.
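As a sketch (assuming the Hive CLI or a similar client; adjust for your setup), disabling the trash for a session might look like:

```shell
# Hypothetical example: start the Hive CLI with the trash disabled, so that
# DROP TABLE / INSERT OVERWRITE removes data directly instead of moving it
# into a .Trash directory. fs.trash.interval=0 turns the trash feature off.
hive --hiveconf fs.trash.interval=0

# Or, within an existing session:
#   SET fs.trash.interval=0;
```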

S3A writes to the local filesystem and uploads the data only at the end of the 
close() call, so internally it incurs two copies. There is a fast S3A output 
stream that buffers and streams data as it is written, but it has its own set 
of issues and often OOMs due to memory-management problems. HADOOP-13560 tries 
to address large file uploads. 
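For reference, the fast output stream mentioned above is typically switched on via an S3A property; a minimal sketch (property names should be verified against your Hadoop version's S3A documentation, as they changed around HADOOP-13560):

```shell
# Sketch only -- property names vary across Hadoop versions; check the
# S3A documentation for your release before relying on these.
hive --hiveconf fs.s3a.fast.upload=true \
     --hiveconf fs.s3a.fast.upload.buffer=disk   # buffering mode added by HADOOP-13560
```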




was (Author: rajesh.balamohan):
Have you tried with "--hiveconf fs.trash.interval=0" setting?

> Skip 'distcp' call when copying data from HDFS to S3
> ----------------------------------------------------
>
>                 Key: HIVE-14776
>                 URL: https://issues.apache.org/jira/browse/HIVE-14776
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Hive
>            Reporter: Sergio Peña
>            Assignee: Sergio Peña
>         Attachments: HIVE-14776.1.patch, HIVE-14776.2.patch
>
>
> Hive uses 'distcp' to copy files in parallel between HDFS encryption zones 
> when the {{hive.exec.copyfile.maxsize}} threshold is lower than the file to 
> copy. This 'distcp' is also executed when copying to S3, but it is causing 
> slower copies.
> We should not invoke distcp when copying to blobstore systems.
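For context, the {{hive.exec.copyfile.maxsize}} threshold quoted above is a size in bytes; a hedged sketch of raising it so that more files are copied directly rather than via distcp (the value here is purely illustrative):

```shell
# Illustrative only: raise the threshold (in bytes) below which Hive copies
# files directly instead of launching a distcp job for the copy.
hive --hiveconf hive.exec.copyfile.maxsize=33554432   # 32 MB
```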



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
