[ https://issues.apache.org/jira/browse/HIVE-14776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15498177#comment-15498177 ]
Sergio Peña edited comment on HIVE-14776 at 9/17/16 4:51 AM:
-------------------------------------------------------------

You're right, distcp does not use S3 as a temporary place. While debugging the code, I saw a '/user/hdfs/.Trash' directory created on S3 with data files being created in it, but after more investigation, I saw that they were copied by Hive when using INSERT OVERWRITE (old data being backed up).

Anyway, distcp is still slower than not using distcp at all. I have no idea why. I ran several tests with different file sizes (see the times below for copying a single file):

{noformat}
1G
S3 with distcp: 93s
S3 with no distcp: 37s

5G
S3 with distcp: 255s
S3 with no distcp: 147s
{noformat}

INSERT ... SELECT statements are going to create several files depending on the MR jobs and HDFS block sizes, and they might be slower than the 5G case.

The S3A adapter should already manage multi-part uploads using the Amazon API. Perhaps this is why distcp and s3a do not work well together?
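As a rough illustration of the behavior proposed in this issue, here is a minimal sketch of the copy decision: use distcp only when the file exceeds the configured size threshold and the destination is not a blobstore. The class name, method names, and the set of blobstore schemes are hypothetical and are not taken from the attached patches; the threshold value in `main` is illustrative only.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not Hive's actual code: decides whether a copy
// should be delegated to distcp or done as a plain file system copy.
public class CopyDecision {

    // Assumption: URI schemes treated as blobstores for this sketch.
    private static final Set<String> BLOBSTORE_SCHEMES =
        new HashSet<>(Arrays.asList("s3", "s3a", "s3n"));

    // True when the destination URI uses one of the assumed blobstore schemes.
    static boolean isBlobstore(URI dest) {
        String scheme = dest.getScheme();
        return scheme != null
            && BLOBSTORE_SCHEMES.contains(scheme.toLowerCase(Locale.ROOT));
    }

    // distcp is chosen only for non-blobstore destinations, and only when
    // the file is larger than the threshold (the role played by
    // hive.exec.copyfile.maxsize in the issue description).
    static boolean shouldUseDistCp(long fileSizeBytes, long maxSizeBytes, URI dest) {
        if (isBlobstore(dest)) {
            return false;
        }
        return fileSizeBytes > maxSizeBytes;
    }

    public static void main(String[] args) {
        long maxSize = 32L * 1024 * 1024;   // illustrative threshold in bytes
        long fiveGiB = 5L * 1024 * 1024 * 1024;

        URI hdfsDest = URI.create("hdfs://namenode:8020/warehouse/t1");
        URI s3Dest = URI.create("s3a://bucket/warehouse/t1");

        System.out.println("hdfs, 5G: " + shouldUseDistCp(fiveGiB, maxSize, hdfsDest));
        System.out.println("s3a,  5G: " + shouldUseDistCp(fiveGiB, maxSize, s3Dest));
    }
}
```

With a check like this, the large copy to HDFS would still go through distcp, while the copy to S3 would fall back to the regular copy path regardless of file size.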
> Skip 'distcp' call when copying data from HDFS to S3
> ----------------------------------------------------
>
>                 Key: HIVE-14776
>                 URL: https://issues.apache.org/jira/browse/HIVE-14776
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Hive
>            Reporter: Sergio Peña
>            Assignee: Sergio Peña
>         Attachments: HIVE-14776.1.patch, HIVE-14776.2.patch
>
>
> Hive uses 'distcp' to copy files in parallel between HDFS encryption zones
> when the size of the file to copy exceeds the {{hive.exec.copyfile.maxsize}}
> threshold. This 'distcp' is also executed when copying to S3, where it makes
> the copies slower.
> We should not invoke distcp when copying to blobstore systems.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)