Steve Loughran created HADOOP-16189: ---------------------------------------
             Summary: S3A copy/rename of large files to be parallelized as a multipart operation
                 Key: HADOOP-16189
                 URL: https://issues.apache.org/jira/browse/HADOOP-16189
             Project: Hadoop Common
          Issue Type: Sub-task
          Components: fs/s3
    Affects Versions: 3.2.0
            Reporter: Steve Loughran


The AWS docs on [copying|https://docs.aws.amazon.com/AmazonS3/latest/dev/CopyingObjectsUsingAPIs.html] say:
* for a file < 5GB, the copy can be done as a single operation
* for a file > 5GB, you MUST use the multipart API.

But even for files < 5GB, a single-request copy is a really slow operation, and if HADOOP-16188 is to be believed, there's not enough retrying.

Even if the transfer manager does switch to multipart copies at some size, just as we do our writes in 32-64 MB blocks, we can do the same for file copy.

Something like

{code}
l = len(src)
if l < fs.s3a.block.size:
  single copy
else:
  split the file into blocks, initiate the multipart upload, then execute each
  block copy as an operation in the S3A thread pool; once all blocks are done,
  complete the operation.
{code}

+ do retries on individual block copies, so the failure of one block doesn't force a retry of the whole upload.

This is potentially more complex than it sounds, as
* there's the need to track the ongoing copy operation's state
* failures must be handled (abort, etc.)
* the if-modified/version headers should be used to fail fast if the source file changes partway through the copy
* if len(src)/fs.s3a.block.size > the maximum part count, a bigger block size must be used
* we may need to fall back to the classic single-request operation

Overall, what sounds simple could get complex fast, or at least become a bigger piece of code. Needs some PoC of the speedup before attempting.
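For illustration only, here is a minimal sketch of what the parallelized copy could look like against the low-level multipart-copy calls of the AWS SDK for Java v1 (initiateMultipartUpload / copyPart / completeMultipartUpload). The class name, the part-size handling and the bare ExecutorService are placeholders rather than the proposed S3A integration; a real patch would route requests through the S3A invoker/retry logic and instrumentation, use the existing block-size configuration, and add the if-match/version constraints on each part copy.

{code}
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.*;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

/**
 * Illustrative sketch only: parallel multipart copy of a single object.
 * Files below the multipart threshold would still use a single copyObject().
 */
public class ParallelCopySketch {

  /** S3 limit on the number of parts in a multipart upload. */
  private static final long MAX_PARTS = 10000;

  public static void multipartCopy(AmazonS3 s3,
      String srcBucket, String srcKey,
      String dstBucket, String dstKey,
      long partSize,
      ExecutorService pool) throws Exception {

    long len = s3.getObjectMetadata(srcBucket, srcKey).getContentLength();

    // If len/partSize would exceed the part-count limit, grow the part size.
    if (len / partSize >= MAX_PARTS) {
      partSize = (len / MAX_PARTS) + 1;
    }

    String uploadId = s3.initiateMultipartUpload(
        new InitiateMultipartUploadRequest(dstBucket, dstKey)).getUploadId();
    try {
      List<Future<PartETag>> futures = new ArrayList<>();
      int partNumber = 1;
      for (long offset = 0; offset < len; offset += partSize, partNumber++) {
        final long firstByte = offset;
        final long lastByte = Math.min(offset + partSize, len) - 1;
        final int part = partNumber;
        // Each block copy is an independent task; a failure here can be
        // retried on its own without restarting the whole upload.
        // A real implementation should also set copy-source-if-match /
        // source version id constraints so the copy fails fast if the
        // source object changes partway through.
        futures.add(pool.submit((Callable<PartETag>) () ->
            s3.copyPart(new CopyPartRequest()
                .withSourceBucketName(srcBucket)
                .withSourceKey(srcKey)
                .withDestinationBucketName(dstBucket)
                .withDestinationKey(dstKey)
                .withUploadId(uploadId)
                .withPartNumber(part)
                .withFirstByte(firstByte)
                .withLastByte(lastByte))
              .getPartETag()));
      }
      List<PartETag> etags = new ArrayList<>();
      for (Future<PartETag> f : futures) {
        etags.add(f.get());
      }
      s3.completeMultipartUpload(
          new CompleteMultipartUploadRequest(dstBucket, dstKey, uploadId, etags));
    } catch (Exception e) {
      // Abort on failure so the incomplete upload doesn't linger (and bill).
      s3.abortMultipartUpload(
          new AbortMultipartUploadRequest(dstBucket, dstKey, uploadId));
      throw e;
    }
  }
}
{code}

Aborting the upload on any part failure keeps incomplete multipart uploads from accumulating; whether to retry a failed part in place before giving up and aborting is exactly the per-block retry policy question raised above.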