[ https://issues.apache.org/jira/browse/FLINK-25200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478648#comment-17478648 ]
Anton Kalashnikov commented on FLINK-25200:
-------------------------------------------

I ran some tests comparing `upload files` vs `copy files` to S3; the results are below. I don't see a big difference between upload and copy. Note, however, that the upload test did not read anything from the local disk: the data was already prepared in memory, so real upload cases will take even longer.

For the test implementation, I used `FSDataOutputStream` (as I understand it, it uses putObject under the hood) for uploading the data and `AmazonS3#copyObject` for the copy.

I also noticed that `copyObject` is more sensitive to `socketTimeout`, since the request waits for the operation to finish on the S3 side, which can take a while. So if we decide to implement copy for S3, we need to make sure this timeout is configured properly.

I didn't test it, but it probably also makes sense to check the case where we upload/copy from many machines at once. As I understand it, that is exactly our case.
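For anyone re-running the comparison, the summary rows in the results below (median, mean, min, max) can be recomputed from the raw samples. The following is only a minimal sketch and not the code used for the test: the class name `TimingSummary` and its methods are hypothetical, and it assumes integer truncation for the mean and for the even-count median, which happens to reproduce the 512MB rows.

```java
import java.util.Arrays;

/** Hypothetical helper: recomputes summary rows from raw timing samples. */
final class TimingSummary {

    /** Median with integer truncation; an even count averages the two middle samples. */
    static long median(long[] samples) {
        long[] sorted = samples.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int n = sorted.length;
        return n % 2 == 1
                ? sorted[n / 2]
                : (sorted[n / 2 - 1] + sorted[n / 2]) / 2;
    }

    /** Arithmetic mean, truncated to a whole number (assumption, see lead-in). */
    static long mean(long[] samples) {
        return Arrays.stream(samples).sum() / samples.length;
    }

    static long min(long[] samples) {
        return Arrays.stream(samples).min().getAsLong();
    }

    static long max(long[] samples) {
        return Arrays.stream(samples).max().getAsLong();
    }

    /** Formats one row in the same layout as the results in this comment. */
    static String row(String name, long upload, long copy) {
        return String.format("%s(upload | copy) :: %d | %d", name, upload, copy);
    }
}
```

Calling `TimingSummary.row("Median", TimingSummary.median(upload), TimingSummary.median(copy))` with the two 512MB raw arrays yields the same text as the corresponding `Median` row.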
----
*512MB* :
Median(upload | copy) :: 4537 | 4739
Mean(upload | copy) :: 5175 | 4571
Min(upload | copy) :: 4365 | 3209
Max(upload | copy) :: 16679 | 7223
Raw upload :: [16679, 4687, 4554, 4675, 4469, 4708, 4666, 4953, 4392, 4505, 4469, 4483, 4600, 4641, 4365, 4508, 4444, 4521, 4395, 4800]
Raw copy :: [4882, 4893, 4717, 5443, 5643, 3755, 3500, 3411, 4923, 5678, 3334, 5346, 4364, 3209, 7223, 4930, 4762, 4631, 3212, 3572]
----
*1024MB* :
Median(upload | copy) :: 9227 | 8003
Mean(upload | copy) :: 9161 | 8143
Min(upload | copy) :: 8597 | 6150
Max(upload | copy) :: 9769 | 12075
Raw upload :: [9719, 9577, 9471, 9156, 9415, 9372, 9769, 8631, 9530, 9256, 9278, 9422, 8690, 8718, 8597, 9198, 8636, 9076, 8995, 8723]
Raw copy :: [9975, 9338, 10134, 6655, 6351, 6150, 6715, 6403, 9591, 12075, 9391, 9336, 6570, 6598, 6459, 9552, 9292, 9427, 6310, 6552]
----
*1536MB* :
Median(upload | copy) :: 13432 | 14243
Mean(upload | copy) :: 13474 | 18221
Min(upload | copy) :: 12590 | 9184
Max(upload | copy) :: 15073 | 80669
Raw upload :: [14362, 13249, 13547, 13117, 13496, 14310, 13615, 13448, 13253, 15073, 13598, 12905, 13367, 12590, 13076, 13275, 12676, 13577, 13416, 13537]
Raw copy :: [9593, 14258, 16861, 15293, 9399, 14349, 14297, 38705, 9509, 38107, 9184, 9343, 14229, 10011, 9747, 80669, 10264, 9704, 14516, 16395]
----
*2048MB* :
Median(upload | copy) :: 17905 | 13381
Mean(upload | copy) :: 18133 | 15410
Min(upload | copy) :: 16714 | 11990
Max(upload | copy) :: 22116 | 20242
Raw upload :: [17859, 18576, 17697, 18226, 18620, 17108, 17881, 22116, 18486, 17573, 18444, 17785, 18088, 17653, 16714, 18182, 19455, 17149, 17129, 17929]
Raw copy :: [19397, 20242, 18637, 19174, 19832, 12752, 12954, 13136, 17303, 12760, 13685, 13609, 13153, 12921, 11990, 19949, 18535, 13046, 12127, 13007]
----
CC: [~danny.cranmer] , [~pnowojski]

> Implement duplicating for s3 filesystem
> ---------------------------------------
>
>                 Key: FLINK-25200
>                 URL: https://issues.apache.org/jira/browse/FLINK-25200
>             Project: Flink
>          Issue Type: Sub-task
>          Components: FileSystems
>            Reporter: Dawid Wysakowicz
>            Priority: Major
>             Fix For: 1.15.0
>
>
> We can use https://docs.aws.amazon.com/AmazonS3/latest/API/API_CopyObject.html

--
This message was sent by Atlassian Jira
(v8.20.1#820001)