[
https://issues.apache.org/jira/browse/HADOOP-19295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17886188#comment-17886188
]
Steve Loughran commented on HADOOP-19295:
-----------------------------------------
Now with the connection request timeout set to 15m: no errors, just the
IOStatistics (my config always prints these):
{code}
> time bin/hadoop fs -put
> ../../../downloads/hadoop-3.4.1-RC2/hadoop-3.4.1.tar.gz
> s3a://stevel-london/hadoop-3.4.1.tar.gz
2024-09-30 19:22:10,062 [shutdown-hook-0] INFO statistics.IOStatisticsLogging
(IOStatisticsLogging.java:logIOStatisticsAtLevel(269)) - IOStatistics:
counters=((action_executor_acquired=15)
(action_http_head_request=6)
(audit_request_execution=45)
(audit_span_creation=8)
(files_copied=1)
(files_copied_bytes=973970699)
(files_created=1)
(files_deleted=1)
(filesystem_close=1)
(filesystem_initialization=1)
(multipart_upload_completed=1)
(multipart_upload_part_put=15)
(object_copy_requests=1)
(object_delete_objects=1)
(object_delete_request=1)
(object_list_request=3)
(object_metadata_request=6)
(object_multipart_initiated=2)
(object_put_bytes=973970699)
(object_put_request_completed=15)
(op_create=1)
(op_exists=1)
(op_get_file_status=3)
(op_get_file_status.failures=2)
(op_glob_status=1)
(op_rename=1)
(store_client_creation=3)
(store_io_request=45)
(stream_write_block_uploads=30)
(stream_write_bytes=973970699)
(stream_write_queue_duration=385122)
(stream_write_total_data=1947941398)
(stream_write_total_time=2121277));
gauges=();
minimums=((action_executor_acquired.min=0)
(action_http_head_request.min=30)
(filesystem_close.min=2086)
(filesystem_initialization.min=459)
(object_delete_request.min=45)
(object_list_request.min=33)
(object_multipart_initiated.min=133)
(op_create.min=13)
(op_exists.min=1)
(op_get_file_status.failures.min=70)
(op_get_file_status.min=38)
(op_glob_status.min=473)
(op_rename.min=2450)
(store_client_creation.min=8)
(store_io_rate_limited_duration.min=0));
maximums=((action_executor_acquired.max=83568)
(action_http_head_request.max=392)
(filesystem_close.max=2086)
(filesystem_initialization.max=459)
(object_delete_request.max=45)
(object_list_request.max=52)
(object_multipart_initiated.max=133)
(op_create.max=13)
(op_exists.max=1)
(op_get_file_status.failures.max=457)
(op_get_file_status.max=38)
(op_glob_status.max=473)
(op_rename.max=2450)
(store_client_creation.max=420)
(store_io_rate_limited_duration.max=0));
means=((action_executor_acquired.mean=(samples=30, sum=770232, mean=25674.4000))
(action_http_head_request.mean=(samples=6, sum=564, mean=94.0000))
(filesystem_close.mean=(samples=1, sum=2086, mean=2086.0000))
(filesystem_initialization.mean=(samples=1, sum=459, mean=459.0000))
(object_delete_request.mean=(samples=1, sum=45, mean=45.0000))
(object_list_request.mean=(samples=3, sum=125, mean=41.6667))
(object_multipart_initiated.mean=(samples=2, sum=260, mean=130.0000))
(op_create.mean=(samples=1, sum=13, mean=13.0000))
(op_exists.mean=(samples=1, sum=1, mean=1.0000))
(op_get_file_status.failures.mean=(samples=2, sum=527, mean=263.5000))
(op_get_file_status.mean=(samples=1, sum=38, mean=38.0000))
(op_glob_status.mean=(samples=1, sum=473, mean=473.0000))
(op_rename.mean=(samples=1, sum=2450, mean=2450.0000))
(store_client_creation.mean=(samples=3, sum=533, mean=177.6667))
(store_io_rate_limited_duration.mean=(samples=1, sum=0, mean=0.0000)));
________________________________________________________
Executed in 453.44 secs fish external
usr time 31.26 secs 87.00 micros 31.26 secs
sys time 6.51 secs 896.00 micros 6.51 secs
{code}
It did work, though it took 453 seconds, and each write was queued for an
average of 25.6 seconds before its upload could be initiated.
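For reference, the setting under test can be expressed in {{core-site.xml}}. This is a sketch of the configuration used for this run, assuming the standard Hadoop time-duration suffix syntax is accepted for this property:

{code}
<!-- core-site.xml sketch: raise the S3A per-request timeout for slow uplinks.
     15m is the value proposed in this issue, not the shipped default. -->
<property>
  <name>fs.s3a.connection.request.timeout</name>
  <value>15m</value>
</property>
{code}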
> S3A: fs.s3a.connection.request.timeout too low for large uploads over slow
> links
> --------------------------------------------------------------------------------
>
> Key: HADOOP-19295
> URL: https://issues.apache.org/jira/browse/HADOOP-19295
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3
> Affects Versions: 3.4.0, 3.4.1
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Priority: Major
>
> The value of {{fs.s3a.connection.request.timeout}} (default = 60s) is too low
> for large uploads over slow connections.
> I suspect something changed between the v1 and v2 SDK versions: put requests
> seem to have been exempt from the normal timeouts in v1. They are not in v2,
> and this now surfaces as failures to upload 1+ GB files over slower network
> connections. Smaller (for example 128 MB) files work.
> The parallel queuing of writes in the S3ABlockOutputStream helps create this
> problem: it queues multiple blocks at the same time, so each block only gets
> the available bandwidth divided by the number of active blocks; four blocks
> cut per-block capacity down to a quarter.
> The fix is straightforward: use a much bigger timeout. I'm going to propose
> 15 minutes. We need to strike a balance between upload time allocation and
> other requests timing out.
> I do worry about other consequences; we've found that this timeout exception
> is happy to hide the underlying causes of retry failures, so in fact the
> change may be better for everything except a server hanging after the HTTP
> request is initiated.
> Too bad we can't alter the timeout for different requests.
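As a rough illustration of the bandwidth-division effect the description raises: with four active blocks on a 2 MB/s uplink, a 64 MB part sees roughly 0.5 MB/s and so needs about 128 seconds on its own, well past a 60s request timeout. A sketch of the kind of tuning a deployment might apply until the default changes (property names as in the S3A documentation; the values are illustrative, not recommendations):

{code}
<!-- Sketch only: raise the per-request timeout and/or reduce the number of
     concurrently uploading blocks so each one gets more of the link. -->
<property>
  <name>fs.s3a.connection.request.timeout</name>
  <value>15m</value>
</property>
<property>
  <!-- default is 4; fewer active blocks means more bandwidth per block -->
  <name>fs.s3a.fast.upload.active.blocks</name>
  <value>2</value>
</property>
{code}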
--
This message was sent by Atlassian Jira
(v8.20.10#820010)