Hi Vino,

Thanks for your quick reply, but I think these two questions are different. The checkpoint
in that question eventually finished, whereas mine failed due to an S3 client timeout. As
you can see from my screenshot, the checkpoint failed within a short time.
Regarding the configuration: do you mean passing it as the program's input arguments? I
don't think that will work. At the very least, I would need a way to hand it to the S3
filesystem builder inside my program. Instead, I will ask whether it can be passed via
flink-conf.yaml, because that is where I put the global settings for the S3 filesystem,
and I thought there might be a simple way to support this setting just like the other
s3.xxx configs (see the sketch at the end of this mail).

I very much appreciate your answer and help.

Best,
Tony Wei

2018-08-29 11:51 GMT+08:00 vino yang <yanghua1...@gmail.com>:

> Hi Tony,
>
> A while ago, I answered a similar question. [1]
>
> You can try to increase this value appropriately. You can't put this
> configuration in flink-conf.yaml; you can put it in the submit command of
> the job [2], or in the configuration file you specify.
>
> [1]: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Why-checkpoint-took-so-long-td22364.html#a22375
> [2]: https://ci.apache.org/projects/flink/flink-docs-release-1.6/ops/cli.html
>
> Thanks, vino.
>
> Tony Wei <tony19920...@gmail.com> wrote on Wed, Aug 29, 2018 at 11:36 AM:
>
>> Hi,
>>
>> I ran into a checkpoint failure caused by an S3 exception:
>>
>>> org.apache.flink.fs.s3presto.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
>>> Your socket connection to the server was not read from or written to within
>>> the timeout period. Idle connections will be closed. (Service: Amazon S3;
>>> Status Code: 400; Error Code: RequestTimeout; Request ID: B8BE8978D3EFF3F5),
>>> S3 Extended Request ID: ePKce/MjMFPPNYi90rGdYmDw3blfvi0xR2CcJpCISEgxM92/6JZAU4whpfXeV6SfG62cnts0NBw=
>>
>> The full stack trace and a screenshot are provided in the attachment.
>>
>> My settings for the Flink cluster and job:
>>
>> - Flink version 1.4.0
>> - standalone mode
>> - 4 slots per TM
>> - presto s3 filesystem
>> - RocksDB state backend
>> - local SSD
>> - incremental checkpointing enabled
>>
>> There is nothing unusual in the log file besides the exception, and no high GC ratio
>> during the checkpoint procedure. Moreover, 3 of the 4 parts on that TM were still
>> uploaded successfully, so I couldn't find anything related to this failure. Has anyone
>> run into this problem before?
>>
>> Besides, I also found an issue in another AWS SDK [1] that mentions this S3 exception.
>> One reply said you can passively avoid the problem by raising the max client retries
>> config, and I found the corresponding config in Presto [2]. Can I just add
>> s3.max-client-retries: xxx in flink-conf.yaml to configure it? If not, how should I
>> override the default value of this configuration? Thanks in advance.
>>
>> Best,
>> Tony Wei
>>
>> [1] https://github.com/aws/aws-sdk-php/issues/885
>> [2] https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/s3/HiveS3Config.java#L218
>>
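P.S. To make my question concrete, below is a sketch of what I hope will work in my
flink-conf.yaml. The s3.access-key / s3.secret-key lines are just stand-ins for my
existing global S3 settings, and the s3.max-client-retries key is only my guess that the
presto filesystem forwards it to Presto's hive.s3.max-client-retries (from HiveS3Config);
whether it actually does is exactly what I'm asking.

    # flink-conf.yaml (presto s3 filesystem)
    # existing global S3 settings (placeholders for my real credentials)
    s3.access-key: <my-access-key>
    s3.secret-key: <my-secret-key>

    # The setting I would like to add; 10 is just an example value.
    # I assume (but am not sure) this is forwarded to Presto's
    # hive.s3.max-client-retries.
    s3.max-client-retries: 10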