Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

yidan zhao Wed, 16 Jun 2021 03:56:35 -0700

2: I use G1 and no full GC has occurred. Young GC count: 422, time:
142892, so GC does not look like the problem.
3: It is a streaming job.
4: I will try configuring taskmanager.network.retries, which defaults to
0, and taskmanager.network.netty.client.connectTimeoutSec, which defaults
to 120s.
5: I checked the number of network fds of the taskmanager; it is about
1000+, so I think it is a reasonable value.


1: I cannot be sure.
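
For item 4, the change to flink-conf.yaml would look roughly like the
sketch below. The two option names are the ones discussed in this thread;
the values are placeholders chosen only for illustration (assumptions, not
tested or recommended settings).

```yaml
# flink-conf.yaml (sketch; values are illustrative assumptions)

# Number of retry attempts for network communication.
# Default is 0; the value 3 here is only an example.
taskmanager.network.retries: 3

# Netty client connection timeout in seconds.
# Default is 120; the value 300 here is only an example.
taskmanager.network.netty.client.connectTimeoutSec: 300
```

In a standalone cluster the file is read at process start, so the
taskmanagers would need a restart for the change to take effect.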

Yingjie Cao <kevin.ying...@gmail.com> wrote on Wed, Jun 16, 2021 at 4:34 PM:
>
> Hi yidan,
>
> 1. Is the network stable?
> 2. Is there any GC problem?
> 3. Is it a batch job? If so, please use sort-shuffle, see [1] for more 
> information.
> 4. You may try configuring these two options: taskmanager.network.retries, 
> taskmanager.network.netty.client.connectTimeoutSec. More relevant options can 
> be found in the 'Data Transport Network Stack' section of [2].
> 5. If it is none of the above, it may be related to [3]; you may need to 
> check the number of TCP connections per TM and per node.
>
> Hope this helps.
>
> [1] 
> https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/batch/blocking_shuffle/
> [2] 
> https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/config/
> [3] https://issues.apache.org/jira/browse/FLINK-22643
>
> Best,
> Yingjie
>
> yidan zhao <hinobl...@gmail.com> wrote on Wed, Jun 16, 2021 at 3:36 PM:
>>
>> The attachment is the exception stack from Flink's web UI. Has anyone
>> else run into this problem?
>>
>> Flink 1.12 - Flink 1.13.1. Standalone cluster with 30 containers,
>> 28G of memory each.
