Hi yidan,

1. Is the network stable?
2. Is there any GC problem?
3. Is it a batch job? If so, please use sort-shuffle; see [1] for more information.
4. You may try configuring these two options: taskmanager.network.retries and taskmanager.network.netty.client.connectTimeoutSec. More relevant options can be found in the 'Data Transport Network Stack' section of [2].
5. If none of the above applies, it may be related to [3]; you may need to check the number of TCP connections per TM and per node.

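For items 3 and 4, the settings would go into flink-conf.yaml roughly as sketched below. The values are only illustrative, not tuned recommendations. The sort-shuffle key is an assumption based on my reading of [1] for Flink 1.13, so please verify it and the defaults against the docs for your version.

```yaml
# flink-conf.yaml (illustrative values, not recommendations)

# Item 4: number of retries for failed network connection attempts.
taskmanager.network.retries: 3

# Item 4: Netty client connect timeout, in seconds.
taskmanager.network.netty.client.connectTimeoutSec: 300

# Item 3 (batch jobs only, assumed key, see [1]): use sort-shuffle for
# blocking shuffles once the parallelism reaches this threshold.
# Setting it to 1 would enable sort-shuffle for all blocking shuffles.
taskmanager.network.sort-shuffle.min-parallelism: 1
```

These options are read when the TaskManagers start, so the cluster needs a restart after changing them.
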
Hope this helps.

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/batch/blocking_shuffle/
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/config/
[3] https://issues.apache.org/jira/browse/FLINK-22643

Best,
Yingjie

yidan zhao <hinobl...@gmail.com> wrote on Wed, Jun 16, 2021 at 3:36 PM:

> Attachment is the exception stack from flink's web-ui. Does anyone
> have also met this problem?
>
> Flink1.12 - Flink1.13.1. Standalone Cluster, include 30 containers,
> each 28G mem.