[ https://issues.apache.org/jira/browse/FLINK-15416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zhijiang updated FLINK-15416:
-----------------------------
    Affects Version/s: 1.11.1
                       1.11.0
                       1.10.1

> Add Retry Mechanism for PartitionRequestClientFactory.ConnectingChannel
> -----------------------------------------------------------------------
>
>                 Key: FLINK-15416
>                 URL: https://issues.apache.org/jira/browse/FLINK-15416
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Network
>    Affects Versions: 1.10.0, 1.10.1, 1.11.0, 1.11.1
>            Reporter: Zhenqiu Huang
>            Assignee: Zhenqiu Huang
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.12.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> We run a Flink job with 256 TMs in production. The job internally has keyBy
> logic, so it builds 256 * 256 communication channels. An outage happened
> when an internal chip link broke in one of the network switches connecting
> these machines. During the outage, Flink could not restart successfully
> because there was always an exception like "Connecting the channel
> failed: Connecting to remote task manager + '****/10.14.139.6:41300' has
> failed. This might indicate that the remote task manager has been lost."
> After a deep investigation with the network infrastructure team, we found
> that 6 switches connect these machines. Each switch has 32 physical
> links. Every socket is assigned round-robin to one of the links for load
> balancing. Thus, on average about 256 * 256 / (6 * 32 * 2) = 170
> channels are assigned to the broken link. The issue lasted for 4 hours
> until we found the broken link and restarted the problematic switch.
> Given this, we found that retrying channel creation would help to resolve
> this issue. For our networking topology, we can set the retry count to 2,
> as 170 / (132 * 132) < 1, which means that after retrying twice none of the
> 170 channels will be assigned to the broken link in the average case.
> I think it is a valuable fix for this kind of partial network partition.
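
The arithmetic and the proposed retry can be sketched in Java as below. This is only an illustration under stated assumptions, not the change made for this issue: the class ConnectRetrySketch, the Connector interface, and both method names are hypothetical and are not part of Flink's PartitionRequestClientFactory. The denominator 6 * 32 * 2 is taken from the description as written.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical sketch; none of these names exist in Flink. */
final class ConnectRetrySketch {

    /** Hypothetical stand-in for whatever opens a channel to a remote task manager. */
    interface Connector<T> {
        T connect() throws IOException;
    }

    /**
     * Average number of channels that land on one broken link, using the
     * formula from the issue description: TMs * TMs / (switches * links * 2).
     */
    static double expectedChannelsOnBrokenLink(int taskManagers, int switches, int linksPerSwitch) {
        double channels = (double) taskManagers * taskManagers;
        return channels / (switches * linksPerSwitch * 2);
    }

    /**
     * Tries to connect once, then retries up to maxRetries more times.
     * Each new attempt opens a new socket, which under round-robin link
     * assignment is likely to be routed over a different physical link.
     */
    static <T> T connectWithRetry(Connector<T> connector, int maxRetries) {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must be >= 0");
        }
        IOException lastFailure = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return connector.connect();
            } catch (IOException e) {
                // Remember the failure and try again with a fresh connection.
                lastFailure = e;
            }
        }
        throw new UncheckedIOException(
                "Connecting failed after " + (maxRetries + 1) + " attempts", lastFailure);
    }
}
```

With the numbers from the description, expectedChannelsOnBrokenLink(256, 6, 32) evaluates 65536 / 384, which is where the figure of roughly 170 channels comes from. The sketch retries immediately; a real implementation might also want a delay or a per-attempt timeout, which the description does not specify.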
--
This message was sent by Atlassian Jira
(v8.3.4#803005)