[ https://issues.apache.org/jira/browse/FLINK-37271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17926316#comment-17926316 ]
Piotr Nowojski commented on FLINK-37271:
----------------------------------------

I'm not concerned about the network throughput. The overhead of extra ACKs would be negligible.

This has been considered in the past for various reasons. It has never been done because:
* It requires quite a lot of effort.
* It's not that important for the vast, vast majority of use cases. The fraction of job failovers due to network connectivity issues in healthy setups is minuscule.
* Most job failovers are caused by exceptions coming from the job. To preserve state consistency, such subtasks have to fail over, and that forces the job's whole region to fail over anyway.
* It would increase the network buffer requirement: output buffers would have to be kept in memory for longer, until they are acknowledged downstream. I would hope that's negligible, especially for setups using RocksDB. For high-throughput setups using HashMapStateBackend, it's hard to say; it would have to be measured/benchmarked.

If you would like to put effort into this, I would suggest first writing a FLIP proposal, publishing it on the dev mailing list, and having the discussion there. This change is large enough that it does require a FLIP.

> Add network channel reconnect capability
> ----------------------------------------
>
>                 Key: FLINK-37271
>                 URL: https://issues.apache.org/jira/browse/FLINK-37271
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Network
>            Reporter: Zhenqiu Huang
>            Priority: Minor
>             Fix For: 1.20.1, 1.20.2
>
>
> In our org, we use a security proxy to secure inter-host
> communication. During a proxy rollout, the channels between TMs are
> disconnected, which causes downtime. Besides this, we can't guarantee
> that the proxy is rolled out to all hosts at the same time, so a job
> could fail multiple times during the proxy rollout.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)