[jira] [Commented] (FLINK-37271) Add network channel reconnect capability

Piotr Nowojski (Jira) Wed, 12 Feb 2025 01:30:24 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-37271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17926316#comment-17926316
 ]


Piotr Nowojski commented on FLINK-37271:
----------------------------------------

I'm not concerned about the network throughput. The overhead of extra ACKs 
would be negligible.

This has been considered in the past for various reasons. It has never been 
done because:
* It requires quite a lot of effort.
* It's not that important for the vast, vast majority of use cases. The 
fraction of job failovers due to network connectivity issues in healthy setups 
is minuscule. 
    * Most job failovers are caused by exceptions coming from the job, and to 
preserve state consistency, such subtasks have to fail over, which forces the 
job's whole region to fail over anyway.
* It would increase the network buffer requirement: output buffers would have 
to be kept in memory for longer, until they are acknowledged downstream. I 
would hope that's negligible, especially for setups using RocksDB. For 
high-throughput setups using HashMapStateBackend, it's hard to say; it would 
have to be measured/benchmarked. 
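
To make the buffer-retention point concrete, here is a minimal sketch of the 
idea, not Flink's actual network stack. The class AckedOutputBuffer, its 
methods, and the cumulative-ACK scheme are all hypothetical and only assumed 
for illustration: a sender keeps each output buffer until the receiver 
acknowledges it, so that unacknowledged buffers can be replayed after a 
reconnect.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical sketch, NOT a Flink class: retains sent buffers until they are ACKed. */
final class AckedOutputBuffer {
    private static final class Entry {
        final long seq;
        final byte[] data;

        Entry(long seq, byte[] data) {
            this.seq = seq;
            this.data = data;
        }
    }

    // Buffers that were sent but not yet acknowledged, oldest first.
    private final Deque<Entry> unacked = new ArrayDeque<>();
    private final int capacity;
    private long nextSeq = 0;

    AckedOutputBuffer(int capacity) {
        this.capacity = capacity;
    }

    /** Returns the assigned sequence number, or -1 if the retention limit is reached. */
    long send(byte[] data) {
        if (unacked.size() >= capacity) {
            return -1; // caller has to wait: this is the extra memory pressure
        }
        long seq = nextSeq++;
        unacked.addLast(new Entry(seq, data));
        return seq;
    }

    /** Cumulative ACK (assumed scheme): releases every buffer with seq <= ackedSeq. */
    void ack(long ackedSeq) {
        while (!unacked.isEmpty() && unacked.peekFirst().seq <= ackedSeq) {
            unacked.removeFirst();
        }
    }

    /** Buffers that would have to be resent, in order, after a reconnect. */
    List<byte[]> pendingForReplay() {
        List<byte[]> pending = new ArrayList<>();
        for (Entry e : unacked) {
            pending.add(e.data);
        }
        return pending;
    }

    int retained() {
        return unacked.size();
    }
}
```

In this sketch the number of retained buffers grows with the ACK round-trip 
time, which is the extra memory cost mentioned above; how large it would be 
in a real job is exactly what would have to be benchmarked.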

If you would like to put effort into this, I would suggest first writing a 
FLIP proposal and publishing it on the dev mailing list, so that it can be 
discussed there. This change is large enough that it does require a FLIP.

> Add network channel reconnect capability
> ----------------------------------------
>
>                 Key: FLINK-37271
>                 URL: https://issues.apache.org/jira/browse/FLINK-37271
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Network
>            Reporter: Zhenqiu Huang
>            Priority: Minor
>             Fix For: 1.20.1, 1.20.2
>
>
> In our org, we are using a security proxy to achieve secured inter-host 
> communication. During a proxy rollout, channels between TMs will be 
> disconnected, which causes downtime. Besides this, we can't guarantee that 
> the proxy is rolled out to all of the hosts at the same time, so a job could 
> fail multiple times during the proxy rollout.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
