[
https://issues.apache.org/jira/browse/FLINK-20249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17238116#comment-17238116
]
Till Rohrmann commented on FLINK-20249:
---------------------------------------
Thanks for looking into this problem [~xintongsong] and [~fly_in_gis]. I agree
with you that it would be nice to deduce the actual resources from the
recovered resources; this could improve things a bit.
A question which came to my mind is whether we really need to support JM
failover w/o HA. I think if you want to tolerate JM failures, then you should
enable HA, because otherwise you will lose information about your submitted
jobs and the completed checkpoints. Moreover, with HA enabled, the TMs should
get notified about the new leader much faster. Hence, it sounds to me a bit as
if we are trying to improve the situation for an invalid use case.
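For reference, a minimal sketch of what enabling the Kubernetes HA services
could look like in flink-conf.yaml; the cluster id and storage directory are
placeholders:
{code:yaml}
# flink-conf.yaml (illustrative values)
kubernetes.cluster-id: my-flink-cluster  # placeholder cluster id
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
high-availability.storageDir: s3://flink/recovery  # placeholder; any supported FileSystem path works
{code}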
Independent of this, [~xintongsong] feel free to open a ticket for 1).
To sum things up, is there a real problem we have to fix here
[~jiang7chengzitc]?
> Rethink the necessity of the k8s internal Service even in non-HA mode
> ---------------------------------------------------------------------
>
> Key: FLINK-20249
> URL: https://issues.apache.org/jira/browse/FLINK-20249
> Project: Flink
> Issue Type: Improvement
> Components: Deployment / Kubernetes
> Affects Versions: 1.11.0
> Reporter: Ruguo Yu
> Priority: Minor
> Labels: pull-request-available
> Fix For: 1.12.0
>
> Attachments: k8s internal service - in english.pdf, k8s internal
> service - v2.pdf, k8s internal service.pdf
>
>
> In non-HA mode, Flink on k8s creates an internal Service that directs the
> communication from TaskManager Pods to the JobManager Pod, so that TM Pods
> can re-register with the new JM Pod once a JM Pod failover occurs.
> However, I recently ran an experiment and found that after a JM Pod failover
> (note: the new JM podIP has changed), k8s first creates new TM Pods and then
> destroys the old TM Pods after a period of time; the job is then rescheduled
> by the JM on the new TM Pods, which means the new TMs have registered with
> the JM.
> During this process the internal Service is active the whole time, but I
> think it is not necessary to keep it. In other words, we can weed out the
> internal Service and use the JM podIP for the TM Pods' communication with
> the JM Pod; in this case it would be consistent with HA mode.
> Finally, the related experiments are attached (k8s internal service.pdf).
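For context, a minimal sketch of what such an internal Service could look
like; the name, labels, and selector are illustrative rather than the exact
manifest Flink generates, while 6123/6124 are Flink's default RPC and blob
server ports:
{code:yaml}
apiVersion: v1
kind: Service
metadata:
  name: my-flink-cluster        # illustrative; TMs resolve this stable name
spec:
  clusterIP: None               # headless: DNS returns the JM pod IP directly
  selector:                     # illustrative labels matching the JM pod
    app: my-flink-cluster
    component: jobmanager
  ports:
    - name: jobmanager-rpc
      port: 6123                # default jobmanager.rpc.port
    - name: blob-server
      port: 6124                # default blob.server.port
{code}
Dropping this Service, as proposed, would mean pointing jobmanager.rpc.address
at the JM podIP directly, which matches how the TMs find the JM in HA mode.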
--
This message was sent by Atlassian Jira
(v8.3.4#803005)