[ https://issues.apache.org/jira/browse/FLINK-20249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237906#comment-17237906 ]
Xintong Song commented on FLINK-20249:
--------------------------------------

Thanks for the discussion and pulling me in.

I think it does make sense to have the internal service, in terms of allowing TMs to re-register to the new JM. IMO, [~jiang7chengzitc]'s experiment results reveal that Flink is currently not benefiting from reusing the old TMs, and we should find out why that is and whether it can be fixed.

[~fly_in_gis] and I did some investigation locally, and we found two issues that might relate to this problem (illustrative sketches for both points follow after the quoted issue description below).

1. After a JM failover and before the old TMs register, the resource manager allocates new resources for the job's resource requests. The problem here is that the RM does not add the recovered resources to its pending resources (i.e., SlotManagerImpl#pendingSlots). Since workers may have heterogeneous resources, the RM needs to know, before the workers register, which pending resources to add; for that, we would need to attach some information (e.g., the WorkerResourceSpec) to the metadata of the pod/container (e.g., a pod label/annotation, or a container tag).

2. It may take longer for old TMs to re-register than for new TMs to register. In non-HA mode, TMs wait for the heartbeat timeout (typically tens of seconds) to disconnect from the old JM before trying to connect to the new JM. This explains [~jiang7chengzitc]'s observation that the job is always scheduled to new TMs. In HA mode, TMs switch to the new JM earlier because they are notified of JM leader changes, but there is still a chance that the new TMs get registered first.

I believe (1) is the underlying problem that we want to solve, and (2) should no longer be a problem once (1) is fixed. If there's a consensus, I can open another ticket and try to work on it. WDYT? [~jiang7chengzitc] [~trohrmann] [~fly_in_gis]

> Rethink the necessity of the k8s internal Service even in non-HA mode
> ---------------------------------------------------------------------
>
>                 Key: FLINK-20249
>                 URL: https://issues.apache.org/jira/browse/FLINK-20249
>             Project: Flink
>          Issue Type: Improvement
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.11.0
>            Reporter: Ruguo Yu
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 1.12.0
>
>         Attachments: k8s internal service - in english.pdf, k8s internal service - v2.pdf, k8s internal service.pdf
>
>
> In non-HA mode, k8s creates an internal service that directs communication from the TaskManager pods to the JobManager pod, so that TM pods can re-register to the new JM pod once a JM pod failover occurs.
> However, I recently ran an experiment and found a problem: after a JM pod failover (note: the new JM pod IP has changed), k8s first creates new TM pods and destroys the old TM pods after a period of time. The job is then rescheduled by the JM onto the new TM pods, which means the new TMs have registered with the JM.
> During this process the internal service is active the whole time, but I think keeping it is not necessary. In other words, we can remove the internal service and let TM pods communicate with the JM pod via the JM pod IP; in that case, non-HA mode would be consistent with HA mode.
> Finally, the related experiments are described in the attachment (k8s internal service.pdf).
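To make point (1) more concrete, here is a minimal, self-contained Java sketch of the idea: encode the worker's resource spec into pod annotations when the pod is requested, and on JM failover decode the annotations of the recovered (still running) pods back into pending resources before allocating anything new. All names here (SimpleWorkerSpec, the annotation keys, the recovery flow) are hypothetical stand-ins for illustration, not Flink's actual SlotManagerImpl or WorkerResourceSpec API.

{code:java}
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical sketch: round-tripping a worker resource spec through
 * pod annotations so the RM can rebuild its pending resources
 * (pendingSlots) after a JM failover instead of requesting new pods.
 */
public class PendingSlotRecoverySketch {

    // Stand-in for Flink's WorkerResourceSpec (the real class has more fields).
    record SimpleWorkerSpec(double cpuCores, int totalMemoryMb) {}

    // Hypothetical annotation keys attached when the pod is created.
    static final String CPU_KEY = "flink.apache.org/worker-cpu";
    static final String MEM_KEY = "flink.apache.org/worker-memory-mb";

    // On worker request: encode the spec into the pod's annotations.
    static Map<String, String> toAnnotations(SimpleWorkerSpec spec) {
        Map<String, String> annotations = new HashMap<>();
        annotations.put(CPU_KEY, Double.toString(spec.cpuCores()));
        annotations.put(MEM_KEY, Integer.toString(spec.totalMemoryMb()));
        return annotations;
    }

    // On JM failover: decode the annotations of recovered pods back into
    // specs, so they count as pending resources before the old TMs
    // re-register and no duplicate pods are requested.
    static SimpleWorkerSpec fromAnnotations(Map<String, String> annotations) {
        return new SimpleWorkerSpec(
                Double.parseDouble(annotations.get(CPU_KEY)),
                Integer.parseInt(annotations.get(MEM_KEY)));
    }

    public static void main(String[] args) {
        SimpleWorkerSpec requested = new SimpleWorkerSpec(1.0, 1728);
        Map<String, String> podAnnotations = toAnnotations(requested);

        // Simulated failover: only the pod metadata survives the JM restart.
        SimpleWorkerSpec recovered = fromAnnotations(podAnnotations);
        List<SimpleWorkerSpec> pendingResources = List.of(recovered);

        System.out.println("Recovered pending worker: " + recovered);
        System.out.println("Pending resources: " + pendingResources.size());
    }
}
{code}

The point of attaching the spec to the pod metadata (rather than keeping it only in RM memory) is exactly the heterogeneous-resources contract above: after a failover, the pod itself is the only place the RM can learn what resources each recovered worker was supposed to provide.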
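And a purely illustrative sketch for the timing race in point (2): in non-HA mode the old TM only notices the JM is gone after the heartbeat timeout (heartbeat.timeout defaults to 50s), while a freshly started TM connects immediately, so the fresh TM usually wins the registration race. Class names and delays below are made up, with the timeout scaled down to milliseconds so the example runs quickly.

{code:java}
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

/**
 * Illustrative sketch of the registration race: in non-HA mode the old TM
 * detects the JM loss only after the heartbeat timeout, while a newly
 * started TM registers right away.
 */
public class RegistrationRaceSketch {

    static final long HEARTBEAT_TIMEOUT_MS = 50; // stands in for the 50s default

    public static void main(String[] args) throws Exception {
        AtomicReference<String> firstRegistered = new AtomicReference<>();

        // Old TM: must wait out the heartbeat timeout before it notices the
        // old JM is gone and tries the new JM address.
        CompletableFuture<Void> oldTm = CompletableFuture.runAsync(() -> {
            sleep(HEARTBEAT_TIMEOUT_MS);
            firstRegistered.compareAndSet(null, "old TM (after heartbeat timeout)");
        });

        // New TM: started by the RM's fresh allocation, registers immediately.
        CompletableFuture<Void> newTm = CompletableFuture.runAsync(() ->
                firstRegistered.compareAndSet(null, "new TM (fresh pod)"));

        CompletableFuture.allOf(oldTm, newTm).get(5, TimeUnit.SECONDS);

        // The job's slots go to whichever TM registered first -- almost
        // always the new one, matching the observed scheduling behavior.
        System.out.println("First to register: " + firstRegistered.get());
    }

    static void sleep(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
{code}

This is also why fixing (1) makes (2) moot: if the recovered pods already count as pending resources, the RM never starts the competing new TMs in the first place.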