You are right. Starting multiple JobManagers could help when the pod is deleted and there's not enough resources in the cluster to start a new one. For most cases, the JobManager container will be restarted locally without scheduling a new Kubernetes pod[1].
The "already exists" error comes from the fabric8 Kubernetes-client. It is somewhat reasonable because a same name ConfigMap might be already created manually beforehand. In the Flink use case, we could simply ignore this error. For the first exception "*Caused by: java.io.FileNotFoundException: /opt/flink/.kube/config (No such file or directory)*", I think you need to share the full log file of all the JobManagers. [1]. https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#restart-policy Best, Yang Tamir Sagi <tamir.s...@niceactimize.com> 于2022年9月8日周四 14:28写道: > Hey Yang, > > Thank you for fast response. > > I get your point but, assuming 3 Job managers are up, in case the leader > fails, one of the other 2 should become the new leader, no? > > If the cluster fails, the new leader should handle that. > > Another scenario could be that the Job manager stops(get killed by k8s > due to memory, CPU limitations, bugs etc...) while TMs are still > operating, and the cluster is active. In some cases, due to resources > limitation, k8s will not be able to get a new instance right away, until > auto-scale takes place(The pod remains in pending state). It seems like we > do achieve resilience by having HA enabled in Native k8s mode. > > What do you think? > > Given that you are running multiple JobManagers, it does not matter for > the "already exists" exception during leader election. > > Should we ignore such error? if so , it should be a warning then > > What about the 1st error we encountered regarding the kube/config file > exception? > > > Thank you so much, > Best, > Tamir > > ------------------------------ > *From:* Yang Wang <danrtsey...@gmail.com> > *Sent:* Thursday, September 8, 2022 7:08 AM > *To:* Tamir Sagi <tamir.s...@niceactimize.com> > *Cc:* user@flink.apache.org <user@flink.apache.org>; Lihi Peretz < > lihi.per...@niceactimize.com> > *Subject:* Re: [Flink 1.15.1 - Application mode native k8s Exception] - > Exception occurred while acquiring lock 'ConfigMapLock > > > *EXTERNAL EMAIL* > > > Given that you are running multiple JobManagers, it does not matter for > the "already exists" exception during leader election. > > BTW, I think running multiple JobManagers does not take enough advantages > when deploying Flink on Kubernetes. Because a new JobManager will be > started immediately once the old one crashed. > And Flink JobManager always needs to recover the job from the latest > checkpoint no matter how many JobManager are running. > > Best, > Yang > > Tamir Sagi <tamir.s...@niceactimize.com> 于2022年9月5日周一 21:48写道: > > Hey Yang, > > The flink-conf.yaml submitted to the cluster does not contain > "kubernetes.config.file" > at all. > In addition, I verified flink config maps under cluster's namespace do not > contain "kubernetes.config.file". > > In addition, we also noticed the following exception (appears to happen > sporadically) > > 2022-09-04T21:06:35,231][Error] {} [i.f.k.c.e.l.LeaderElector]: Exception > occurred while acquiring lock 'ConfigMapLock: dev-0-flink-jobs - > data-agg-events-insertion-cluster-config-map > (fa3dbbc5-1753-46cd-afaf-0baf8ff0947f)' > io.fabric8.kubernetes.client.extended.leaderelection.resourcelock.LockException: > Unable to create ConfigMapLock > > Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure > executing: POST at: > https://172.20.0.1/api/v1/namespaces/dev-0-flink-jobs/configmaps. > Message: configmaps "data-agg-events-insertion-cluster-config-map" already > exists. > > Log file is enclosed. > > Thanks, > Tamir. > > ------------------------------ > *From:* Yang Wang <danrtsey...@gmail.com> > *Sent:* Monday, September 5, 2022 3:03 PM > *To:* Tamir Sagi <tamir.s...@niceactimize.com> > *Cc:* user@flink.apache.org <user@flink.apache.org>; Lihi Peretz < > lihi.per...@niceactimize.com> > *Subject:* Re: [Flink 1.15.1 - Application mode native k8s Exception] - > Exception occurred while acquiring lock 'ConfigMapLock > > > *EXTERNAL EMAIL* > > > Could you please check whether the "kubernetes.config.file" is configured > to /opt/flink/.kube/config in the Flink configmap? > It should be removed before creating the Flink configmap. > > Best, > Yang > > Tamir Sagi <tamir.s...@niceactimize.com> 于2022年9月4日周日 18:08写道: > > Hey All, > > We recently updated to Flink 1.15.1. We deploy stream cluster in > Application mode in Native K8S.(Deployed on Amazon EKS). The cluster is > configured with Kubernetes HA Service, Minimum 3 replicas of Job manager > and pod-template which is configured with topologySpreadConstraints to > enable distribution across different availability zones. > HA storage directory is on S3. > > The cluster is deployed and running properly, however, after a while we > noticed the following exception in Job manager instance(the log file is > enclosed) > > 2022-09-04T02:05:33,097][Error] {} [i.f.k.c.e.l.LeaderElector]: Exception > occurred while acquiring lock 'ConfigMapLock: dev-0-flink-jobs - > data-agg-events-insertion-cluster-config-map > (b6da2ae2-ad2b-471c-801e-ea460a348fab)' > io.fabric8.kubernetes.client.KubernetesClientException: Operation: [get] > for kind: [ConfigMap] with name: > [data-agg-events-insertion-cluster-config-map] in namespace: > [dev-0-flink-jobs] failed. > Caused by: java.io.FileNotFoundException: /opt/flink/.kube/config (No such > file or directory) > at java.io.FileInputStream.open0(Native Method) ~[?:?] > at java.io.FileInputStream.open(Unknown Source) ~[?:?] > at java.io.FileInputStream.<init>(Unknown Source) ~[?:?] > at > org.apache.flink.kubernetes.shaded.com.fasterxml.jackson.dataformat.yaml.YAMLFactory.createParser(YAMLFactory.java:354) > ~[flink-dist-1.15.1.jar:1.15.1] > at > org.apache.flink.kubernetes.shaded.com.fasterxml.jackson.dataformat.yaml.YAMLFactory.createParser(YAMLFactory.java:15) > ~[flink-dist-1.15.1.jar:1.15.1] > at > org.apache.flink.kubernetes.shaded.com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3494) > ~[flink-dist-1.15.1.jar:1.15.1] > at > io.fabric8.kubernetes.client.internal.KubeConfigUtils.parseConfig(KubeConfigUtils.java:42) > ~[flink-dist-1.15.1.jar:1.15.1] > at > io.fabric8.kubernetes.client.utils.TokenRefreshInterceptor.intercept(TokenRefreshInterceptor.java:44) > ~[flink-dist-1.15.1.jar:1.15.1] > at > org.apache.flink.kubernetes.shaded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:142) > ~[flink-dist-1.15.1.jar:1.15.1] > at > org.apache.flink.kubernetes.shaded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:117) > ~[flink-dist-1.15.1.jar:1.15.1] > at > io.fabric8.kubernetes.client.utils.ImpersonatorInterceptor.intercept(ImpersonatorInterceptor.java:68) > ~[flink-dist-1.15.1.jar:1.15.1] > at > org.apache.flink.kubernetes.shaded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:142) > ~[flink-dist-1.15.1.jar:1.15.1] > at > org.apache.flink.kubernetes.shaded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:117) > ~[flink-dist-1.15.1.jar:1.15.1] > at > io.fabric8.kubernetes.client.utils.HttpClientUtils.lambda$createApplicableInterceptors$6(HttpClientUtils.java:290) > ~[flink-dist-1.15.1.jar:1.15.1] > at > org.apache.flink.kubernetes.shaded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:142) > ~[flink-dist-1.15.1.jar:1.15.1] > at > org.apache.flink.kubernetes.shaded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:117) > ~[flink-dist-1.15.1.jar:1.15.1] > at > org.apache.flink.kubernetes.shaded.okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:229) > ~[flink-dist-1.15.1.jar:1.15.1] > at > org.apache.flink.kubernetes.shaded.okhttp3.RealCall.execute(RealCall.java:81) > ~[flink-dist-1.15.1.jar:1.15.1] > at > io.fabric8.kubernetes.client.dsl.base.OperationSupport.retryWithExponentialBackoff(OperationSupport.java:585) > ~[flink-dist-1.15.1.jar:1.15.1] > at > io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:558) > ~[flink-dist-1.15.1.jar:1.15.1] > at > io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:521) > ~[flink-dist-1.15.1.jar:1.15.1] > at > io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:488) > ~[flink-dist-1.15.1.jar:1.15.1] > at > io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:470) > ~[flink-dist-1.15.1.jar:1.15.1] > at > io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleGet(BaseOperation.java:830) > ~[flink-dist-1.15.1.jar:1.15.1] > at > io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:200) > ~[flink-dist-1.15.1.jar:1.15.1] > ... 12 more > > Why is Kube/config needed in Native K8s, should not service account be > checked instead? > > Are we missing something? > > Thanks, > Tamir. > > > Confidentiality: This communication and any attachments are intended for > the above-named persons only and may be confidential and/or legally > privileged. Any opinions expressed in this communication are not > necessarily those of NICE Actimize. If this communication has come to you > in error you must take no action based on it, nor must you copy or show it > to anyone; please delete/destroy and inform the sender by e-mail > immediately. > Monitoring: NICE Actimize may monitor incoming and outgoing e-mails. > Viruses: Although we have taken steps toward ensuring that this e-mail and > attachments are free from any virus, we advise that in keeping with good > computing practice the recipient should ensure they are actually virus free. > > > Confidentiality: This communication and any attachments are intended for > the above-named persons only and may be confidential and/or legally > privileged. Any opinions expressed in this communication are not > necessarily those of NICE Actimize. If this communication has come to you > in error you must take no action based on it, nor must you copy or show it > to anyone; please delete/destroy and inform the sender by e-mail > immediately. > Monitoring: NICE Actimize may monitor incoming and outgoing e-mails. > Viruses: Although we have taken steps toward ensuring that this e-mail and > attachments are free from any virus, we advise that in keeping with good > computing practice the recipient should ensure they are actually virus free. > > > Confidentiality: This communication and any attachments are intended for > the above-named persons only and may be confidential and/or legally > privileged. Any opinions expressed in this communication are not > necessarily those of NICE Actimize. If this communication has come to you > in error you must take no action based on it, nor must you copy or show it > to anyone; please delete/destroy and inform the sender by e-mail > immediately. > Monitoring: NICE Actimize may monitor incoming and outgoing e-mails. > Viruses: Although we have taken steps toward ensuring that this e-mail and > attachments are free from any virus, we advise that in keeping with good > computing practice the recipient should ensure they are actually virus free. >