Hi Yang, Roman,

Thanks for the information, and sorry for the late reply.

It seems the Kubernetes node restarted during the Flink finalization stage. I think that is the root cause.

Regards,
Oscar

On Wed, Oct 27, 2021 at 4:20 PM Yang Wang <danrtsey...@gmail.com> wrote:

> Hi,
>
> I think Roman is right. It seems that the JobManager is relaunched by K8s
> after Flink has already deregistered the application (i.e. deleted the
> JobManager K8s deployment).
>
> One possible reason is that the kubelet learns too late that the JobManager
> deployment has been deleted. So it relaunches the JobManager pod when it
> terminates with exit code 0.
>
> Best,
> Yang
>
> Roman Khachatryan <ro...@apache.org> wrote on Tue, Oct 26, 2021 at 6:17 PM:
>
>> Thanks for sharing this,
>> The sequence of events in the log seems strange to me:
>>
>> 2021-10-17 03:05:55,801 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Close ResourceManager connection c1092812cfb2853a5576ffd78e346189: Stopping JobMaster for job 'rt-match_12.4.5_8d48b21a' (00000000000000000000000000000000).
>> 2021-10-17 03:05:59,382 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Starting KubernetesApplicationClusterEntrypoint (Version: 1.14.0, Scala: 2.12, Rev:460b386, Date:2021-09-22T08:39:40+02:00)
>> 2021-10-17 03:06:00,251 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
>> 2021-10-17 03:06:04,355 ERROR io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector [] - Exception occurred while acquiring lock 'ConfigMapLock: flink-ns - match-70958037-f414-4925-9d60-19e90d12abc0-restserver-leader (ef5c2463-2d66-4dce-a023-4b8a50d7acff)'
>> io.fabric8.kubernetes.client.extended.leaderelection.resourcelock.LockException: Unable to create ConfigMapLock
>> Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Operation: [create] for kind: [ConfigMap] with name: [match-70958037-f414-4925-9d60-19e90d12abc0-restserver-leader] in namespace: [flink-ns] failed.
>> Caused by: java.io.InterruptedIOException
>>
>> It looks like KubernetesApplicationClusterEntrypoint is restarted in the
>> middle of shutdown and, as a result, the resources it (re)creates aren't
>> cleaned up.
>>
>> Could you please also share the Kubernetes logs and resource definitions to
>> validate the above assumption?
>>
>> Regards,
>> Roman
>>
>> On Mon, Oct 25, 2021 at 6:15 AM Hua Wei Chen <oscar.chen....@gmail.com> wrote:
>> >
>> > Hi all,
>> >
>> > We have Flink jobs running in batch mode and get the job status via
>> > JobHandler.onJobExecuted()[1].
>> >
>> > Based on the thread [2], we expected the ConfigMaps to be cleaned up
>> > after successful execution.
>> >
>> > But we found that some ConfigMaps are not cleaned up after the job
>> > succeeds, although their contents and labels are removed.
>> >
>> > Here is one of the ConfigMaps.
>> >
>> > ```
>> > apiVersion: v1
>> > kind: ConfigMap
>> > metadata:
>> >   name: match-6370b6ab-de17-4c93-940e-0ce06d05a7b8-resourcemanager-leader
>> >   namespace: app-flink
>> >   selfLink: >-
>> >     /api/v1/namespaces/app-flink/configmaps/match-6370b6ab-de17-4c93-940e-0ce06d05a7b8-resourcemanager-leader
>> >   uid: 80c79c87-d6e2-4641-b13f-338c3d3154b0
>> >   resourceVersion: '578806788'
>> >   creationTimestamp: '2021-10-21T17:06:48Z'
>> >   annotations:
>> >     control-plane.alpha.kubernetes.io/leader: >-
>> >       {"holderIdentity":"3da40a4a-0346-49e5-8d18-b04a68239bf3","leaseDuration":15.000000000,"acquireTime":"2021-10-21T17:06:48.092264Z","renewTime":"2021-10-21T17:06:48.092264Z","leaderTransitions":0}
>> >   managedFields:
>> >     - manager: okhttp
>> >       operation: Update
>> >       apiVersion: v1
>> >       time: '2021-10-21T17:06:48Z'
>> >       fieldsType: FieldsV1
>> >       fieldsV1:
>> >         'f:metadata':
>> >           'f:annotations':
>> >             .: {}
>> >             'f:control-plane.alpha.kubernetes.io/leader': {}
>> > data: {}
>> > ```
>> >
>> > Our Flink apps run on version 1.14.0.
>> > Thanks!
>> >
>> > BR,
>> > Oscar
>> >
>> > Reference:
>> > [1] JobListener (Flink : 1.15-SNAPSHOT API) (apache.org)
>> > [2] https://lists.apache.org/list.html?user@flink.apache.org:lte=1M:High%20availability%20data%20clean%20up%20
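
For anyone checking a namespace for ConfigMaps left behind in this way, here is a minimal Python sketch. It is not Flink's own cleanup logic. It assumes, based only on the example ConfigMap in this thread, that a leftover leader ConfigMap has a name ending in `-leader`, no labels, and an empty `data` section. The function names `looks_like_leftover` and `leftover_names` are hypothetical. It works on already-parsed ConfigMap objects, for example the `items` list from `kubectl get configmaps -n app-flink -o json`.

```python
from typing import Any, Iterable, List, Mapping

# Suffix taken from the ConfigMap names quoted in this thread (assumption:
# other deployments or Flink versions may name HA ConfigMaps differently).
LEADER_SUFFIX = "-leader"


def looks_like_leftover(configmap: Mapping[str, Any]) -> bool:
    """Heuristic only: True if the ConfigMap has a leader-style name,
    no labels, and no data, like the example posted in the thread."""
    metadata = configmap.get("metadata") or {}
    name = metadata.get("name", "")
    has_labels = bool(metadata.get("labels"))
    has_data = bool(configmap.get("data"))
    return name.endswith(LEADER_SUFFIX) and not has_labels and not has_data


def leftover_names(configmaps: Iterable[Mapping[str, Any]]) -> List[str]:
    """Return the names of all ConfigMaps matching the heuristic."""
    return [cm["metadata"]["name"] for cm in configmaps if looks_like_leftover(cm)]
```

Any match is only a candidate: a ConfigMap that belongs to a cluster still starting up could look the same, so check that the owning job has finished before deleting anything.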