Hi Yang, Roman,

Thanks for the information, and sorry for the late reply. It looks like the
Kubernetes node restarted during the Flink finalization stage;
I think that is the root cause.
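
In the meantime we clean the leftovers up by hand. A rough sketch of the
cleanup is below; the cluster id and namespace are copied from the ConfigMap
in my earlier mail, the leader suffixes are the ones seen in this thread, and
the script only prints the kubectl commands instead of running them:

```shell
#!/bin/sh
# Sketch: delete the leftover leader ConfigMaps by name.
# The usual label-based cleanup (something like
#   kubectl delete configmap -l app=<cluster-id>,configmap-type=high-availability
# if I remember the selector correctly) does not help here, because the
# labels on the leftover ConfigMaps were already removed, so we fall back
# to the ConfigMap names seen in this thread.
CLUSTER_ID="match-6370b6ab-de17-4c93-940e-0ce06d05a7b8"  # from my earlier mail
NAMESPACE="app-flink"
CMDS=""
for suffix in resourcemanager-leader restserver-leader; do
  CMD="kubectl -n ${NAMESPACE} delete configmap ${CLUSTER_ID}-${suffix}"
  CMDS="${CMDS}${CMD}
"
  # Print instead of executing, so the commands can be reviewed first.
  echo "${CMD}"
done
```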

Regards,
Oscar

On Wed, Oct 27, 2021 at 4:20 PM Yang Wang <danrtsey...@gmail.com> wrote:

> Hi,
>
> I think Roman is right. It seems that the JobManager is relaunched
> by K8s after Flink has
> already deregistered the application (i.e., deleted the JobManager K8s
> deployment).
>
> One possible reason might be that the kubelet learns too late that the
> JobManager deployment has been deleted,
> so it relaunches the JobManager pod when it terminates with exit code 0.
>
>
> Best,
> Yang
>
>
> Roman Khachatryan <ro...@apache.org> 于2021年10月26日周二 下午6:17写道:
>
>> Thanks for sharing this,
>> The sequence of events in the log seems strange to me:
>>
>> 2021-10-17 03:05:55,801 INFO
>> org.apache.flink.runtime.jobmaster.JobMaster                 [] -
>> Close ResourceManager connection c1092812cfb2853a5576ffd78e346189:
>> Stopping JobMaster for job 'rt-match_12.4.5_8d48b21a'
>> (00000000000000000000000000000000).
>> 2021-10-17 03:05:59,382 INFO
>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] -
>> Starting KubernetesApplicationClusterEntrypoint (Version: 1.14.0,
>> Scala: 2.12, Rev:460b386, Date:2021-09-22T08:39:40+02:00)
>> 2021-10-17 03:06:00,251 INFO
>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] -
>> RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
>> 2021-10-17 03:06:04,355 ERROR
>> io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector []
>> - Exception occurred while acquiring lock 'ConfigMapLock: flink-ns -
>> match-70958037-f414-4925-9d60-19e90d12abc0-restserver-leader
>> (ef5c2463-2d66-4dce-a023-4b8a50d7acff)'
>>
>> io.fabric8.kubernetes.client.extended.leaderelection.resourcelock.LockException:
>> Unable to create ConfigMapLock
>> Caused by: io.fabric8.kubernetes.client.KubernetesClientException:
>> Operation: [create]  for kind: [ConfigMap]  with name:
>> [match-70958037-f414-4925-9d60-19e90d12abc0-restserver-leader]  in
>> namespace: [flink-ns]  failed.
>> Caused by: java.io.InterruptedIOException
>>
>> It looks like KubernetesApplicationClusterEntrypoint is restarted in
>> the middle of shutdown and, as a result, the resources it (re)creates
>> aren't cleaned up.
>>
>> Could you please also share Kubernetes logs and resource definitions
>> to validate the above assumption?
>>
>> Regards,
>> Roman
>>
>> On Mon, Oct 25, 2021 at 6:15 AM Hua Wei Chen <oscar.chen....@gmail.com>
>> wrote:
>> >
>> > Hi all,
>> >
>> > We have Flink jobs that run in batch mode, and we get the job status
>> > via JobListener.onJobExecuted()[1].
>> >
>> > Based on the thread[2], we expected the ConfigMaps to be cleaned up
>> > after a successful execution.
>> >
>> > But we found that some ConfigMaps are not cleaned up after the job
>> > succeeds, even though their contents and labels are removed.
>> >
>> > Here is one of the Configmaps.
>> >
>> > ```
>> > apiVersion: v1
>> > kind: ConfigMap
>> > metadata:
>> >   name:
>> match-6370b6ab-de17-4c93-940e-0ce06d05a7b8-resourcemanager-leader
>> >   namespace: app-flink
>> >   selfLink: >-
>> >
>>  
>> /api/v1/namespaces/app-flink/configmaps/match-6370b6ab-de17-4c93-940e-0ce06d05a7b8-resourcemanager-leader
>> >   uid: 80c79c87-d6e2-4641-b13f-338c3d3154b0
>> >   resourceVersion: '578806788'
>> >   creationTimestamp: '2021-10-21T17:06:48Z'
>> >   annotations:
>> >     control-plane.alpha.kubernetes.io/leader: >-
>> >
>>  
>> {"holderIdentity":"3da40a4a-0346-49e5-8d18-b04a68239bf3","leaseDuration":15.000000000,"acquireTime":"2021-10-21T17:06:48.092264Z","renewTime":"2021-10-21T17:06:48.092264Z","leaderTransitions":0}
>> >   managedFields:
>> >     - manager: okhttp
>> >       operation: Update
>> >       apiVersion: v1
>> >       time: '2021-10-21T17:06:48Z'
>> >       fieldsType: FieldsV1
>> >       fieldsV1:
>> >         'f:metadata':
>> >           'f:annotations':
>> >             .: {}
>> >             'f:control-plane.alpha.kubernetes.io/leader': {}
>> > data: {}
>> > ```
>> >
>> >
>> > Our Flink apps run on ver. 1.14.0.
>> > Thanks!
>> >
>> > BR,
>> > Oscar
>> >
>> >
>> > Reference:
>> > [1] JobListener (Flink : 1.15-SNAPSHOT API) (apache.org)
>> > [2]
>> https://lists.apache.org/list.html?user@flink.apache.org:lte=1M:High%20availability%20data%20clean%20up%20
>> >
>>
>
