Actually, we figured it out. We need to configure High Availability (HA) mode so that running jobs are recovered when the job manager is redeployed on Kubernetes.
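
For anyone who hits the same problem, here is a minimal sketch of the relevant flink-conf.yaml entries. It assumes the Kubernetes HA services that ship with Flink 1.13. The cluster ID and storage path are placeholders, not values from our deployment:

```yaml
# Sketch only: placeholder values, adjust to your environment.
# Must stay the same across JM redeployments so the new JM finds the HA data.
kubernetes.cluster-id: my-flink-cluster          # hypothetical cluster ID
# Kubernetes-based HA services (leader election and metadata in ConfigMaps).
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
# Durable storage for job graphs and checkpoint metadata; any supported
# filesystem should work, the bucket name here is made up.
high-availability.storageDir: s3://my-bucket/flink/recovery
```

As far as I understand, the service account used by the job manager also needs permission to create, edit, and delete ConfigMaps for this to work.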

On Tue, Oct 5, 2021 at 11:39 AM Sharon Xie <sharon.xie...@gmail.com> wrote:

> Hi,
>
> I'm currently running Flink 1.13.2 using kubernetes session mode - native
> kubernetes. When I update the job manager deployment through `kubectl apply
> flink-jobmanager-deployment.yaml`, a new job manager pod is created. I'd
> expect all the task manager pods will re-register with the new JM pod.
> However the new JM pod rejected all the existing task managers that were
> running before the update. It looks like the new JM deployment does not
> recognize the existing TM pods. Is this expected? If so, how can I
> configure the deployment to recover the existing TMs?
>
>
> Thanks,
> Sharon
>
> JM logs:
>
> 2021-10-05 18:00:53,011 INFO
> org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager
> [] - Registering TaskManager with ResourceID
> XXXXX-flink-cluster-local-taskmanager-1-1 (akka.tcp://
> flink@10.244.0.191:6122/user/rpc/taskmanager_0) at ResourceManager
>
> 2021-10-05 18:00:53,033 INFO
> org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager
> [] - Registering TaskManager with ResourceID
> XXXXX-flink-cluster-local-taskmanager-1-1 (akka.tcp://
> flink@10.244.0.191:6122/user/rpc/taskmanager_0) at ResourceManager
>
> 2021-10-05 18:00:53,046 INFO
> org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager
> [] - Worker XXXXXX-flink-cluster-local-taskmanager-1-1 is registered.
>
> 2021-10-05 18:01:45,835 INFO
> org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager
> [] - Stopping worker XXXXX-flink-cluster-local-taskmanager-1-1.
>
> 2021-10-05 18:01:45,835 INFO
> org.apache.flink.kubernetes.KubernetesResourceManagerDriver [] -
> Stopping TaskManager pod XXXXXX-flink-cluster-local-taskmanager-1-1.
>
> 2021-10-05 18:01:45,837 INFO
> org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager
> [] - Closing TaskExecutor
> connection XXXXXX-flink-cluster-local-taskmanager-1-1 because: TaskExecutor
> exceeded the idle timeout.
>
> 2021-10-05 18:01:45,877 WARN
> org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager
> [] - Discard registration from TaskExecutor
> XXXXX-flink-cluster-local-taskmanager-1-1 at (akka.tcp://
> flink@10.244.0.191:6122/user/rpc/taskmanager_0) because the framework did
> not recognize it
>
>
>
> TM logs:
>
> 2021-10-05 18:01:45,843 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor
> [] - Close ResourceManager connection
> 9f664a154b1924918b46d41016324a74.
>
> 2021-10-05 18:01:45,844 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor
> [] - Connecting to ResourceManager
> akka.tcp://flink@XXXXX-flink-cluster-service
> :6123/user/rpc/resourcemanager_*(00000000000000000000000000000000).
>
> 2021-10-05 18:01:45,856 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor
> [] - Resolved ResourceManager address, beginning registration
>
> 2021-10-05 18:01:45,883 ERROR
> org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Fatal
> error occurred in TaskExecutor akka.tcp://
> flink@10.244.0.191:6122/user/rpc/taskmanager_0.
>
> org.apache.flink.util.FlinkException: The TaskExecutor's registration at
> the ResourceManager
> akka.tcp://flink@XXXXX-flink-cluster-service:6123/user/rpc/resourcemanager_*
> has been rejected: Rejected TaskExecutor registration at the ResourceManger
> because: The ResourceManager does not recognize this TaskExecutor.