Actually, we figured it out. We need to configure High Availability (HA) mode
so that jobs are recovered when the job manager is redeployed on Kubernetes.
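
For reference, a minimal sketch of what enabling Kubernetes HA services can
look like in flink-conf.yaml on Flink 1.13. The cluster ID and storage
directory below are placeholder assumptions, not values from this thread, and
the job manager's service account would also need permission to create and
edit ConfigMaps.

```yaml
# Placeholder cluster ID (assumption); must stay the same across redeployments
kubernetes.cluster-id: my-flink-cluster
# Use the Kubernetes-based HA services shipped with Flink
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
# Placeholder durable storage path (assumption) for job manager recovery metadata
high-availability.storageDir: s3://my-bucket/flink/recovery
```

With HA enabled, a restarted job manager can recover the jobs that were
running from the metadata kept in the storage directory, instead of starting
with no knowledge of them.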

On Tue, Oct 5, 2021 at 11:39 AM Sharon Xie <sharon.xie...@gmail.com> wrote:

> Hi,
>
> I'm currently running Flink 1.13.2 in Kubernetes session mode (native
> Kubernetes). When I update the job manager deployment through `kubectl apply
> -f flink-jobmanager-deployment.yaml`, a new job manager pod is created. I'd
> expect all the task manager pods to re-register with the new JM pod.
> However, the new JM pod rejected all the existing task managers that were
> running before the update. It looks like the new JM deployment does not
> recognize the existing TM pods. Is this expected? If so, how can I
> configure the deployment to recover the existing TMs?
>
>
> Thanks,
> Sharon
>
> JM logs:
>
> 2021-10-05 18:00:53,011 INFO  
> org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager
> [] - Registering TaskManager with ResourceID
> XXXXX-flink-cluster-local-taskmanager-1-1 (akka.tcp://
> flink@10.244.0.191:6122/user/rpc/taskmanager_0) at ResourceManager
>
> 2021-10-05 18:00:53,033 INFO  
> org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager
> [] - Registering TaskManager with ResourceID
> XXXXX-flink-cluster-local-taskmanager-1-1 (akka.tcp://
> flink@10.244.0.191:6122/user/rpc/taskmanager_0) at ResourceManager
>
> 2021-10-05 18:00:53,046 INFO  
> org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager
> [] - Worker XXXXXX-flink-cluster-local-taskmanager-1-1 is registered.
>
> 2021-10-05 18:01:45,835 INFO  
> org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager
> [] - Stopping worker XXXXX-flink-cluster-local-taskmanager-1-1.
>
> 2021-10-05 18:01:45,835 INFO
> org.apache.flink.kubernetes.KubernetesResourceManagerDriver  [] -
> Stopping TaskManager pod XXXXXX-flink-cluster-local-taskmanager-1-1.
>
> 2021-10-05 18:01:45,837 INFO  
> org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager
> [] - Closing TaskExecutor
> connection XXXXXX-flink-cluster-local-taskmanager-1-1 because: TaskExecutor
> exceeded the idle timeout.
>
> 2021-10-05 18:01:45,877 WARN  
> org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager
> [] - Discard registration from TaskExecutor
> XXXXX-flink-cluster-local-taskmanager-1-1 at (akka.tcp://
> flink@10.244.0.191:6122/user/rpc/taskmanager_0) because the framework did
> not recognize it
>
>
>
> TM logs:
>
> 2021-10-05 18:01:45,843 INFO  
> org.apache.flink.runtime.taskexecutor.TaskExecutor
>           [] - Close ResourceManager connection
> 9f664a154b1924918b46d41016324a74.
>
> 2021-10-05 18:01:45,844 INFO  
> org.apache.flink.runtime.taskexecutor.TaskExecutor
>           [] - Connecting to ResourceManager
> akka.tcp://flink@XXXXX-flink-cluster-service
> :6123/user/rpc/resourcemanager_*(00000000000000000000000000000000).
>
> 2021-10-05 18:01:45,856 INFO  
> org.apache.flink.runtime.taskexecutor.TaskExecutor
>           [] - Resolved ResourceManager address, beginning registration
>
> 2021-10-05 18:01:45,883 ERROR
> org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Fatal
> error occurred in TaskExecutor akka.tcp://
> flink@10.244.0.191:6122/user/rpc/taskmanager_0.
>
> org.apache.flink.util.FlinkException: The TaskExecutor's registration at
> the ResourceManager 
> akka.tcp://flink@XXXXX-flink-cluster-service:6123/user/rpc/resourcemanager_*
> has been rejected: Rejected TaskExecutor registration at the ResourceManger
> because: The ResourceManager does not recognize this TaskExecutor.
>
>
>
