This issue had to do with the update strategy for the Flink deployment. When I changed it to the following, it worked:

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 0
    maxUnavailable: 1

On Tue, Nov 3, 2020 at 1:39 PM Robert Metzger <rmetz...@apache.org> wrote:

> Thanks a lot for providing the logs.
>
> My theory of what is happening is the following:
> 1. You are probably increasing the memory for the JobManager, when
> changing the jobmanager.memory.flink.size configuration value
> 2. Due to this changed memory configuration, Kubernetes, Docker or the
> Linux kernel are killing your JobManager process because it allocates too
> much memory.
>
> Flink should not stop like this. Fatal errors are logged explicitly, kill
> signals are also logged.
> Can you check Kubernetes, Docker, Linux for any signs that they are
> killing your JobManager?
>
> On Tue, Nov 3, 2020 at 7:06 PM Claude M <claudemur...@gmail.com> wrote:
>
>> Thanks for your reply Robert. Please see attached log from the job
>> manager, the last line is the only thing I see different from a pod that
>> starts up successfully.
>>
>> On Tue, Nov 3, 2020 at 10:41 AM Robert Metzger <rmetz...@apache.org>
>> wrote:
>>
>>> Hi Claude,
>>>
>>> I agree that you should be able to restart individual pods with a
>>> changed memory configuration. Can you share the full Jobmanager log of the
>>> failed restart attempt?
>>>
>>> I don't think that the log statement you've posted explains a start
>>> failure.
>>>
>>> Regards,
>>> Robert
>>>
>>> On Tue, Nov 3, 2020 at 2:33 AM Claude M <claudemur...@gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> I have Flink 1.10.2 installed in a Kubernetes cluster.
>>>> Anytime I make a change to the flink.conf, the Flink jobmanager pod
>>>> fails to restart.
>>>> For example, I modified the following memory setting in the flink.conf:
>>>> jobmanager.memory.flink.size.
>>>> After I deploy the change, the pod fails to restart and the following
>>>> is seen in the log:
>>>>
>>>> WARN
>>>> org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever -
>>>> Error while retrieving the leader gateway. Retrying to connect to
>>>> akka.tcp://flink@flink-jobmanager:50010/user/dispatcher.
>>>>
>>>> The pod can be restored by doing one of the following but these are not
>>>> acceptable solutions:
>>>>
>>>> - Revert the changes made to the flink.conf to the previous settings
>>>> - Remove the Flink Kubernetes deployment before doing a deployment
>>>> - Delete the flink cluster folder in Zookeeper
>>>>
>>>> I don't understand why making any changes in the flink.conf causes this
>>>> problem.
>>>> Any help is appreciated.
>>>>
>>>> Thank You