The issue turned out to be the update strategy of the Flink deployment.
After I changed it to the following, the JobManager pod restarts correctly:

  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
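
For reference, here is a minimal sketch of where this block sits in a
Kubernetes Deployment manifest. The metadata name, labels, container name,
and image tag are placeholders assumed for illustration; they are not taken
from the actual deployment discussed in this thread.

  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: flink-jobmanager          # assumed name
  spec:
    replicas: 1
    strategy:
      type: RollingUpdate
      rollingUpdate:
        maxSurge: 0                 # never create a pod beyond the replica count
        maxUnavailable: 1           # allow the one existing pod to be stopped first
    selector:
      matchLabels:
        app: flink                  # assumed label
        component: jobmanager       # assumed label
    template:
      metadata:
        labels:
          app: flink
          component: jobmanager
      spec:
        containers:
          - name: jobmanager        # assumed container name
            image: flink:1.10.2     # placeholder image tag
            args: ["jobmanager"]

With a single replica, maxSurge: 0 and maxUnavailable: 1 make Kubernetes
terminate the old JobManager pod before it creates the replacement, so the
old and new pods never run side by side during a rollout.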

On Tue, Nov 3, 2020 at 1:39 PM Robert Metzger <rmetz...@apache.org> wrote:

> Thanks a lot for providing the logs.
>
> My theory of what is happening is the following:
> 1. You are probably increasing the memory for the JobManager when
> changing the jobmanager.memory.flink.size configuration value.
> 2. Due to this changed memory configuration, Kubernetes, Docker or the
> Linux kernel is killing your JobManager process because it allocates too
> much memory.
>
> Flink should not stop like this. Fatal errors are logged explicitly, kill
> signals are also logged.
> Can you check Kubernetes, Docker, Linux for any signs that they are
> killing your JobManager?
>
>
>
> On Tue, Nov 3, 2020 at 7:06 PM Claude M <claudemur...@gmail.com> wrote:
>
>> Thanks for your reply, Robert. Please see the attached log from the job
>> manager; the last line is the only thing I see that differs from a pod
>> that starts up successfully.
>>
>> On Tue, Nov 3, 2020 at 10:41 AM Robert Metzger <rmetz...@apache.org>
>> wrote:
>>
>>> Hi Claude,
>>>
>>> I agree that you should be able to restart individual pods with a
>>> changed memory configuration. Can you share the full Jobmanager log of the
>>> failed restart attempt?
>>>
>>> I don't think that the log statement you've posted explains a start
>>> failure.
>>>
>>> Regards,
>>> Robert
>>>
>>> On Tue, Nov 3, 2020 at 2:33 AM Claude M <claudemur...@gmail.com> wrote:
>>>
>>>>
>>>> Hello,
>>>>
>>>> I have Flink 1.10.2 installed in a Kubernetes cluster.
>>>> Anytime I make a change to the flink.conf, the Flink jobmanager pod
>>>> fails to restart.
>>>> For example, I modified the following memory setting in the flink.conf:
>>>> jobmanager.memory.flink.size.
>>>> After I deploy the change, the pod fails to restart and the following
>>>> is seen in the log:
>>>>
>>>> WARN
>>>>  org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever  -
>>>> Error while retrieving the leader gateway. Retrying to connect to
>>>> akka.tcp://flink@flink-jobmanager:50010/user/dispatcher.
>>>>
>>>> The pod can be restored by doing one of the following but these are not
>>>> acceptable solutions:
>>>>
>>>>    - Revert the changes made to the flink.conf to the previous settings
>>>>    - Remove the Flink Kubernetes deployment before doing a deployment
>>>>    - Delete the flink cluster folder in Zookeeper
>>>>
>>>> I don't understand why making any changes in the flink.conf causes this
>>>> problem.
>>>> Any help is appreciated.
>>>>
>>>>
>>>> Thank You
>>>>
>>>
