Great to hear that you've resolved the problem, and thanks for sharing the solution. It will help others who run into something similar.
Cheers,
Till

On Wed, Aug 22, 2018, 16:14 Bruno Aranda <bara...@apache.org> wrote:

> Actually, I have found the issue. It was a simple thing, really, once
> you know it, of course.
>
> It was caused by the livenessProbe kicking in too early. For a Flink
> cluster with several jobs, the default of 30 seconds that I had taken
> from the Flink Helm chart in the examples was not enough to let the
> job manager fully recover and start. Increasing it fixes the issue.
>
> I ended up with a job manager with 4000Gi as the limit, 3000Gi
> requested, and configured to use 2048Gb. So I guess the memory was a
> red herring for me.
>
> I managed to see what was going on by using the kubectl "describe"
> action, where the problem was clearly indicated as an event.
>
> Thanks Vino and Till for your time!
>
> Bruno
>
> On Tue, 21 Aug 2018 at 10:21 Till Rohrmann <trohrm...@apache.org> wrote:
>
>> Hi Bruno,
>>
>> In order to debug this problem we would need a bit more information.
>> In particular, the logs of the cluster entrypoint and your K8s
>> deployment specification would be helpful. If you have memory limits
>> specified, those would also be interesting to know.
>>
>> Cheers,
>> Till
>>
>> On Sun, Aug 19, 2018 at 2:43 PM vino yang <yanghua1...@gmail.com> wrote:
>>
>>> Hi Bruno,
>>>
>>> Pinging Till for you; he may be able to give you some useful
>>> information.
>>>
>>> Thanks,
>>> Vino
>>>
>>> On Sun, 19 Aug 2018 at 06:57, Bruno Aranda <bara...@apache.org> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am experiencing an issue when a job manager tries to recover
>>>> using an HA setup. When the job manager starts again and tries to
>>>> resume from the last checkpoints, it gets killed by Kubernetes (I
>>>> guess), since I can see the following in the logs while the jobs
>>>> are being deployed:
>>>>
>>>> INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint -
>>>> RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
>>>>
>>>> I am requesting enough memory for it, 3000Gi, and it is configured
>>>> to use 2048Gb of memory. I have tried increasing the max perm size,
>>>> but did not see an improvement.
>>>>
>>>> Any suggestions to help diagnose this?
>>>>
>>>> I have the following:
>>>>
>>>> Flink 1.6.0 (same with 1.5.1)
>>>> Azure AKS with Kubernetes 1.11
>>>> State management using RocksDB with checkpoints stored in Azure
>>>> Data Lake
>>>>
>>>> Thanks!
>>>>
>>>> Bruno
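
A minimal sketch of the fix Bruno describes, assuming a JobManager
Deployment along the lines of the Flink Helm chart example. The
container name, image, port, and all values below are illustrative
placeholders rather than the exact manifest from this thread:

    # Hypothetical excerpt from a Flink JobManager pod spec. All names
    # and values are placeholders; the key point is raising the probe's
    # initialDelaySeconds well above the example chart's default of 30
    # so that Kubernetes does not SIGTERM a JobManager that is still
    # recovering its jobs from checkpoints in an HA setup.
    containers:
      - name: jobmanager
        image: flink:1.6.0
        ports:
          - containerPort: 6123       # JobManager RPC port
        resources:
          requests:
            memory: "3Gi"             # placeholder; size to your jobs
          limits:
            memory: "4Gi"
        livenessProbe:
          tcpSocket:
            port: 6123
          initialDelaySeconds: 120    # give HA recovery time to finish
          periodSeconds: 15

If the probe still fires too early, kubectl describe pod
<jobmanager-pod> is, as Bruno notes, the quickest way to confirm it:
the failed probe and the resulting container kill both show up as
events on the pod.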