Great to hear that you've resolved the problem and thanks for sharing the
solution. This will help others who might run into a similar problem.
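
For anyone finding this thread later, here is a rough sketch of the kind of
change Bruno describes below: giving the job manager's livenessProbe a longer
initial delay so that recovery can finish before Kubernetes restarts the
container. The container name, probe type, port and the 120-second value are
all assumptions for illustration. They are not taken from the Flink helm chart
or from Bruno's actual deployment, so adjust them to your own spec.

```yaml
# Hypothetical excerpt from a job manager Deployment spec (illustrative only).
containers:
  - name: jobmanager              # assumed container name
    livenessProbe:
      tcpSocket:
        port: 6123                # assumed probe target: Flink's default JobManager RPC port
      initialDelaySeconds: 120    # assumed value, raised from the 30 seconds mentioned in the thread
      periodSeconds: 10           # assumed; how often the probe runs after the initial delay
      failureThreshold: 3         # assumed; consecutive failures before the container is restarted
```

The right delay depends on how many jobs have to be recovered, so it is
probably best found by timing a real recovery rather than copying this number.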


On Wed, Aug 22, 2018, 16:14 Bruno Aranda <> wrote:

> Actually, I have found the issue. It was a simple thing, really, once you
> know it of course.
> It was caused by the livenessProbe kicking in too early. For a Flink
> cluster with several jobs, the default 30 seconds I was using (taken from
> the Flink helm chart in the examples) was not enough to let the job manager
> fully recover and start. Increasing that fixes the issue.
> I ended up with a job manager with 4000Gi as limit, 3000Gi requested, and
> configured to use 2048Gb. So I guess the memory was a red herring for me.
> I managed to see what was going on by using the kubectl "describe" action,
> where it was clearly indicated as an event.
> Thanks Vino and Till for your time!
> Bruno
> On Tue, 21 Aug 2018 at 10:21 Till Rohrmann <> wrote:
>> Hi Bruno,
>> in order to debug this problem we would need a bit more information. In
>> particular, the logs of the cluster entrypoint and your K8s deployment
>> specification would be helpful. If you have some memory limits specified
>> these would also be interesting to know.
>> Cheers,
>> Till
>> On Sun, Aug 19, 2018 at 2:43 PM vino yang <> wrote:
>>> Hi Bruno,
>>> Pinging Till for you; he may be able to give you some useful information.
>>> Thanks, vino.
>>> On Sun, Aug 19, 2018 at 6:57 AM Bruno Aranda <> wrote:
>>>> Hi,
>>>> I am experiencing an issue when a job manager is trying to recover
>>>> using a HA setup. When the job manager starts again and tries to resume
>>>> from the last checkpoints, it gets killed by Kubernetes (I guess), since I
>>>> can see the following in the logs while the jobs are deployed:
>>>> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
>>>> RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
>>>> I am requesting enough memory for it, 3000Gi, and it is configured to
>>>> use 2048Gb of memory. I have tried to increase the max perm size, but did
>>>> not see an improvement.
>>>> Any suggestions to help diagnose this?
>>>> I have the following:
>>>> Flink 1.6.0 (same with 1.5.1)
>>>> Azure AKS with Kubernetes 1.11
>>>> State management using RocksDB with checkpoints stored in Azure Data
>>>> Lake
>>>> Thanks!
>>>> Bruno
