Great to hear that you've resolved the problem, and thanks for sharing the solution. It will help others who run into something similar.
Cheers,
Till

On Wed, Aug 22, 2018, 16:14 Bruno Aranda <bara...@apache.org> wrote:

> Actually, I have found the issue. It was a simple thing, really, once
> you know it, of course.
>
> It was caused by the livenessProbe kicking in too early. For a Flink
> cluster with several jobs, the default of 30 seconds that I had taken
> from the Flink Helm chart in the examples was not enough to let the
> job manager fully recover and start. Increasing it fixes the issue.
>
> I ended up with a job manager with 4000Gi as the limit, 3000Gi
> requested, and configured to use 2048Gb. So I guess the memory was a
> red herring for me.
>
> I managed to see what was going on by using the kubectl "describe"
> action, where the problem was clearly indicated as an event.
>
> Thanks Vino and Till for your time!
>
> Bruno
>
> On Tue, 21 Aug 2018 at 10:21 Till Rohrmann <trohrm...@apache.org> wrote:
>
>> Hi Bruno,
>>
>> In order to debug this problem we would need a bit more information.
>> In particular, the logs of the cluster entrypoint and your K8s
>> deployment specification would be helpful. If you have memory limits
>> specified, those would also be interesting to know.
>>
>> Cheers,
>> Till
>>
>> On Sun, Aug 19, 2018 at 2:43 PM vino yang <yanghua1...@gmail.com> wrote:
>>
>>> Hi Bruno,
>>>
>>> Pinging Till for you; he may be able to give you some useful
>>> information.
>>>
>>> Thanks,
>>> Vino
>>>
>>> On Sun, 19 Aug 2018 at 06:57, Bruno Aranda <bara...@apache.org> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am experiencing an issue when a job manager tries to recover
>>>> using an HA setup. When the job manager starts again and tries to
>>>> resume from the last checkpoints, it gets killed by Kubernetes (I
>>>> guess), since I can see the following in the logs while the jobs
>>>> are being deployed:
>>>>
>>>> INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint -
>>>> RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
>>>>
>>>> I am requesting enough memory for it, 3000Gi, and it is configured
>>>> to use 2048Gb of memory. I have tried increasing the max perm size,
>>>> but did not see an improvement.
>>>>
>>>> Any suggestions to help diagnose this?
>>>>
>>>> I have the following:
>>>>
>>>> Flink 1.6.0 (same with 1.5.1)
>>>> Azure AKS with Kubernetes 1.11
>>>> State management using RocksDB with checkpoints stored in Azure
>>>> Data Lake
>>>>
>>>> Thanks!
>>>>
>>>> Bruno
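
A minimal sketch of the fix Bruno describes, assuming a JobManager
Deployment along the lines of the Flink Helm chart example. The
container name, image, port, and all values below are illustrative
placeholders rather than the exact manifest from this thread:

    # Hypothetical excerpt from a Flink JobManager pod spec. All names
    # and values are placeholders; the key point is raising the probe's
    # initialDelaySeconds well above the example chart's default of 30
    # so that Kubernetes does not SIGTERM a JobManager that is still
    # recovering its jobs from checkpoints in an HA setup.
    containers:
      - name: jobmanager
        image: flink:1.6.0
        ports:
          - containerPort: 6123       # JobManager RPC port
        resources:
          requests:
            memory: "3Gi"             # placeholder; size to your jobs
          limits:
            memory: "4Gi"
        livenessProbe:
          tcpSocket:
            port: 6123
          initialDelaySeconds: 120    # give HA recovery time to finish
          periodSeconds: 15

If the probe still fires too early, kubectl describe pod
<jobmanager-pod> is, as Bruno notes, the quickest way to confirm it:
the failed probe and the resulting container kill both show up as
events on the pod.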