Re: Job manager crash

Yang Wang Thu, 09 Sep 2021 04:38:20 -0700

I think @Robert Metzger <rmetz...@apache.org> is right. You need to check
whether your Kubernetes API server is working properly (e.g. whether it is
overloaded).


Another hint concerns full GC. Please use the following config option to
enable GC logging and check the full GC pause times:
env.java.opts.jobmanager: -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/opt/flink/log/jobmanager-gc.log
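
To check the full GC time once the log exists, a small script can pull the
pause durations out of it. This is a minimal sketch in Python, assuming the
JDK 8 log format produced by the flags above, where each full collection is
written on a single line containing "[Full GC" and ending with
"real=<seconds> secs]" (multi-line output from other collectors is not
handled). The function names are hypothetical, chosen only for illustration.

```python
import re

# Assumption: JDK 8 -XX:+PrintGCDetails output, one "Full GC" event per line, e.g.
# "...: [Full GC (Ergonomics) ... [Times: user=4.10 sys=0.02, real=2.35 secs]"
FULL_GC = re.compile(r"\[Full GC\b.*?real=(\d+(?:[.,]\d+)?) secs\]")


def full_gc_pauses(lines):
    """Return the wall-clock pause (in seconds) of every Full GC line."""
    pauses = []
    for line in lines:
        match = FULL_GC.search(line)
        if match:
            # Some locales print a decimal comma; normalise it before parsing.
            pauses.append(float(match.group(1).replace(",", ".")))
    return pauses


def long_pauses(lines, threshold_secs):
    """Return only the Full GC pauses at or above threshold_secs."""
    return [p for p in full_gc_pauses(lines) if p >= threshold_secs]
```

For example, long_pauses(open("/opt/flink/log/jobmanager-gc.log"), 10.0)
lists the full GC pauses of 10 seconds or more. Pauses approaching the
renew-deadline would be consistent with the lease renewal timing out.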

Simply increasing the renew-deadline might help, but it may not solve the
problem completely.
high-availability.kubernetes.leader-election.lease-duration: 120 s
high-availability.kubernetes.leader-election.renew-deadline: 120 s


Best,
Yang

Robert Metzger <rmetz...@apache.org> wrote on Thu, Sep 9, 2021 at 6:52 PM:

> Is the kubernetes server you are using particularly busy? Maybe these
> issues occur because the server is overloaded?
>
> "Triggering checkpoint 2193 (type=CHECKPOINT) @ 1630681482667 for job
> 00000000000000000000000000000000."
> "Completed checkpoint 2193 for job 00000000000000000000000000000000 (474
> bytes in 195 ms)."
> "Triggering checkpoint 2194 (type=CHECKPOINT) @ 1630681492667 for job
> 00000000000000000000000000000000."
> "Completed checkpoint 2194 for job 00000000000000000000000000000000 (474
> bytes in 161 ms)."
> "Renew deadline reached after 60 seconds while renewing lock
> ConfigMapLock: myNs - myJob-dispatcher-leader
> (1bcda6b0-8a5a-4969-b9e4-2257c4478572)"
> "Stopping SessionDispatcherLeaderProcess."
>
> At some point, the leader election mechanism in fabric8 seems to give up.
>
>
> On Tue, Sep 7, 2021 at 10:05 AM mejri houssem <mejrihousse...@gmail.com>
> wrote:
>
>> hello,
>>
>> Here are more logs from the latest JM crash.
>>
>>
>> On Mon, Sep 6, 2021 at 14:18, houssem <mejrihousse...@gmail.com>
>> wrote:
>>
>>> hello,
>>>
>>> I have three jobs running on my Kubernetes cluster, and each job has its
>>> own cluster id.
>>>
>>> On 2021/09/06 03:28:10, Yangze Guo <karma...@gmail.com> wrote:
>>> > Hi,
>>> >
>>> > The root cause is not "java.lang.NoClassDefFoundError". The job was
>>> > running but could not edit the config map
>>> > "myJob-00000000000000000000000000000000-jobmanager-leader", and it
>>> > seems it was eventually disconnected from the API server. Is there
>>> > another job with the same cluster id (myJob)?
>>> >
>>> > I would also pull in Yang Wang.
>>> >
>>> > Best,
>>> > Yangze Guo
>>> >
>>> > On Mon, Sep 6, 2021 at 10:10 AM Caizhi Weng <tsreape...@gmail.com>
>>> > wrote:
>>> > >
>>> > > Hi!
>>> > >
>>> > > There is a message saying "java.lang.NoClassDefFoundError:
>>> > > org/apache/hadoop/hdfs/HdfsConfiguration" in your log file. Are you
>>> > > accessing HDFS in your job? If so, it seems that your Flink distribution
>>> > > or your cluster is missing the Hadoop classes. Please make sure that the
>>> > > Hadoop jars are in Flink's lib directory, or that the HADOOP_CLASSPATH
>>> > > environment variable is set on your cluster.
>>> > >
>>> > > mejri houssem <mejrihousse...@gmail.com> wrote on Sat, Sep 4, 2021 at 12:15 AM:
>>> > >>
>>> > >>
>>> > >> Hello,
>>> > >>
>>> > >> I have been facing a JM crash lately. I am deploying a Flink application
>>> > >> cluster on Kubernetes.
>>> > >>
>>> > >> When I install my chart using Helm, everything works fine, but after
>>> > >> some time the JM starts to crash,
>>> > >>
>>> > >> and it is eventually deleted after 5 restarts.
>>> > >>
>>> > >> Flink version: 1.12.5 (upgraded recently from 1.12.2)
>>> > >> HA mode: k8s
>>> > >>
>>> > >> The full JM log is in the attached file.
>>> >
>>>
>>
