Re: Job manager crash

2021-09-18 Thread Yang Wang
The GC log looks quite normal. Maybe the K8s APIServer is overloaded. Best, Yang On Mon, Sep 13, 2021 at 5:11 PM, houssem wrote: > hello, > > here's some of full GC log: > > OpenJDK 64-Bit Server VM (25.232-b09) for linux-amd64 JRE (1.8.0_232-b09), > built on Oct 18 2019 15:04:46 by "jenkins" with gcc 4.8.2 20

Re: Job manager crash

2021-09-13 Thread houssem
Hello, here's some of the full GC log: OpenJDK 64-Bit Server VM (25.232-b09) for linux-amd64 JRE (1.8.0_232-b09), built on Oct 18 2019 15:04:46 by "jenkins" with gcc 4.8.2 20140120 (Red Hat 4.8.2-15) Memory: 4k page, physical 976560k(946672k free), swap 0k(0k free) CommandLine flags: -XX:Compresse

Re: Job manager crash

2021-09-09 Thread mejri houssem
Thanks for the response. With respect to the API server, I don't think I can do much about it because I am only using a specific namespace in the Kubernetes cluster; I am not the one who administers the cluster. Otherwise, I will try the GC log option to see if I can find something useful in order to debu

Re: Job manager crash

2021-09-09 Thread houssem
Hello, with respect to the API server I don't re On 2021/09/09 11:37:49, Yang Wang wrote: > I think @Robert Metzger is right. You need to check > whether your Kubernetes APIServer is working properly or not (e.g. > overloaded). > > Another hint is about the full GC. Please use the following con

Re: Job manager crash

2021-09-09 Thread Yang Wang
I think @Robert Metzger is right. You need to check whether your Kubernetes APIServer is working properly or not (e.g. overloaded). Another hint is about the full GC. Please use the following config option to enable the GC logs and check the full GC time. env.java.opts.jobmanager: -verbose:gc -XX:+

Re: Job manager crash

2021-09-09 Thread Robert Metzger
Is the Kubernetes server you are using particularly busy? Maybe these issues occur because the server is overloaded? "Triggering checkpoint 2193 (type=CHECKPOINT) @ 1630681482667 for job ." "Completed checkpoint 2193 for job (474 byt

Re: Job manager crash

2021-09-06 Thread houssem
Hello, I have three jobs running on my Kubernetes cluster and each job has its own cluster ID. On 2021/09/06 03:28:10, Yangze Guo wrote: > Hi, > > The root cause is not "java.lang.NoClassDefFound". The job has been > running but could not edit the config map > "myJob-0

Re: Job manager crash

2021-09-05 Thread Yangze Guo
Hi, The root cause is not "java.lang.NoClassDefFound". The job has been running but could not edit the config map "myJob--jobmanager-leader", and it seems it eventually disconnected from the API server. Is there another job with the same cluster ID (myJob)? I would also

Re: Job manager crash

2021-09-05 Thread Caizhi Weng
Hi! There is a message saying "java.lang.NoClassDefFoundError: org/apache/hadoop/hdfs/HdfsConfiguration" in your log file. Are you accessing HDFS in your job? If yes, it seems that your Flink distribution or your cluster is lacking Hadoop classes. Please make sure that there are Hadoop jars in the

Re: Should flink job manager crash during zookeeper upgrade?

2021-02-11 Thread Barisa Obradovic
Thank you Till, that's perfect. I increased the max retry attempts a bit, and now it works like a charm (no restarts). -- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: Should flink job manager crash during zookeeper upgrade?

2021-02-11 Thread Till Rohrmann
Hi Barisa, Could you give us the full logs of the run? It looks as if you exceeded the maximum retry attempts while you upgraded your ZooKeeper cluster. You can increase it via recovery.zookeeper.client.retry-wait and recovery.zookeeper.client.max-retry-attempts. From Flink's perspective it
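A minimal sketch of the two settings named above, as they might appear in flink-conf.yaml. The key names are taken from the message; the values are illustrative assumptions, not recommendations. More recent Flink releases document these options under a high-availability.zookeeper.client.* prefix, so check the configuration reference for the version in use.

```yaml
# flink-conf.yaml (sketch; values are assumptions chosen for illustration)

# Wait time between ZooKeeper client retries, in milliseconds.
recovery.zookeeper.client.retry-wait: 5000

# Number of retries before the ZooKeeper client gives up. A larger budget gives
# the JobManager more time to ride out a rolling ZooKeeper upgrade.
recovery.zookeeper.client.max-retry-attempts: 10
```

The trade-off is that a genuinely lost ZooKeeper quorum takes longer to surface as a failure.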

Re: Should flink job manager crash during zookeeper upgrade?

2021-02-10 Thread Barisa Obradovic
Great, thank you for the help, Matthias.

Re: Should flink job manager crash during zookeeper upgrade?

2021-02-10 Thread Matthias Pohl
r server zdzk.servicexxx/192.168.190.92:2181, > unexpected error, closing socket connection and attempting reconnect > java.io.IOException: Connection reset by peer > at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:1.8.0_192] > at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:

Should flink job manager crash during zookeeper upgrade?

2021-02-10 Thread Barisa Obradovic
sun.nio.ch.IOUtil.rea FYI: I've asked the same question on Stack Overflow: https://stackoverflow.com/questions/66120905/should-flink-job-manager-crash-during-zookeeper-upgrade

Re: Recovery from job manager crash using check points

2019-08-21 Thread Zili Chen
tates with a save point directory? e.g. > > ./bin/flink run myJob.jar -s savepointDirectory > > > > Regards, > > > > Min > > > > *From:* Zili Chen [mailto:wander4...@gmail.com] > *Sent:* Tuesday, 20 August 2019 04:16 > *To:* Biao Liu > *Cc:* Ta

RE: Recovery from job manager crash using check points

2019-08-21 Thread min.tan
From: Zili Chen [mailto:wander4...@gmail.com] Sent: Tuesday, 20 August 2019 04:16 To: Biao Liu Cc: Tan, Min; user Subject: [External] Re: Recovery from job manager crash using check points Hi Min, I guess you use standalone high-availability and when TM fails, JM can recover the job from an

Re: Recovery from job manager crash using check points

2019-08-19 Thread Zili Chen
Hi Min, I guess you use standalone high-availability and when TM fails, JM can recover the job from an in-memory checkpoint store. However, when JM fails, since you don't persist state on an HA backend such as ZooKeeper, even if the JM is relaunched by the YARN RM or superseded by a standby, the new one knows not

Re: Recovery from job manager crash using check points

2019-08-19 Thread Biao Liu
Hi Min, > Do I need to set up ZooKeeper to keep the states when a job manager crashes? I guess you need to set up the HA [1] properly. Besides that, I would suggest you also check the state backend. 1. https://ci.apache.org/projects/flink/flink-docs-master/ops/jobmanager_high_availabilit
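For orientation, a minimal sketch of what a ZooKeeper-based HA setup plus a durable checkpoint location might look like in flink-conf.yaml. The hostnames, paths, and cluster ID are hypothetical placeholders, and the exact key set should be checked against the HA documentation for the Flink version in use.

```yaml
# flink-conf.yaml (sketch; hostnames, paths, and cluster ID are hypothetical)

# Store JobManager metadata in ZooKeeper so that a restarted or standby
# JobManager can find the latest completed checkpoint.
high-availability: zookeeper
high-availability.zookeeper.quorum: zk-1:2181,zk-2:2181,zk-3:2181
high-availability.cluster-id: /my-flink-cluster

# Durable storage for the HA metadata that ZooKeeper only holds pointers to.
high-availability.storageDir: hdfs:///flink/ha/

# State backend and a checkpoint directory that survives process failures.
state.backend: rocksdb
state.checkpoints.dir: hdfs:///flink/checkpoints
```

Without the high-availability entries, checkpoint metadata lives only in the JobManager's memory, which matches the behaviour described in this thread: recovery works after a TaskManager failure but not after a JobManager failure.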

Re: Recovery from job manager crash using check points

2019-08-19 Thread miki haiat
Which kind of deployment system are you using: standalone, YARN, ... other? On Mon, Aug 19, 2019, 18:28 wrote: > Hi, > > > > I can use check points to recover Flink states when a task manager crashes. > > > > I can not use check points to recover Flink states when a job manager > crashes. > > > > D

Recovery from job manager crash using check points

2019-08-19 Thread min.tan
Hi, I can use check points to recover Flink states when a task manager crashes. I cannot use check points to recover Flink states when a job manager crashes. Do I need to set up ZooKeeper to keep the states when a job manager crashes? Regards Min