The shutdown happened after the massive IO wait. I don't use any state; checkpoints are disk based...
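For context, here is a minimal sketch of the kind of setup in question (the GlusterFS mount path and the 5 second interval below are placeholders for illustration, not my exact values):

import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetupSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 5 seconds (the interval being discussed in this thread).
        env.enableCheckpointing(5_000, CheckpointingMode.EXACTLY_ONCE);

        // Disk-based checkpoints: a filesystem state backend writing to the shared
        // GlusterFS mount. The path is a placeholder, not the cluster's real mount point.
        env.setStateBackend(new FsStateBackend("file:///mnt/glusterfs/flink/checkpoints"));

        // Trivial placeholder pipeline so the sketch compiles and runs;
        // the real job reads from Kafka and does its processing here.
        env.fromElements("a", "b", "c").print();
        env.execute("checkpoint-setup-sketch");
    }
}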
On Mon., Dec. 23, 2019, 1:42 a.m. Zhijiang, <wangzhijiang...@aliyun.com> wrote:
> Hi John,
>
> Thanks for the positive comments on Flink usage. Whether you use at-least-once
> or exactly-once checkpointing, it will never lose a message during failure recovery.
>
> Unfortunately I cannot access the logs you posted. Generally speaking, a longer
> checkpoint interval means replaying more source data after failure recovery.
> In my experience, the 5 second checkpoint interval is too frequent, and you might
> increase it to 1 minute or so. You can also monitor how long checkpoints take to
> finish in your application and adjust the interval accordingly.
>
> Concerning the node shutdown you mentioned, I am not quite sure whether it is
> related to your short checkpoint interval. Are you configured to use the heap
> state backend? The hs_err file indicates that your job encountered a memory issue,
> so it would help to increase your task manager memory. If you can analyze the
> hs_err dump file with a profiling tool to check memory usage, that might be more
> helpful for finding the root cause.
>
> Best,
> Zhijiang
>
> ------------------------------------------------------------------
> From:John Smith <java.dev....@gmail.com>
> Send Time:2019 Dec. 21 (Sat.) 05:26
> To:user <user@flink.apache.org>
> Subject:Flink task node shut it self off.
>
> Hi, using Flink 1.8.0
>
> 1st off I must say Flink resiliency is very impressive, we lost a node and
> never lost one message by using checkpoints and Kafka. Thanks!
>
> The cluster is a self hosted cluster and we use our own zookeeper cluster.
> We have...
> 3 zookeepers: 4 cpu, 8GB (each)
> 3 job nodes: 4 cpu, 8GB (each)
> 3 task nodes: 4 cpu, 8GB (each)
> The nodes also share GlusterFS for storing savepoints and checkpoints;
> GlusterFS is running on the same machines.
>
> Yesterday a node shut itself off with the following log messages...
> - Stopping TaskExecutor akka.tcp://fl...@xxx.xxx.xxx.73:34697/user/taskmanager_0.
> - Stop job leader service.
> - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
> - Shutting down TaskExecutorLocalStateStoresManager.
> - Shutting down BLOB cache
> - Shutting down BLOB cache
> - removed file cache directory
>   /tmp/flink-dist-cache-4b60d79b-1cef-4ffb-8837-3a9c9a205000
> - I/O manager removed spill file directory
>   /tmp/flink-io-c9d01b92-2809-4a55-8ab3-6920487da0ed
> - Shutting down the network environment and its components.
>
> Prior to the node shutting off we noticed massive IOWAIT of 140% and a 1-minute
> CPU load of 15. We also got an hs_err file which says we should increase the memory.
>
> I'm attaching the logs here:
> https://www.dropbox.com/sh/vp1ytpguimiayw7/AADviCPED47QEy_4rHsGI1Nya?dl=0
>
> I wonder if my 5 second checkpointing is too much for Gluster.
>
> Any thoughts?
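If it helps anyone following the thread, this is roughly how I would try the tuning suggested above; the 1 minute interval, pause, and timeout values are only illustrative starting points, not recommendations from the Flink docs:

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointTuningSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint roughly once a minute instead of every 5 seconds, per the suggestion above.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        CheckpointConfig conf = env.getCheckpointConfig();
        // Leave breathing room between checkpoints so slow GlusterFS writes can drain.
        conf.setMinPauseBetweenCheckpoints(30_000);
        // Abort any checkpoint that has not completed within 2 minutes.
        conf.setCheckpointTimeout(120_000);
        // Never run overlapping checkpoints against the shared filesystem.
        conf.setMaxConcurrentCheckpoints(1);

        // Trivial placeholder pipeline so the sketch compiles and runs.
        env.fromElements("a", "b", "c").print();
        env.execute("checkpoint-tuning-sketch");
    }
}

Actual checkpoint durations can then be watched in the job's Checkpoints tab in the web UI to see whether the interval needs further adjustment.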