Hi, we're using Flink 1.8.0. First off, I must say Flink's resiliency is very impressive: we lost a node and never lost a single message, thanks to checkpoints and Kafka. Thanks!
The cluster is self-hosted and we run our own ZooKeeper cluster. We have:

- 3 ZooKeeper nodes: 4 CPUs, 8 GB RAM each
- 3 job nodes: 4 CPUs, 8 GB RAM each
- 3 task nodes: 4 CPUs, 8 GB RAM each

The nodes also share a GlusterFS volume for storing savepoints and checkpoints; GlusterFS runs on the same machines.

Yesterday a node shut itself down with the following log messages:

- Stopping TaskExecutor akka.tcp://fl...@xxx.xxx.xxx.73:34697/user/taskmanager_0.
- Stop job leader service.
- Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
- Shutting down TaskExecutorLocalStateStoresManager.
- Shutting down BLOB cache
- Shutting down BLOB cache
- removed file cache directory /tmp/flink-dist-cache-4b60d79b-1cef-4ffb-8837-3a9c9a205000
- I/O manager removed spill file directory /tmp/flink-io-c9d01b92-2809-4a55-8ab3-6920487da0ed
- Shutting down the network environment and its components.

Prior to the node shutting down we noticed massive I/O wait of 140% and a 1-minute CPU load of 15. We also got an hs_err file that says we should increase the memory. I'm attaching the logs here: https://www.dropbox.com/sh/vp1ytpguimiayw7/AADviCPED47QEy_4rHsGI1Nya?dl=0

I wonder if my 5-second checkpoint interval is too much for Gluster. Any thoughts?
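In case it helps the discussion: the checkpoint interval is configured per job, and Flink's CheckpointConfig also lets you force a minimum pause between checkpoints so a slow checkpoint on Gluster can't immediately trigger the next one. A minimal sketch against the Flink 1.8 DataStream API (the interval and pause values are illustrative assumptions, not recommendations):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointTuning {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 60 s instead of every 5 s (illustrative value).
        env.enableCheckpointing(60_000);

        // Guarantee at least 30 s of quiet time between the end of one
        // checkpoint and the start of the next, so slow GlusterFS writes
        // cannot cause checkpoints to pile up.
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000);

        // Only one checkpoint in flight at a time.
        env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);

        // ... build the job topology and call env.execute(...) here.
    }
}
```

With a 5-second interval and heavy state, the filesystem may never get idle time between checkpoints, which would line up with the I/O wait we saw.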