We're running Flink on a 5-node cluster with two Job Managers and three Task Managers.
Lately, about once a day, all three Task Managers get killed. This drops the number of available task slots to 0 and causes every job running on the cluster to fail. The only fix so far is to restart the Task Managers manually. What are some typical reasons that can bring down a Task Manager? And is there a way to bring them back up automatically, without manual intervention?
Additional info: the jobs running on the cluster read data from Kafka and write data to Kafka/Cassandra.