Hello List,

We have a Flink job that reads from a Kafka topic and then sends every message on with a SOAP call. We recently had a situation where that SOAP call failed every time, which put the job into the RESTARTING state every few seconds.
After a few hours the Flink process itself terminated with an OutOfMemoryError. This means that all Flink jobs are now in trouble, not only the failing one.

I dumped the heap and noticed that it was completely filled up with two things:

- Kafka metrics
- HashMap nodes related to PublicSuffixMatcher, which is part of Apache HttpClient

This leads me to believe that restarting the job somehow retains references to old, failed classes or classloaders.

Of course I will repair the root cause, the failing job, but I would also like to fix things so that Flink does not die when something like this happens. I could set a maximum number of retries, but I would rather not: I want the job to retry indefinitely, so that it continues normally once the underlying problem is repaired.

I tried to find information about how Flink loads jobs, but I could not make much of it. How can I ensure that Flink does not run out of memory like this?

We're using Flink 1.1.1 and Kafka 0.9.0.1.

Thanks for your time,
Frits
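
P.S. To make the behaviour I am after more concrete, here is a minimal plain-Java sketch. It does not use any Flink API and is not our real job code: `RetrySketch`, `SoapClient`, `sendWithRetry` and the `failuresBeforeRecovery` parameter are made-up names for illustration only. The idea is to retry without an upper bound, while making sure that whatever a failed attempt allocated is released before the next attempt starts.

```java
import java.util.concurrent.atomic.AtomicInteger;

class RetrySketch {
    // Number of clients opened but not yet closed; a stand-in for leaked state.
    static final AtomicInteger openClients = new AtomicInteger();

    // Hypothetical stand-in for the SOAP/HTTP client (not a real library class).
    static class SoapClient implements AutoCloseable {
        SoapClient() { openClients.incrementAndGet(); }

        void send(String message, boolean endpointUp) {
            if (!endpointUp) {
                throw new IllegalStateException("SOAP call failed for: " + message);
            }
        }

        @Override
        public void close() { openClients.decrementAndGet(); }
    }

    // Retries with no upper bound until the call succeeds and returns the
    // number of attempts. The first failuresBeforeRecovery attempts fail,
    // which simulates an endpoint that is repaired later on.
    static int sendWithRetry(String message, int failuresBeforeRecovery, long delayMillis) {
        int attempt = 0;
        while (true) {
            attempt++;
            // try-with-resources closes the client before the catch block runs,
            // so a failed attempt leaves nothing behind.
            try (SoapClient client = new SoapClient()) {
                client.send(message, attempt > failuresBeforeRecovery);
                return attempt;
            } catch (IllegalStateException e) {
                try {
                    Thread.sleep(delayMillis); // fixed pause between attempts
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new RuntimeException("interrupted while waiting to retry", ie);
                }
            }
        }
    }

    public static void main(String[] args) {
        int attempts = sendWithRetry("hello", 3, 10L);
        System.out.println("attempts=" + attempts + ", open clients=" + openClients.get());
        // prints "attempts=4, open clients=0"
    }
}
```

What I would like to understand is whether Flink gives a restarted job an equivalent guarantee, or whether something from each failed attempt can stay reachable.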