OOM on flink when job restarts a lot

Frits Jalvingh Thu, 30 Mar 2017 04:25:31 -0700

Hello List,

We have a Flink job that reads a Kafka topic and sends each message on
with a SOAP call. We had a situation where that SOAP call failed every
time, which put the job into the RESTARTING state every few seconds.


After a few hours, Flink itself terminated with an OutOfMemoryError. This
means that all Flink jobs are now in trouble.

I dumped the heap, and noticed that it was completely filled up with two
things:
- Kafka metrics
- HashMap nodes related to PublicSuffixMatcher, a part of Apache HttpClient.

This leads me to believe that each restart somehow retains references to
classes or classloaders of the old, failed job. Is that possible?
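
The suspected mechanism can be sketched in plain Java. This is only an
illustration and not Flink code: the class RestartLeakSketch, the list
GLOBAL_REGISTRY and the methods runAttempt and retainedAfter are all
hypothetical names. The list stands in for any JVM-wide structure (a
metrics registry, a long-lived thread, a static cache) that outlives a
single job attempt.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.ArrayList;
import java.util.List;

public class RestartLeakSketch {

    // Hypothetical stand-in for a JVM-wide registry that outlives any
    // single job attempt.
    static final List<Object> GLOBAL_REGISTRY = new ArrayList<>();

    // One simulated job attempt: it gets a fresh class loader, registers
    // something globally, and then fails.
    static void runAttempt(boolean cleanUp) {
        ClassLoader attemptLoader = new URLClassLoader(
                new URL[0], RestartLeakSketch.class.getClassLoader());

        // The registered object references the attempt's class loader, so
        // the loader (and every class it loaded) stays reachable for as
        // long as the registry holds this entry.
        Object registration = new Object[] { attemptLoader, new byte[1024] };
        GLOBAL_REGISTRY.add(registration);

        try {
            throw new IllegalStateException("simulated SOAP failure");
        } catch (IllegalStateException e) {
            // The failure is what would trigger the next restart.
        } finally {
            if (cleanUp) {
                // Deregistering on shutdown releases the reference.
                GLOBAL_REGISTRY.remove(registration);
            }
        }
    }

    // Runs the given number of failing attempts and reports how many
    // registrations are still reachable afterwards.
    static int retainedAfter(int restarts, boolean cleanUp) {
        GLOBAL_REGISTRY.clear();
        for (int i = 0; i < restarts; i++) {
            runAttempt(cleanUp);
        }
        return GLOBAL_REGISTRY.size();
    }

    public static void main(String[] args) {
        System.out.println("without cleanup: " + retainedAfter(1000, false));
        System.out.println("with cleanup:    " + retainedAfter(1000, true));
    }
}
```

Without cleanup the number of retained entries grows with every restart,
which would match a heap that fills up over a few hours of restarting.
With cleanup it stays at zero.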

Of course I will repair the root cause, the failing job, but I would also
like to fix things so that Flink does not die when something like this
happens. I could of course limit the maximum number of retries, but I do
not like that: I would rather have the job retry indefinitely, so that it
continues normally once the problem is repaired.

I tried to find information about how Flink loads jobs, but I could not
make much of it.

How can I ensure that Flink does not run out of memory like this?

We're using Flink 1.1.1 and Kafka 0.9.0.1.

Thanks for your time,

Frits
