Hi Dongwon,

Thanks for sharing the logs and the metrics screenshots with us. Unfortunately,
I think we need more information to further isolate the problem, so I have a
couple of suggestions.

1. Since you already have PromQL set up, can you also share the JVM memory
statistics, in particular the DirectMemory consumption over time? I would be
interested to see whether the consumption slowly increases until the OOM
happens or whether it spikes only during the failing checkpoint.
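
In case the direct memory metric is not available on your dashboard, here is a
minimal sketch of how the same number can be read from inside a JVM using the
standard BufferPoolMXBean. The class name DirectMemoryProbe and the sampling
interval are only placeholders I made up for illustration; as far as I know
Flink exposes an equivalent value as Status.JVM.Memory.Direct.MemoryUsed, so
this is only meant as a fallback, not as the recommended way.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper, not part of Flink: samples the JVM "direct" buffer pool.
public class DirectMemoryProbe {

    /** Returns the bytes currently used by the "direct" buffer pool, or -1 if the pool is not found. */
    public static long directMemoryUsed() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1L;
    }

    public static void main(String[] args) throws InterruptedException {
        // Number of samples and interval are arbitrary choices for this sketch.
        int samples = args.length > 0 ? Integer.parseInt(args[0]) : 6;
        long intervalMillis = 10_000L;
        for (int i = 0; i < samples; i++) {
            System.out.println(System.currentTimeMillis() + " directMemoryUsed=" + directMemoryUsed());
            Thread.sleep(intervalMillis);
        }
    }
}
```

A steadily growing value across samples would point to a leak, while a flat
line with a single jump around the checkpoint would point to a spike.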

2. We suspect that the Kafka sink is causing the problem. Can you try to run
your pipeline with a simple DiscardingSink and see if the error keeps happening?
It is possible that another component in your pipeline allocates a lot of
DirectMemory and the FlinkKafkaProducer only appears to be the culprit because
it is the first component to hit the threshold.
Another option would be to test the new KafkaSink, which was released with 1.14
and is meant to replace the FlinkKafkaProducer.

Best,
Fabian
