Hi Dongwon,

Thanks for sharing the logs and the metrics screenshots with us. Unfortunately, I think we need more information to isolate the problem further, so I have a couple of suggestions.

1. Since you already set up PromQL, can you also share the JVM memory statistics, i.e. the DirectMemory consumption over time? I would be interested to see whether the consumption slowly increases until the OOM happens or whether it spikes only during the failing checkpoint.

2. We suspect that the Kafka sink is causing the problem. Can you try to run your pipeline with a simple DiscardingSink and see whether the error keeps happening? Maybe another component in your pipeline allocates a lot of DirectMemory and the FlinkKafkaProducer only surfaces the problem because it is the first component to hit the threshold. Another option would be to test the new KafkaSink, which was released with 1.14 and should replace the FlinkKafkaProducer.

Best,
Fabian
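
P.S. In case it helps with the first point: below is a minimal sketch of how the direct buffer usage can be read from inside a JVM with the standard `BufferPoolMXBean`. The class name `DirectMemoryProbe` and the method `directMemoryUsedBytes` are made up for illustration and are not part of Flink. As far as I know this pool only counts buffers allocated through `ByteBuffer.allocateDirect`, so treat the numbers as an approximation and prefer the metrics your reporter already exports if they are available.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

// Hypothetical helper for illustration only; not a Flink class.
class DirectMemoryProbe {

    /** Returns the bytes currently used by the JVM's "direct" buffer pool, or -1 if no such pool is found. */
    static long directMemoryUsedBytes() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1L;
    }

    public static void main(String[] args) throws InterruptedException {
        // Allocate 1 MiB of direct memory so the probe has something to report.
        ByteBuffer buffer = ByteBuffer.allocateDirect(1024 * 1024);

        // Sample the pool a few times; a real check would log this over the job's lifetime.
        for (int i = 0; i < 3; i++) {
            System.out.println(System.currentTimeMillis()
                    + " direct bytes used: " + directMemoryUsedBytes());
            Thread.sleep(1000);
        }

        // Keep the buffer reachable until the sampling is done.
        System.out.println("buffer capacity: " + buffer.capacity());
    }
}
```

Logging this value periodically next to the checkpoint timestamps should show whether the usage grows steadily or jumps only around the failing checkpoint.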