Hi Meghajit,

What kind of session cluster are you using: standalone or native? If it's standalone, you could check whether the TaskManager with heavy GC is running more tasks than the others. If so, setting "cluster.evenly-spread-out-slots=true" should help balance tasks across all TaskManagers.
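
For reference, a minimal sketch of that setting, assuming the cluster is configured through flink-conf.yaml (the file and how it is mounted into the Kubernetes deployment are assumptions on my side; only the key comes from the suggestion above). As far as I know it is read at cluster startup, so the session cluster would probably need a restart for it to take effect:

```yaml
# flink-conf.yaml (assumed location of the cluster configuration)
# Ask the slot manager to spread slots across all registered TaskManagers
# instead of filling one TaskManager before moving on to the next.
cluster.evenly-spread-out-slots: true
```

After the restart, the number of tasks per TaskManager in the web UI should show whether the distribution actually evened out.
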
Best,
Weihua

On Thu, Feb 16, 2023 at 10:52 PM Meghajit Mazumdar <meghajit.mazum...@gojek.com> wrote:

> Hello,
>
> We have a Flink session cluster deployment in Kubernetes of around 100
> TaskManagers. It processes around 20-30 Kafka Source jobs at the moment.
> The jobs run are all using the same jar and only differ in the SQL query
> used and other UDFs. We are using the official flink:1.14.3 image.
>
> We observed that one specific task manager has been doing more garbage
> collection compared to the others. So much actually, that at a specific
> hour of the day, it pauses execution to do GC and thus causes huge consumer
> lag to build up. By garbage collection, I mean GC of the Young Generation.
> The old generation GC looks fine.
>
> We checked this in our other running Flink clusters and found that
> actually in most of them, this behaviour is being seen. In fact, there are
> always 2-3 TaskManagers which seem to be doing more GC than the others.
>
> Is this a known issue? Our clusters run long running kafka source to
> kafka sink jobs, so wanted to know if this can happen because of that.
>
> Would appreciate any kind of guidance.
> --
> *Regards,*
> *Meghajit*