Hi,

If there are no other obvious problems, you may need to take a heap dump of the affected TaskManager and analyze its memory usage.
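On the option Weihua mentions below: if the cluster does turn out to be standalone, I believe it would go into flink-conf.yaml roughly like this. This is only a sketch. I am assuming the default flink-conf.yaml that ships with the official image and that the cluster is restarted afterwards so the setting is picked up.

```yaml
# flink-conf.yaml (fragment)
# Assumption: standalone session cluster; restart the cluster after changing this.
# Asks the scheduler to spread slots evenly across all registered TaskManagers
# instead of filling up one TaskManager before moving on to the next.
cluster.evenly-spread-out-slots: true
```

It only helps if the TaskManager with heavy GC is actually running more tasks than the others, so it is worth comparing the slot usage per TaskManager first.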
Best,
Shammon

On Fri, Feb 17, 2023 at 10:41 AM Weihua Hu <huweihua....@gmail.com> wrote:

> Hi, Meghajit
>
> What kind of session cluster are you using? Standalone or Native?
> If it's standalone, maybe you can check if TaskManager with heavy gc is
> running more tasks than others. If so, we can enable
> "cluster.evenly-spread-out-slots=true" to balance tasks in all task
> managers.
>
> Best,
> Weihua
>
>
> On Thu, Feb 16, 2023 at 10:52 PM Meghajit Mazumdar <
> meghajit.mazum...@gojek.com> wrote:
>
>> Hello,
>>
>> We have a Flink session cluster deployment in Kubernetes of around 100
>> TaskManagers. It processes around 20-30 Kafka Source jobs at the moment.
>> The jobs run are all using the same jar and only differ in the SQL query
>> used and other UDFs. We are using the official flink:1.14.3 image.
>>
>> We observed that one specific task manager has been doing more garbage
>> collection compared to the others, So much actually, that at a specific
>> hour of the day, it pauses execution to do GC and thus causes huge consumer
>> lag to build up. By garbage collection, I mean GC of the Young Generation.
>> The old generation GC looks fine.
>>
>> We checked this in our other running Flink clusters and found that
>> actually in most of them, this behaviour is being seen. In fact, there are
>> always 2-3 TaskManagers which seem to be doing more GC than the others.
>>
>> Is this a known issue ? Our clusters run long running kafka source to
>> kafka sink jobs, so wanted to know if this can happen because of that.
>>
>> Would appreciate any kind of guidance.
>> --
>> *Regards,*
>> *Meghajit*
>>
>