Could you please configure more memory to avoid the OOM, and use Native Memory Tracking (NMT) [1] to break down the memory usage by category?
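A minimal sketch of enabling NMT, assuming the TaskManager JVM options are passed through flink-conf.yaml via `env.java.opts.taskmanager` (adjust if your deployment sets JVM flags elsewhere):

```yaml
# Assumption: JVM flags for the TaskManager are set in flink-conf.yaml.
# "summary" gives per-category totals; "detail" is more verbose.
env.java.opts.taskmanager: -XX:NativeMemoryTracking=summary
```

With that flag set, running `jcmd <pid> VM.native_memory summary` inside the TaskManager container should print the per-category breakdown, assuming `jcmd` is available in the image (JRE-only images may not ship it). NMT adds some runtime overhead and only covers memory allocated by the JVM itself.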
[1]. https://docs.oracle.com/javase/8/docs/technotes/guides/troubleshoot/tooldescr007.html

Best,
Yang

Dan Hill <quietgol...@gmail.com> wrote on Thu, Apr 21, 2022 at 07:42:

> Hi.
>
> I upgraded to Flink v1.14.4 and now my Flink TaskManagers are being killed
> by Kubernetes for exceeding the requested memory. My Flink TM is using an
> extra ~5gb of memory over the tm.memory.process.size.
>
> Here are the flink-config values that I'm using
> taskmanager.memory.process.size: 25600mb
> # The default, 256mb, is too small.
> taskmanager.memory.jvm-metaspace.size: 320mb
> taskmanager.memory.network.fraction: 0.2
> taskmanager.memory.network.max: 2560m
>
> I'm requesting 26112Mi in my Kubernetes config (so there's some buffer).
>
> I re-read the Flink docs
> <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/memory/mem_setup/>
> on setting memory. This seems like it should be fine. The diagrams and docs
> show that process.size is used.
>
> If it helps, the TMs are failing in a round robin once every ~30 minutes
> or so. This isn't an issue with Flink v1.12.3 but is an issue with Flink
> v1.14.4.
>
> My text logs have a bunch of kafka connections in them. I don't know if
> that's related to overallocating memory.
>
> ❯ kubectl -n flink-v1-14-4 get events
>
> LAST SEEN   TYPE      REASON                OBJECT                          MESSAGE
> 37m         Warning   Evicted               pod/flink-taskmanager-3         The node was low on resource: memory. Container taskmanager was using 31457992Ki, which exceeds its request of 26112Mi.
> 37m         Normal    Killing               pod/flink-taskmanager-3         Stopping container taskmanager
> 37m         Normal    Scheduled             pod/flink-taskmanager-3         Successfully assigned hipcamp-prod-metrics-flink-v1-14-4/flink-taskmanager-3 to ip-10-12-104-15.ec2.internal
> 37m         Normal    Pulled                pod/flink-taskmanager-3         Container image "flink:1.14.4" already present on machine
> 37m         Normal    Created               pod/flink-taskmanager-3         Created container taskmanager
> 37m         Normal    Started               pod/flink-taskmanager-3         Started container taskmanager
> 37m         Normal    SuccessfulCreate      statefulset/flink-taskmanager   create Pod flink-taskmanager-3 in StatefulSet flink-taskmanager successful
> 37m         Warning   RecreatingFailedPod   statefulset/flink-taskmanager   StatefulSet hipcamp-prod-metrics-flink-v1-14-4/flink-taskmanager is recreating failed Pod flink-taskmanager-3
> 37m         Normal    SuccessfulDelete      statefulset/flink-taskmanager   delete Pod flink-taskmanager-3 in StatefulSet flink-taskmanager successful
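As a sanity check on the figures in the eviction event, here is a small unit-conversion sketch. Python is used only for illustration, the variable names are made up for this example, and it assumes that Flink's `25600mb` means mebibytes (the same unit as Kubernetes `Mi`).

```python
# Unit check for the numbers quoted in the thread.
KIB_PER_MIB = 1024
MIB_PER_GIB = 1024

used_kib = 31_457_992       # "was using 31457992Ki" from the Evicted event
request_mib = 26_112        # Kubernetes memory request (26112Mi)
process_size_mib = 25_600   # taskmanager.memory.process.size (assumed MiB)

used_mib = used_kib / KIB_PER_MIB
over_process_mib = used_mib - process_size_mib
over_request_mib = used_mib - request_mib

print(f"container usage:    {used_mib:.1f} MiB ({used_mib / MIB_PER_GIB:.2f} GiB)")
print(f"over process.size:  {over_process_mib:.1f} MiB ({over_process_mib / MIB_PER_GIB:.2f} GiB)")
print(f"over k8s request:   {over_request_mib:.1f} MiB ({over_request_mib / MIB_PER_GIB:.2f} GiB)")
```

Under that assumption the container was about 5 GiB above `process.size`, which matches the "~5gb" reported in the quoted message, and about 4.5 GiB above the Kubernetes request.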