Could you please configure more memory to avoid the OOM, and use
Native Memory Tracking (NMT) [1] to break down the memory usage by category?

[1] https://docs.oracle.com/javase/8/docs/technotes/guides/troubleshoot/tooldescr007.html
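
As a rough sketch of what that could look like on Kubernetes: NMT has to be
switched on at JVM start with -XX:NativeMemoryTracking=summary (in Flink 1.14
one way is the env.java.opts.taskmanager option in flink-conf.yaml), after
which jcmd can print the per-category figures. The helper names below are
hypothetical, and the sketch assumes jcmd is present in the TaskManager image,
which may not hold for JRE-only images. Note also that NMT only covers memory
the JVM itself allocates, not allocations made by third-party native libraries.

```shell
# Hypothetical helpers. Arguments: <namespace> <pod> [<jvm-pid>].
# Assumes the JVM runs with -XX:NativeMemoryTracking=summary and that
# jcmd is available inside the container.

# List the JVM processes in the pod to find the TaskManager PID.
nmt_list_jvms() {
  kubectl -n "$1" exec "$2" -- jcmd -l
}

# Print the current NMT summary, scaled to MB.
nmt_summary() {
  kubectl -n "$1" exec "$2" -- jcmd "$3" VM.native_memory summary scale=MB
}

# Record a baseline to compare against later.
nmt_baseline() {
  kubectl -n "$1" exec "$2" -- jcmd "$3" VM.native_memory baseline
}

# Show which categories have grown since the baseline.
nmt_diff() {
  kubectl -n "$1" exec "$2" -- jcmd "$3" VM.native_memory summary.diff scale=MB
}

# Example usage (PID taken from the nmt_list_jvms output):
#   nmt_list_jvms flink-v1-14-4 flink-taskmanager-3
#   nmt_baseline  flink-v1-14-4 flink-taskmanager-3 <pid>
#   nmt_diff      flink-v1-14-4 flink-taskmanager-3 <pid>
```

Taking a baseline shortly after startup and a diff some minutes before the
expected eviction should show which category, if any, keeps growing.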

Best,
Yang

Dan Hill <quietgol...@gmail.com> wrote on Thu, Apr 21, 2022 at 07:42:

> Hi.
>
> I upgraded to Flink v1.14.4 and now my Flink TaskManagers are being killed
> by Kubernetes for exceeding the requested memory.  My Flink TM is using an
> extra ~5 GB of memory over taskmanager.memory.process.size.
>
> Here are the flink-config values that I'm using
>     taskmanager.memory.process.size: 25600mb
>     # The default, 256mb, is too small.
>     taskmanager.memory.jvm-metaspace.size: 320mb
>     taskmanager.memory.network.fraction: 0.2
>     taskmanager.memory.network.max: 2560m
>
> I'm requesting 26112Mi in my Kubernetes config (so there's some buffer).
>
> I re-read the Flink docs on setting memory
> <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/memory/mem_setup/>.
> This seems like it should be fine.  The diagrams and docs
> show that process.size is used.
>
> If it helps, the TMs are failing in a round robin once every ~30 minutes
> or so.  This isn't an issue with Flink v1.12.3 but is an issue with Flink
> v1.14.4.
>
> My text logs have a bunch of kafka connections in them.  I don't know if
> that's related to overallocating memory.
>
> ❯ kubectl -n flink-v1-14-4 get events
> LAST SEEN   TYPE      REASON                OBJECT                          MESSAGE
> 37m         Warning   Evicted               pod/flink-taskmanager-3         The node was low on resource: memory. Container taskmanager was using 31457992Ki, which exceeds its request of 26112Mi.
> 37m         Normal    Killing               pod/flink-taskmanager-3         Stopping container taskmanager
> 37m         Normal    Scheduled             pod/flink-taskmanager-3         Successfully assigned hipcamp-prod-metrics-flink-v1-14-4/flink-taskmanager-3 to ip-10-12-104-15.ec2.internal
> 37m         Normal    Pulled                pod/flink-taskmanager-3         Container image "flink:1.14.4" already present on machine
> 37m         Normal    Created               pod/flink-taskmanager-3         Created container taskmanager
> 37m         Normal    Started               pod/flink-taskmanager-3         Started container taskmanager
> 37m         Normal    SuccessfulCreate      statefulset/flink-taskmanager   create Pod flink-taskmanager-3 in StatefulSet flink-taskmanager successful
> 37m         Warning   RecreatingFailedPod   statefulset/flink-taskmanager   StatefulSet hipcamp-prod-metrics-flink-v1-14-4/flink-taskmanager is recreating failed Pod flink-taskmanager-3
> 37m         Normal    SuccessfulDelete      statefulset/flink-taskmanager   delete Pod flink-taskmanager-3 in StatefulSet flink-taskmanager successful
>
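
For reference, the figures quoted above can be put on a common unit with a
little shell arithmetic. This is only a sketch: the numbers are copied from
the config and the Evicted event, and it assumes Flink's "mb" suffix denotes
mebibytes, the same unit as Kubernetes' "Mi".

```shell
# Figures copied from the quoted config and the eviction event.
used_ki=31457992     # container usage reported in the Evicted event (Ki)
request_mi=26112     # Kubernetes memory request (Mi)
process_mi=25600     # taskmanager.memory.process.size (assumed to be MiB)

# Integer division: the remainder below 1 Mi is dropped.
used_mi=$(( used_ki / 1024 ))
over_request_mi=$(( used_mi - request_mi ))
over_process_mi=$(( used_mi - process_mi ))

echo "used:              ${used_mi} Mi"
echo "over k8s request:  ${over_request_mi} Mi"
echo "over process.size: ${over_process_mi} Mi"
```

This gives 30720 Mi used (30 Gi), which is 4608 Mi over the Kubernetes request
and 5120 Mi over process.size, consistent with the "~5 GB" estimate in the
report.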
