Hi Team,

We hit an issue after upgrading our job from Flink 1.12 to 1.15. After job
restarts, we consistently see an akka.remote.OversizedPayloadException:

Transient association error (association remains live)
akka.remote.OversizedPayloadException: Discarding oversized payload sent to
Actor[akka.tcp://flink@xxx/user/rpc/taskmanager_0#-311495648]: max allowed
size 10485760 bytes, actual size of encoded class
org.apache.flink.runtime.rpc.messages.RemoteRpcInvocation was 33670549
bytes.


In the job, we changed the Kafka consumer from FlinkKafkaConsumer to the
new KafkaSource, and we found a Stack Overflow question (
https://stackoverflow.com/questions/75363084/jobs-stuck-while-trying-to-restart-from-a-checkpoint
) describing the _metadata file size doubling on every restart after that
same change.
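
For context, our migration looked roughly like the sketch below (simplified;
the broker, topic, group id, and string deserialization are placeholders, and
our real pipeline is more involved):

    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.connector.kafka.source.KafkaSource;
    import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class KafkaSourceMigrationSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Old (1.12): FlinkKafkaConsumer added via env.addSource(new FlinkKafkaConsumer<>(...)).
            // New (1.15): KafkaSource added via env.fromSource(...).
            KafkaSource<String> source = KafkaSource.<String>builder()
                    .setBootstrapServers("broker:9092")                 // placeholder
                    .setTopics("events")                                // placeholder
                    .setGroupId("my-job")                               // placeholder
                    .setStartingOffsets(OffsetsInitializer.committedOffsets())
                    .setValueOnlyDeserializer(new SimpleStringSchema())
                    .build();

            DataStream<String> stream =
                    env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source");

            stream.print();
            env.execute("kafka-source-migration-sketch");
        }
    }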

We later checked the _metadata for our own job, and it did grow
significantly with each restart (it was around 128 MB when we hit the Akka
error). Is there a known root cause for this problem, and what can we do to
eliminate it?
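
In case it is useful, this is roughly how we inspected the checkpoint
metadata (a rough sketch using Flink's internal Checkpoints utility; the
_metadata path is a placeholder and error handling is omitted):

    import java.io.DataInputStream;
    import java.io.FileInputStream;

    import org.apache.flink.runtime.checkpoint.Checkpoints;
    import org.apache.flink.runtime.checkpoint.OperatorState;
    import org.apache.flink.runtime.checkpoint.metadata.CheckpointMetadata;

    public class MetadataSizeSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder path to the checkpoint's _metadata file.
            String path = "/checkpoints/<job-id>/chk-42/_metadata";

            try (DataInputStream in = new DataInputStream(new FileInputStream(path))) {
                CheckpointMetadata metadata = Checkpoints.loadCheckpointMetadata(
                        in, MetadataSizeSketch.class.getClassLoader(), path);

                // Print the reported state size per operator, to see which one keeps growing.
                for (OperatorState op : metadata.getOperatorStates()) {
                    System.out.println(op.getOperatorID() + " -> " + op.getStateSize() + " bytes");
                }
            }
        }
    }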


Best,
Wei
