Hi Team,

We hit an issue after upgrading our job from Flink 1.12 to 1.15: there is a consistent akka.remote.OversizedPayloadException after every job restart:
    Transient association error (association remains live) akka.remote.OversizedPayloadException: Discarding oversized payload sent to Actor[akka.tcp://flink@xxx/user/rpc/taskmanager_0#-311495648]: max allowed size 10485760 bytes, actual size of encoded class org.apache.flink.runtime.rpc.messages.RemoteRpcInvocation was 33670549 bytes.

As part of the upgrade we changed the Kafka consumer from FlinkKafkaConsumer to the new KafkaSource, and we noticed a Stack Overflow question ( https://stackoverflow.com/questions/75363084/jobs-stuck-while-trying-to-restart-from-a-checkpoint ) describing how the _metadata file size kept doubling after that same change. We then checked the _metadata for our own job, and it did indeed grow substantially with each restart (it was around 128 MB when we hit the Akka error).

Is there a known root cause for this problem, and what can we do to eliminate it?

Best,
Wei
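P.S. To help frame the question: I assume the knob governing the immediate symptom is akka.framesize in flink-conf.yaml, since its default matches the 10485760-byte limit quoted in the exception. The stopgap we are considering looks roughly like this (the 50 MiB value is just our guess, not a recommendation from the docs):

```yaml
# flink-conf.yaml (Flink 1.15)
# Default is 10485760b (10 MiB), the exact limit quoted in the exception.
# Raising it only buys headroom; it does not stop _metadata from growing.
akka.framesize: 52428800b
```

Would raising this be a safe temporary workaround, or does it merely postpone the failure while the checkpoint metadata keeps doubling?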