I have a streaming Flink job that runs 24/7 on a Kubernetes cluster hosted
in AWS. Every few weeks, or sometimes months, the job fails with network
errors like the ones in the log excerpt below. This is with Flink 1.14.5.

Is there anything I can do to help my application automatically retry
and recover from this type of error? Do newer versions of Flink handle
this issue any better?
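
To make the question concrete: by "automatically retry" I mean something
like a restart strategy in flink-conf.yaml. The fragment below is only a
sketch. The keys are standard Flink options, but the attempt count and
delay are illustrative assumptions, and I do not know whether a restart
alone is enough to recover from this error.

```yaml
# Sketch only: the values are illustrative, not tuned recommendations.
# Restart the job a fixed number of times before giving up.
restart-strategy: fixed-delay
# Hypothetical attempt count.
restart-strategy.fixed-delay.attempts: 10
# Hypothetical pause between restart attempts.
restart-strategy.fixed-delay.delay: 30 s
```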

org.apache.flink.shaded.netty4.io.netty.channel.unix.Errors$NativeIoException:
readAddress(..) failed: Connection reset by peer
19:13:57.893 [Flink Netty Server (0) Thread 0] ERROR
org.apache.flink.runtime.io.network.netty.PartitionRequestQueue -
Encountered error while consuming partitions
org.apache.flink.shaded.netty4.io.netty.channel.unix.Errors$NativeIoException:
readAddress(..) failed: Connection reset by peer
19:13:57.894 [Flink Netty Server (0) Thread 0] ERROR
org.apache.flink.runtime.io.network.netty.PartitionRequestQueue -
Encountered error while consuming partitions
org.apache.flink.shaded.netty4.io.netty.channel.unix.Errors$NativeIoException:
readAddress(..) failed: Connection reset by peer

I see several similar questions on Stack Overflow, but none of them has a
helpful answer.

Thank you for any help.
