I have a streaming Flink job that runs 24/7 on a Kubernetes cluster hosted in AWS. Every few weeks, or sometimes months, the job fails with network errors like the ones in the logs below. This is with Flink 1.14.5.
Is there anything I can do to help my application automatically retry and recover from this type of error? Do newer versions of Flink handle this issue any better?

org.apache.flink.shaded.netty4.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
19:13:57.893 [Flink Netty Server (0) Thread 0] ERROR org.apache.flink.runtime.io.network.netty.PartitionRequestQueue - Encountered error while consuming partitions
org.apache.flink.shaded.netty4.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
19:13:57.894 [Flink Netty Server (0) Thread 0] ERROR org.apache.flink.runtime.io.network.netty.PartitionRequestQueue - Encountered error while consuming partitions
org.apache.flink.shaded.netty4.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer

I have seen several similar questions on Stack Overflow with no helpful answers. Thank you for any help.