I have a Flink 1.7 cluster using the "flink:1.7.2" (OpenJDK build
1.8.0_222-b10) image on Kubernetes.

As part of a MasterRestoreHook (for checkpointing) the JobManager needs to
communicate with an external security service.  This all works well until
there's a DNS lookup failure (due to network issues), at which point the
JobManager JVM seems unable to ever resolve the name again, even after it's
confirmed that DNS service has been restored.  The weird thing is that I
can use kubectl to exec into the JobManager pod and perform the lookup
successfully, even while the JobManager JVM is still failing to resolve it.
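
Purely as a debugging aid, this is the kind of minimal standalone probe I
have in mind for comparing the JVM's view against kubectl exec (hypothetical
class, nothing Flink-specific, just the same plain InetAddress API the hook
ultimately goes through):

import java.net.InetAddress;
import java.net.UnknownHostException;

public class DnsProbe {
    public static void main(String[] args) throws InterruptedException {
        String host = args.length > 0 ? args[0] : "example.com";
        // Resolve the same name once per second, printing the outcome, so
        // the JVM's resolver behaviour can be compared side by side with a
        // lookup run via kubectl exec in the same pod.
        while (true) {
            try {
                InetAddress[] addrs = InetAddress.getAllByName(host);
                System.out.println("resolved " + host + " -> "
                        + addrs[0].getHostAddress());
            } catch (UnknownHostException e) {
                System.out.println("failed to resolve " + host + ": "
                        + e.getMessage());
            }
            Thread.sleep(1000);
        }
    }
}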

Has anybody seen an issue like this before, or have any suggestions?  As
far as I'm aware Flink doesn't install a SecurityManager, and therefore the
JVM should only cache failed name lookups for 10 seconds (the default
networkaddress.cache.negative.ttl).
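
For reference, here's a minimal sketch of how those JVM-level cache
settings can be inspected or overridden; the defaults mentioned in the
comments assume a stock OpenJDK 8 java.security file:

import java.security.Security;

public class DnsCacheSettings {
    public static void main(String[] args) {
        // Both properties print null when unset in java.security; the JDK
        // then falls back to its internal defaults: 30s positive caching
        // without a SecurityManager (-1, i.e. forever, with one) and 10s
        // negative caching.
        System.out.println("networkaddress.cache.ttl = "
                + Security.getProperty("networkaddress.cache.ttl"));
        System.out.println("networkaddress.cache.negative.ttl = "
                + Security.getProperty("networkaddress.cache.negative.ttl"));

        // Overrides must happen before the JVM performs its first lookup,
        // e.g. to disable negative caching entirely:
        // Security.setProperty("networkaddress.cache.negative.ttl", "0");
    }
}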

Restarting the JobManager JVM does successfully recover the Job, but I'd
like to avoid having to do that if possible.

Caused by: java.net.UnknownHostException: <********>.com: Temporary failure
in name resolution
at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929)
at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1324)
at java.net.InetAddress.getAllByName0(InetAddress.java:1277)
at java.net.InetAddress.getAllByName(InetAddress.java:1193)
at java.net.InetAddress.getAllByName(InetAddress.java:1127)

Thanks in advance,

David
