Large, complicated DNS infrastructure setups, coupled with varying application behavior, failure recovery scenarios, and long-lived, large-scale systems that are slow to migrate, can lead to surprising and undesirable behavior when interacting with java.net.InetAddress's implementation of name resolution.
Multiple in-flight requests for the same host using InetAddress.NameServiceAddresses [0] are serialized on a lock [1]. The first request calls the system resolver [2]. Each additional request made before the resolver returns shares the same NameServiceAddresses instance, and each blocks in InetAddress.NameServiceAddresses.get until it acquires the lock. Once all in-flight queries have completed, the NameServiceAddresses instance dies.

One potential difficulty arises when the resolver takes longer to complete than the interval between name resolution requests. Normally, we expect subsequent name resolution requests to take less time than the first request, but that might not be the case. When it isn't, failures can be amplified because all queries for the same host are serialized.

Imagine that your DNS infrastructure requires you to disable the networkaddress caches because the minimum enabled value, 1 second, is too large. If the system resolver takes 1 second, subsequent resolver calls also take 1 second, and queries arrive at a rate of more than one per second, then each query takes longer than the one before it: every call has to wait for its own resolver call to complete, plus the time it spends waiting for the lock. One can see how this might lead to service outages.

There are multiple potential ways to deal with the NameServiceAddresses lock.

One way is to provide finer-grained expiry for the cache. Instead of specifying the cache timeouts in integer seconds, provide some hook to specify them in milliseconds. Leaving caching enabled but allowing smaller timeouts could provide the right level of control to avoid some of these complicated issues at scale.

Another way is to simply have all in-flight queries return the same result. NameServiceAddresses instances would die more quickly: only the first call to getAddressesFromNameService would be made, and the result would be cached in the object.
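
For discussion, here is a minimal, standalone sketch of that second approach. It is not the JDK code and not a patch: SharedLookup, its get method, and the resolver function are hypothetical names I made up for illustration, and the resolver stands in for getAddressesFromNameService. It uses a CompletableFuture instead of a lock, which is my own choice for brevity.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

/**
 * Hypothetical helper (not part of the JDK): callers asking for the same
 * host while a lookup is in flight share that lookup's result.
 */
final class SharedLookup<V> {
    // One entry per host that currently has a lookup in progress.
    private final ConcurrentMap<String, CompletableFuture<V>> inFlight =
            new ConcurrentHashMap<>();
    // Stand-in for the real resolver call; an assumption of this sketch.
    private final Function<String, V> resolver;

    SharedLookup(Function<String, V> resolver) {
        this.resolver = resolver;
    }

    V get(String host) {
        CompletableFuture<V> mine = new CompletableFuture<>();
        CompletableFuture<V> existing = inFlight.putIfAbsent(host, mine);
        if (existing != null) {
            // A lookup is already running: wait for its result rather than
            // resolving again. A failed lookup surfaces here as a
            // CompletionException wrapping the original exception.
            return existing.join();
        }
        try {
            V result = resolver.apply(host);
            mine.complete(result);
            return result;
        } catch (RuntimeException | Error e) {
            // Wake up the waiters with the same failure.
            mine.completeExceptionally(e);
            throw e;
        } finally {
            // Nothing is kept beyond the in-flight window, so the next
            // request for this host starts a fresh lookup.
            inFlight.remove(host, mine);
        }
    }
}
```

With something like this, a burst of N concurrent requests for one host costs one resolver call instead of N serialized calls, while a request that arrives after the lookup has finished still goes to the resolver.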
Each subsequent call made during that time would quickly return the result cached in the NameServiceAddresses instance, and the instance would quickly die. New getByName requests would create NameServiceAddresses instances as usual.

Other, larger-scale rewrites are also possible, but given how battle-tested the current implementation is and the potential dangers here, I imagine something smaller might be more acceptable.

Thoughts?

-john

[0] https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/net/InetAddress.java#L1689C8-L1695C15
[1] https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/net/InetAddress.java#L1043
[2] https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/net/InetAddress.java#L1060