Large, complicated DNS infrastructure, combined with varying application
behavior, failure recovery scenarios, and long-lived, large-scale systems that
are slow to migrate, can lead to surprising, undesirable behavior when
interacting with java.net.InetAddress's implementation of name resolution.

Multiple in-flight requests for the same host using 
InetAddress.NameServiceAddresses [0] are serialized on a lock [1]. The first 
request calls the system resolver [2]. Each additional request made before the 
resolver returns with a response shares the same instance of a 
NameServiceAddresses, each blocking in InetAddress.NameServiceAddresses.get
until it gets the lock. Once all in-flight queries have completed, the 
InetAddress.NameServiceAddresses dies.

One potential difficulty arises when the resolver takes longer to complete
than the interval between name resolution requests. Normally, we expect
subsequent name resolution requests to take less time than the first request,
but that might not be the case. When it isn't, failures can be amplified
because all queries for the same host are serialized.

Imagine that your DNS infrastructure requires you to disable the
networkaddress caches because the minimum enabled value, 1 second, is too
large. If the system resolver takes 1 second, subsequent resolver calls also
take 1 second, and queries arrive at a rate of more than one per second, each
successive query takes longer than the one before it: every call has to wait
for its own resolver call to complete, plus the time it spends waiting for the
lock. One can see how this might lead to service outages.
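
As a rough illustration of that arithmetic, here is a small model. It is not
the JDK code: the class and method names are made up, and it assumes every
request needs its own resolver call, with calls running strictly one at a
time.

```java
final class SerializedLookupModel {
    /**
     * Latency in ms seen by each of `count` requests, assuming request k
     * arrives at k * arrivalIntervalMs and every request needs its own
     * resolver call of resolverMs, with calls running one at a time.
     */
    static long[] latencies(long resolverMs, long arrivalIntervalMs, int count) {
        long[] result = new long[count];
        long previousFinish = 0;
        for (int k = 0; k < count; k++) {
            long arrival = k * arrivalIntervalMs;
            // A request cannot start resolving until the previous one is done.
            long start = Math.max(arrival, previousFinish);
            long finish = start + resolverMs;
            result[k] = finish - arrival;
            previousFinish = finish;
        }
        return result;
    }
}
```

With a 1000 ms resolver and one request every 500 ms,
latencies(1000, 500, 5) gives 1000, 1500, 2000, 2500, 3000: each request waits
500 ms longer than the one before it, and the backlog never drains while the
arrival rate stays above the resolver's rate.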

There are multiple potential ways to deal with the NameServiceAddresses lock.

One way is to offer finer-grained expiry for the cache. Instead of taking
the cache timeouts in integer seconds, add some hook that accepts them in
milliseconds. Leaving caching enabled with smaller timeouts could give the
right level of control to avoid some of these complicated issues at scale.
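
To make the idea concrete, here is a minimal application-level sketch of a
cache with millisecond expiry. Everything in it is hypothetical
(MillisTtlCache, the injected resolver and clock); it is not a proposed patch
to InetAddress, and it uses plain strings for addresses so that it runs
without touching the network.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

final class MillisTtlCache {
    private static final class Entry {
        final String[] addresses;
        final long expiresAtMillis;

        Entry(String[] addresses, long expiresAtMillis) {
            this.addresses = addresses;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final ConcurrentHashMap<String, Entry> entries = new ConcurrentHashMap<>();
    private final Function<String, String[]> resolver; // stand-in for the real lookup
    private final LongSupplier clockMillis;            // injectable so expiry can be tested
    private final long ttlMillis;

    MillisTtlCache(Function<String, String[]> resolver, LongSupplier clockMillis, long ttlMillis) {
        this.resolver = resolver;
        this.clockMillis = clockMillis;
        this.ttlMillis = ttlMillis;
    }

    String[] lookup(String host) {
        Entry cached = entries.get(host);
        if (cached != null && clockMillis.getAsLong() < cached.expiresAtMillis) {
            return cached.addresses; // still fresh
        }
        // Missing or expired: resolve again. Concurrent misses are not
        // de-duplicated here; this sketch only shows the expiry granularity.
        String[] fresh = resolver.apply(host);
        entries.put(host, new Entry(fresh, clockMillis.getAsLong() + ttlMillis));
        return fresh;
    }
}
```

With ttlMillis set to 250, for example, a host queried many times per second
would reach the resolver at most about four times per second per caller
thread, rather than on every call.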

Another way is to simply have all in-flight queries return the same result.
NameServiceAddresses instances would die more quickly: only the first call to
getAddressesFromNameService would be made, and its result would be cached in
the object. Each subsequent call made during that time would quickly return
the result cached in the NameServiceAddresses instance, and the instance would
quickly die. New getByName requests would create NameServiceAddresses
instances as usual.
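
A minimal sketch of that behavior, outside the JDK and under stated
assumptions: SharedInFlightLookup and its resolver parameter are hypothetical
names, addresses are plain strings, and a CompletableFuture stands in for the
result held by the in-flight instance. The first caller for a host performs
the lookup; callers that arrive while it is running wait for that one result
instead of queueing up their own resolver calls.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

final class SharedInFlightLookup {
    private final ConcurrentHashMap<String, CompletableFuture<String[]>> inFlight =
            new ConcurrentHashMap<>();
    private final Function<String, String[]> resolver; // stand-in for the real lookup

    SharedInFlightLookup(Function<String, String[]> resolver) {
        this.resolver = resolver;
    }

    String[] lookup(String host) {
        CompletableFuture<String[]> mine = new CompletableFuture<>();
        CompletableFuture<String[]> existing = inFlight.putIfAbsent(host, mine);
        if (existing != null) {
            // Someone else is already resolving this host: share their result.
            // A failed lookup surfaces here as a CompletionException.
            return existing.join();
        }
        try {
            String[] result = resolver.apply(host);
            mine.complete(result);
            return result;
        } catch (RuntimeException | Error e) {
            mine.completeExceptionally(e); // wake the waiters with the failure
            throw e;
        } finally {
            // The entry lives only while the lookup is in flight, so the
            // next request after completion resolves again.
            inFlight.remove(host, mine);
        }
    }
}
```

Nothing is retained once the lookup completes, so this is de-duplication of
concurrent requests rather than caching: with a 1 second resolver, a waiter
is delayed by at most the remainder of the call already in progress.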

Other, larger-scale rewrites are also possible, but given how battle-tested
the current implementation is and the potential dangers here, I imagine
something smaller might be more acceptable.

Thoughts?

-john

[0] 
https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/net/InetAddress.java#L1689C8-L1695C15
[1] 
https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/net/InetAddress.java#L1043
[2] 
https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/net/InetAddress.java#L1060
