[
https://issues.apache.org/jira/browse/GEODE-9880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17524636#comment-17524636
]
Donal Evans commented on GEODE-9880:
------------------------------------
Some preliminary findings and questions following investigation of this issue
and talking with [~burcham], who knows membership code better than Patrick or
me:
On the client, if we have a locator with only an IP address defined and the
same locator is returned in the locator response with only a hostname defined,
then it is not possible to detect the duplicate without either a forward or
reverse lookup using DNS. Because of this, there is no way to prevent the
hostname-only locator from being added to the list of locators on the client
and then being used and causing the NPE first described.
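To illustrate why the duplicate cannot be detected by comparison alone, here is a minimal sketch using plain java.net types. It is not Geode's locator-list code; the class name, the address 10.0.0.1, the name locator1.example and port 10334 are all made up for illustration. Two unresolved addresses for the same machine compare as different, so both end up in the list.

```java
import java.net.InetSocketAddress;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch, not Geode code: shows that an IP-only entry and a
// hostname-only entry cannot be recognised as duplicates without a lookup.
class LocatorDedupSketch {

  // Returns a copy of the known locators with the newly reported one added.
  // A Set drops the new entry only if equals() says it is already present.
  static Set<InetSocketAddress> merge(Set<InetSocketAddress> known,
      InetSocketAddress fromResponse) {
    Set<InetSocketAddress> merged = new LinkedHashSet<>(known);
    merged.add(fromResponse);
    return merged;
  }

  public static void main(String[] args) {
    // createUnresolved() never performs a DNS lookup.
    InetSocketAddress byIp = InetSocketAddress.createUnresolved("10.0.0.1", 10334);
    InetSocketAddress byName =
        InetSocketAddress.createUnresolved("locator1.example", 10334);

    Set<InetSocketAddress> known = new LinkedHashSet<>();
    known.add(byIp);

    // Both entries survive, even if they refer to the same locator.
    System.out.println(merge(known, byName));
  }
}
```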
If hostname-for-clients is configured and set to be an IP address, we follow
the code path shown in [the stack trace in this
comment|https://issues.apache.org/jira/browse/GEODE-9880?focusedCommentId=17460501&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17460501].
SNIHostName requires a valid domain name to be passed into the constructor in
SocketCreator. We attempt to resolve IP addresses to hostnames prior to
invoking the SNIHostName constructor, but if we can't, then we use the IP
address as a hostname. For IPv4, this succeeds, because the format of an IPv4
address is the same as the format of a valid domain name (characters separated
by periods), and so we're able to create the SNIHostName and set it (even
though we may not be using SNI). For IPv6, the constructor will throw, as seen
in the above stack trace.
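The IPv4/IPv6 difference can be reproduced outside Geode with a small sketch that calls the javax.net.ssl.SNIHostName constructor directly. The class name and the sample addresses (10.0.0.1 and 2001:db8::1) are assumptions for illustration only; this is not the SocketCreator code path itself.

```java
import javax.net.ssl.SNIHostName;

// Hypothetical sketch, not Geode code: probes which strings the
// SNIHostName constructor accepts.
class SniHostNameSketch {

  // Returns true if SNIHostName accepts the string, false if the
  // constructor rejects it with IllegalArgumentException.
  static boolean accepted(String host) {
    try {
      new SNIHostName(host);
      return true;
    } catch (IllegalArgumentException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // Digits separated by periods look like a legal host name.
    System.out.println("IPv4 literal accepted: " + accepted("10.0.0.1"));
    // Colons are not legal host name characters, so this is rejected.
    System.out.println("IPv6 literal accepted: " + accepted("2001:db8::1"));
  }
}
```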
From 1.14 onward, the code in both these areas has been reworked
significantly, so it appears that the originally described NPE may not be
possible, although the client may still be unable to contact the locator or
hit an exception elsewhere.
Questions:
Should we make the use of SNIHostName conditional on whether SNI is actually
in use? This might allow the hostname-for-clients workaround to work in IPv6
environments, but it would not solve the problem if the user wanted to use SNI
*and* could not resolve hostnames to IP addresses, or vice versa, on the client.
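As a rough sketch of what the first question is proposing, assuming a hypothetical sniEnabled flag (how Geode would actually decide whether SNI is in use is not shown here, and the class and method names are made up): the SNIHostName is only constructed when SNI is wanted, so an IPv6 literal no longer throws in the non-SNI case.

```java
import java.util.Collections;
import java.util.List;
import javax.net.ssl.SNIHostName;
import javax.net.ssl.SNIServerName;

// Hypothetical sketch, not Geode code: only build an SNIHostName when
// the (assumed) sniEnabled flag says SNI is actually in use.
class ConditionalSniSketch {

  // Returns the server names to set on the SSL parameters.
  // With SNI disabled the host string is never validated as a host name.
  static List<SNIServerName> serverNamesFor(String host, boolean sniEnabled) {
    if (!sniEnabled) {
      return Collections.emptyList();
    }
    // Still throws IllegalArgumentException for an IPv6 literal.
    return Collections.singletonList(new SNIHostName(host));
  }

  public static void main(String[] args) {
    System.out.println(serverNamesFor("2001:db8::1", false));
    System.out.println(serverNamesFor("locator1.example", true));
  }
}
```

Note that the SNI-enabled branch keeps the current behaviour, which is the remaining gap described in the question.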
Should working name resolution be required in all cases? Is it a valid
configuration of Geode to allow clients to connect to a cluster without being
able to access the DNS used by members of the cluster?
> Cluster with multiple locators in an environment with no host name
> resolution, leads to null pointer exception
> --------------------------------------------------------------------------------------------------------------
>
> Key: GEODE-9880
> URL: https://issues.apache.org/jira/browse/GEODE-9880
> Project: Geode
> Issue Type: Bug
> Components: locator, membership
> Affects Versions: 1.12.5
> Reporter: Tigran Ghahramanyan
> Assignee: Patrick Johnson
> Priority: Major
> Labels: blocks-1.12.10, blocks-1.15.0, membership,
> pull-request-available
>
> In our use case we have two locators that are initially configured with IP
> addresses, but the _AutoConnectionSourceImpl.UpdateLocatorList()_ flow keeps
> adding their corresponding host names to the locators list, even though these
> host names are not resolvable.
> Later, in {_}AutoConnectionSourceImpl.queryLocators(){_}, whenever a client
> tries to use such a non-resolvable host name to connect to a locator, it tries
> to establish a connection to {_}sockaddr=0.0.0.0{_}, as written in
> {_}SocketCreator.connect(){_}, which seems strange.
> Then, if there is no locator running on the same host, the next locator in
> the list is contacted, until reaching a locator contact configured with IP
> address - which succeeds eventually.
> But when there happens to be a locator listening on the same host, we get a
> null pointer exception on the second line below, because _inetadd=null_:
> _socket.connect(sockaddr, Math.max(timeout, 0)); // sockaddr=0.0.0.0,
> connects to a locator listening on the same host_
> _configureClientSSLSocket(socket, inetadd.getHostName(), timeout); // inetadd
> = null_
>
> As a result, the cluster comes to a failed state, unable to recover.
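The null address in the description above can be illustrated with plain java.net types. This is a minimal sketch rather than the SocketCreator code: the class and method names, the host locator1.example and port 10334 are assumptions. An unresolved InetSocketAddress has no InetAddress, so calling getHostName() on it without a null check would fail in the way described.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;

// Hypothetical sketch, not Geode code: an unresolved socket address
// carries no InetAddress, so it must be null-checked before use.
class UnresolvedAddressSketch {

  // Returns the host name of the resolved address, or null when the
  // socket address is unresolved (getAddress() returns null).
  static String hostNameOrNull(InetSocketAddress address) {
    InetAddress inetAddress = address.getAddress();
    return inetAddress == null ? null : inetAddress.getHostName();
  }

  public static void main(String[] args) {
    // createUnresolved() performs no DNS lookup, mimicking a host name
    // that the client cannot resolve.
    InetSocketAddress unresolved =
        InetSocketAddress.createUnresolved("locator1.example", 10334);
    System.out.println("unresolved: " + unresolved.isUnresolved());
    System.out.println("host name: " + hostNameOrNull(unresolved));
  }
}
```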