On Sun, 18 Sep 2022 11:52:28 GMT, Jaikiran Pai <j...@openjdk.org> wrote:

> Can I please get a review of this test only change which proposes to fix the 
> recent intermittent failures in `RmiBootstrapTest` reported in 
> https://bugs.openjdk.org/browse/JDK-8030616?
> 
> The test has been intermittently failing with `cannot find a free port after 
> 10 tries`.  The tests uses the `jdk.test.lib.Utils.getFreePort()` utility 
> method to get a free port to use in the tests. That port is then used to 
> create and bind a JMX connector server. The issue resides in the fact that 
> the `getFreePort` utility uses loopback address to identify a free port, 
> whereas the JMX connector server uses that identified port to bind to a 
> non-loopback address - there's logic in `sun.rmi.transport.tcp.TCPEndpoint` 
> line 117 which calls `InetAddress.getLocalHost()` which can and does return a 
> non-loopback address. This effectively means that the port that was 
> identified as free (on loopback) may not really be free on (some other 
> address) and a subsequent attempt to bind against it by the connector server 
> will end up with a `BindException`.
> 
> Data collected in failures on the CI system has shown that this is indeed the 
> case and the port that was chosen (for loopback) as free was already used by 
> another process on a different address. The test additionally has logic to 
> attempt retries upto a maximum of 10 times to find a new free port on such 
> `BindException`. Turns out the next free port (on loopback) returned by 
> `jdk.test.lib.Utils.getFreePort()` is incremental and it so happens that the 
> systems where this failed had a process listening on a range of 10 to 12 
> ports, so these too ended up with `BindException` when the JMX connector 
> server used that port for a different address.
> 
> The commit here makes sure that the JMX connector server in the test is now 
> configured to loopback address as the address to bind to. That way the test 
> uses the correct address (loopback) on which the port was free.
> 
> The commit also has a change to the javadoc of the test utility 
> `jdk.test.lib.Utils.getFreePort()` to clarify that it returns a free port on 
> loopback address. This now matches what the implementation of that method 
> does.
> 
> Multiple test runs after this change hasn't yet run into the failure.

While the change is reasonable, I’m not sure about your rationale.
The contributing factors to the intermittent failures are due to the flawed 
ephemeral port allocation strategy, that incurs a race condition to use the 
selected  ephemeral port within the test before the OS uses it. 
So from the time that the port has been acquired, then released via the 
autoclose of the serversocket, 
and then reused in the test, it is possible that the OS may have also relocated 
that ephemeral port.

This change doesn’t alter this race condition.

The are a number of significant attributes to the port allocation strategy:
The port allocation strategy acquires an ephemeral port from the OS while 
binding it to a ServerSocket. 
The significance of ServerSocket is it surreptitiously sets SO_REUSEADDR socket 
option on the server socket (Net.socket0() == Java_sun_nio_ch_Net_socket0). The 
fact that SO_REUSEADDR is set is important in this scenario.

The objective with using SO_REUSEADDR on a serversocket is allow for efficient 
server restarts due to
a shutdown for whatever reason. With SO_REUSEADDR set the restarting server, 
which typically uses a well known port, won’t get a BindException, while the 
port may still be associated with the “terminating” server i.e. it may not have 
been fully released by the OS, for general reuse.

Thus, this has a number of favourable consequences with your proposed change, 
as the address binding is the
same as that used for acquiring the port, so it is essential a “server restart 
scenario”, albeit with an
Ephemeral port. As such, there still exists the possibility that the port will 
be reused by the OS, where
conditions of heavy ephemeral port allocation is occurring.

As such, the change may reduce the rate of intermittent failure, but it won’t 
solve the issue fully.
There is a possibility that it may increase the rate of failure as many network 
tests now use
Loopback address rather than the local host address i.e. the primary configured 
IP address.
This will depend on the load on a test system, and the amount of concurrently 
executing network tests,
with the corresponding high volume of ephemeral port allocation.

Nonetheless, it is a reasonable change, provided there is no subtle change in 
test semantics. I think it
will ameliorate the intermittent failures.

-------------

PR: https://git.openjdk.org/jdk/pull/10322

Reply via email to