[
https://issues.apache.org/jira/browse/FLINK-39316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dominik Dębowczyk updated FLINK-39316:
--------------------------------------
Description:
{{BlobServer.getAddress()}} returns an unreachable address when the server is
bound to the wildcard address (0.0.0.0) and the machine's hostname resolves to
a non-local IP (e.g. a VPN address). This causes blob uploads to fail with
Connection reset during job submission.
h3. Root Cause
FLINK-38109 changed {{MiniCluster.createBlobServerAddress()}} from using the
Dispatcher's RPC hostname (typically localhost) to using
{{{}BlobServer.getAddress(){}}}. When the BlobServer binds to 0.0.0.0,
{{getAddress()}} falls back to {{InetAddress.getLocalHost()}}
{{InetAddress.getLocalHost()}} resolves the machine's hostname via DNS. On
machines with VPN software (e.g. corporate VPNs), the hostname can resolve to a
VPN-assigned IP that is not directly reachable on any local interface.
The TCP connection completes at the kernel level (routed through the VPN) but
the packets never reach the local BlobServer's accept queue. The BlobServer
never processes the request, and the client gets a Connection reset when
reading the response.
h3. How to Reproduce
Run any test that submits a job with user jars through the {{MiniCluster}} on a
machine where the hostname resolves to a non-local-interface IP:
{{mvn test -pl flink-table/flink-table-planner
-Dtest="FunctionITCase#testUsingAddJar"}}
Fails with:
{code:java}
Caused by: java.io.IOException: PUT operation failed: Connection reset
at o.a.f.runtime.blob.BlobClient.putInputStream(BlobClient.java:496)
at o.a.f.runtime.blob.BlobClient.uploadFile(BlobClient.java:545)
Caused by: java.net.SocketException: Connection reset
at
o.a.f.runtime.blob.BlobOutputStream.receiveAndCheckPutResponse(BlobOutputStream.java:175){code}
h3. Fix
Replace {{InetAddress.getLocalHost()}} with
{{InetAddress.getLoopbackAddress()}} in {{{}BlobServer.getAddress(){}}}. When a
server binds to 0.0.0.0 (all interfaces), the loopback address (127.0.0.1) is
always a valid way to reach it locally. This avoids the dependency on DNS
hostname resolution which is unreliable across different network configurations.
was:
{{BlobServer.getAddress()}} returns an unreachable address when the server is
bound to the wildcard address (0.0.0.0) and the machine's hostname resolves to
a non-local IP (e.g. a VPN address). This causes blob uploads to fail with
Connection reset during job submission.
h3. Root Cause
FLINK-38109 changed {{MiniCluster.createBlobServerAddress()}} from using the
Dispatcher's RPC hostname (typically localhost) to using
{{{}BlobServer.getAddress(){}}}. When the BlobServer binds to 0.0.0.0,
{{getAddress()}} falls back to {{InetAddress.getLocalHost()}}
{{InetAddress.getLocalHost()}} resolves the machine's hostname via DNS. On
machines with VPN software (e.g. corporate VPNs), the hostname can resolve to a
VPN-assigned IP that is not directly reachable on any local interface.
The TCP connection to 100.64.1.5 completes at the kernel level (routed through
the VPN) but the packets never reach the local BlobServer's accept queue. The
BlobServer never processes the request, and the client gets a Connection reset
when reading the response.
h3. How to Reproduce
Run any test that submits a job with user jars through the {{MiniCluster}} on a
machine where the hostname resolves to a non-local-interface IP:
{{mvn test -pl flink-table/flink-table-planner
-Dtest="FunctionITCase#testUsingAddJar"}}
Fails with:
{code}
Caused by: java.io.IOException: PUT operation failed: Connection reset
at o.a.f.runtime.blob.BlobClient.putInputStream(BlobClient.java:496)
at o.a.f.runtime.blob.BlobClient.uploadFile(BlobClient.java:545)
Caused by: java.net.SocketException: Connection reset
at
o.a.f.runtime.blob.BlobOutputStream.receiveAndCheckPutResponse(BlobOutputStream.java:175){code}
h3. Fix
Replace {{InetAddress.getLocalHost()}} with
{{InetAddress.getLoopbackAddress()}} in {{BlobServer.getAddress()}}. When a
server binds to 0.0.0.0 (all interfaces), the loopback address (127.0.0.1) is
always a valid way to reach it locally. This avoids the dependency on DNS
hostname resolution which is unreliable across different network configurations.
> BlobServer.getAddress() uses InetAddress.getLocalHost() which fails with VPN
> networking
> ---------------------------------------------------------------------------------------
>
> Key: FLINK-39316
> URL: https://issues.apache.org/jira/browse/FLINK-39316
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 2.1.0, 2.2.0, 2.1.1
> Reporter: Dominik Dębowczyk
> Priority: Major
>
> {{BlobServer.getAddress()}} returns an unreachable address when the server is
> bound to the wildcard address (0.0.0.0) and the machine's hostname resolves
> to a non-local IP (e.g. a VPN address). This causes blob uploads to fail with
> Connection reset during job submission.
> h3. Root Cause
> FLINK-38109 changed {{MiniCluster.createBlobServerAddress()}} from using the
> Dispatcher's RPC hostname (typically localhost) to using
> {{{}BlobServer.getAddress(){}}}. When the BlobServer binds to 0.0.0.0,
> {{getAddress()}} falls back to {{InetAddress.getLocalHost()}}
> {{InetAddress.getLocalHost()}} resolves the machine's hostname via DNS. On
> machines with VPN software (e.g. corporate VPNs), the hostname can resolve to
> a VPN-assigned IP that is not directly reachable on any local interface.
> The TCP connection completes at the kernel level (routed through the VPN) but
> the packets never reach the local BlobServer's accept queue. The BlobServer
> never processes the request, and the client gets a Connection reset when
> reading the response.
> h3. How to Reproduce
> Run any test that submits a job with user jars through the {{MiniCluster}} on
> a machine where the hostname resolves to a non-local-interface IP:
> {{mvn test -pl flink-table/flink-table-planner
> -Dtest="FunctionITCase#testUsingAddJar"}}
> Fails with:
> {code:java}
> Caused by: java.io.IOException: PUT operation failed: Connection reset
> at o.a.f.runtime.blob.BlobClient.putInputStream(BlobClient.java:496)
> at o.a.f.runtime.blob.BlobClient.uploadFile(BlobClient.java:545)
> Caused by: java.net.SocketException: Connection reset
> at
> o.a.f.runtime.blob.BlobOutputStream.receiveAndCheckPutResponse(BlobOutputStream.java:175){code}
> h3. Fix
> Replace {{InetAddress.getLocalHost()}} with
> {{InetAddress.getLoopbackAddress()}} in {{{}BlobServer.getAddress(){}}}. When
> a server binds to 0.0.0.0 (all interfaces), the loopback address (127.0.0.1)
> is always a valid way to reach it locally. This avoids the dependency on DNS
> hostname resolution which is unreliable across different network
> configurations.
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)