Deng Liwen created FLINK-33406:
----------------------------------

             Summary: Flink Job failed due to losing connection from ZK server
                 Key: FLINK-33406
                 URL: https://issues.apache.org/jira/browse/FLINK-33406
             Project: Flink
          Issue Type: Bug
          Components: API / DataStream
    Affects Versions: 1.14.3
         Environment: Flink version: 1.14.3
                      Zookeeper version: 3.4.10
            Reporter: Deng Liwen


We are using Flink 1.14.3 and hit an issue when the connection to a ZooKeeper (ZK) server is lost: the Flink job connected to that ZK server fails immediately.

The failure reproduces 100% of the time when you kill the connected ZK server to simulate a connection-refused error. Flink jobs connected to the other, still-running ZK servers keep running as expected.

The log output is:
{code:java}
org.apache.flink.client.program.ProgramInvocationException: The main method caused an error: java.util.concurrent.TimeoutException
    at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:372)
    at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:222)
    at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:114)
    at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:812)
    at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:246)
    at org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:1054)
    at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1132)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1731)
    at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
    at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1132)
Caused by: java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException
    at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
    at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
    at org.apache.flink.client.program.StreamContextEnvironment.getJobExecutionResult(StreamContextEnvironment.java:123)
    at org.apache.flink.client.program.StreamContextEnvironment.execute(StreamContextEnvironment.java:80)
    at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1916)
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)