[ https://issues.apache.org/jira/browse/FLINK-18733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17165787#comment-17165787 ]
Leonid Ilyevsky commented on FLINK-18733:
-----------------------------------------

[~trohrmann] please see the two logs from that run, and also the modified config file that I used. Actually, both the jobmanager and the taskmanager had this error.

Before submitting this report, I did a little research, looking at the Flink code. To me it seems to be related to SASL: the client tries to do something with the Zookeeper cluster host addresses as if security were enabled. In my case I don't have any security on Zookeeper, so your Zookeeper client should not even go there. I tried to explicitly disable it by setting 'zookeeper.sasl.disable' to 'true', but it did not help.

I believe you can easily reproduce this issue; it does not even matter whether you have a Zookeeper cluster running, because the client didn't even get to the point of an actual connection.

Please see if there is any workaround.

> Jobmanager cannot start in HA mode with Zookeeper
> -------------------------------------------------
>
>                 Key: FLINK-18733
>                 URL: https://issues.apache.org/jira/browse/FLINK-18733
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.11.1
>            Reporter: Leonid Ilyevsky
>            Priority: Major
>         Attachments: flink-conf.yaml, flink-liquidnt-standalonesession-0-nj1dvloglab01.liquidnet.biz.log, flink-liquidnt-taskexecutor-0-nj1dvloglab01.liquidnet.biz.log
>
>
> When configured in HA mode, the Jobmanager cannot start at all. First, it issues warnings like this:
> {quote}{{2020-07-27 08:58:23,197 WARN org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Session 0x0 for server *nj1dvloglab01.liquidnet.biz/<unresolved>:2181*, unexpected error, closing socket connection and attempting reconnect}}
> {{java.lang.IllegalArgumentException: *Unable to canonicalize address* nj1dvloglab01.liquidnet.biz/<unresolved>:2181 because it's not resolvable}}
> {{ at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.SaslServerPrincipal.getServerPrincipal(SaslServerPrincipal.java:65) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]}}
> {{ at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.SaslServerPrincipal.getServerPrincipal(SaslServerPrincipal.java:41) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]}}
> {{ at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:1001) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]}}
> {{ at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1060) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]}}
> {quote}
> After a few attempts at connecting to Zookeeper, it finally fails:
> {quote}2020-07-27 08:59:35,055 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Fatal error occurred in the cluster entrypoint.
> org.apache.flink.util.FlinkException: Unhandled error in ZooKeeperLeaderElectionService: Ensure path threw exception
> at org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService.unhandledError(ZooKeeperLeaderElectionService.java:430) ~[flink-dist_2.12-1.11.1.jar:1.11.1]
> {quote}
>
> The same HA configuration works fine for me in Flink 1.10.0.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
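
For readers unfamiliar with the `<unresolved>` marker in the warning above, the following is a minimal JDK-only sketch of what an unresolved socket address looks like. It is not the ZooKeeper client's code and does not reproduce the failure itself; the class name `UnresolvedAddressSketch` and the helper `lacksResolvedAddress` are hypothetical, introduced only for illustration. It shows that an `InetSocketAddress` created without a DNS lookup carries a host string and port but no `InetAddress`, which is the kind of address any canonicalization step would have nothing to work with.

```java
import java.net.InetSocketAddress;

public class UnresolvedAddressSketch {

    /**
     * Hypothetical helper: true when the socket address carries no resolved
     * InetAddress, i.e. only a host string and a port are available.
     */
    static boolean lacksResolvedAddress(InetSocketAddress addr) {
        return addr.isUnresolved() && addr.getAddress() == null;
    }

    public static void main(String[] args) {
        // createUnresolved performs no DNS lookup, so this runs without network access.
        InetSocketAddress addr =
                InetSocketAddress.createUnresolved("nj1dvloglab01.liquidnet.biz", 2181);

        System.out.println("host string: " + addr.getHostString());
        System.out.println("port:        " + addr.getPort());
        System.out.println("unresolved:  " + addr.isUnresolved());
        System.out.println("address:     " + addr.getAddress());
        // The exact toString() format of an unresolved address depends on the JDK version.
        System.out.println("toString:    " + addr);
    }
}
```

Whether this is the actual cause of the exception reported here is an assumption; the sketch only demonstrates the JDK behaviour behind the term "unresolved".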