[jira] [Commented] (FLINK-18733) Jobmanager cannot start in HA mode with Zookeeper

Leonid Ilyevsky (Jira) Tue, 28 Jul 2020 07:23:55 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-18733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166458#comment-17166458
 ]


Leonid Ilyevsky commented on FLINK-18733:
-----------------------------------------

Hi [~trohrmann]

The addresses I am using are perfectly resolvable, if you are talking about IP 
address resolution on Linux level. In fact, it is the same set of machines. I 
am running Flink cluster on the same three machines where I am running my 
Zookeeper cluster.

The critical point is that when I reverted to version 1.10.0, the 
problem disappeared.

I don't think this problem has anything to do with the Linux hosts being 
resolved or not. As you can see in the error message, it happens inside a 
routine related to SASL, which I am not using and don't need.
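
One way to separate OS-level resolution (ping, nslookup) from what the JVM 
sees is to run a small check with the same Java installation that starts 
Flink. This is only a sketch: the class name {{JvmResolveCheck}} and the 
default port 2181 are illustrative assumptions, and it uses nothing beyond 
the standard {{java.net.InetSocketAddress}} API.

```java
import java.net.InetSocketAddress;

public class JvmResolveCheck {

    /**
     * Returns true if this JVM can resolve the given host.
     * The (String, int) constructor attempts resolution and flags the
     * address as unresolved if that attempt fails.
     */
    public static boolean resolvesInJvm(String host, int port) {
        return !new InetSocketAddress(host, port).isUnresolved();
    }

    public static void main(String[] args) {
        // Pass the ZooKeeper quorum hosts as arguments; 2181 is assumed here.
        for (String host : args) {
            String state = resolvesInJvm(host, 2181) ? "resolved" : "UNRESOLVED";
            System.out.println(host + " -> " + state);
        }
    }
}
```

If a host shows as resolved here but the Flink log still prints it as 
unresolved, the hostname itself is probably not the cause.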

 

You said it works when you use Flink's ZooKeeper support locally.  What exactly 
is it? A Zookeeper running inside Flink?

Then it fails when you configure it with an "unresolvable 
{{high-availability.zookeeper.quorum}} address". Did you actually use 
unresolvable hosts, ones you could not even ping? Obviously such a test would 
fail, no doubt.

 

Could you please perform the test closer to what I am doing? Run a simple 
Zookeeper cluster on the same machines where you run Flink.

 

I actually found the code where the exception is thrown: 
[https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/SaslServerPrincipal.java].
I guess this is not the exact version that you are using, so the line 
numbers might differ.

The first thing I noticed is that the comment says "Get the name of the server 
principal for a SASL client. This is visible for *testing purposes*". So is 
this supposed to be called only during tests? I am not sure what that means.

Then, inside the getServerPrincipal method, it retrieves the "canonicalize" 
flag, and apparently it got the value "true". Maybe this is the source of the 
issue? Maybe in Flink 1.10.0 it was "false" and there was no problem? I hope 
there is some workaround, such as setting a system property to make that 
flag false.
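
To make the suspicion concrete, here is a simplified sketch of the kind of 
check described above. It is a paraphrase, not the actual ZooKeeper source. 
The class {{CanonicalizeSketch}} and method {{serverHost}} are hypothetical, 
and the property name {{zookeeper.sasl.client.canonicalize.hostname}} is an 
assumption that should be verified against the ZooKeeper version shaded into 
Flink.

```java
import java.net.InetSocketAddress;

public class CanonicalizeSketch {

    // Assumed property name; check it against the bundled ZooKeeper client.
    public static final String PROP = "zookeeper.sasl.client.canonicalize.hostname";

    /** Mimics the check: canonicalize unless the flag is set to false. */
    public static String serverHost(InetSocketAddress addr) {
        // The flag defaults to true when the property is not set.
        boolean canonicalize = Boolean.parseBoolean(System.getProperty(PROP, "true"));
        if (!canonicalize) {
            // Use the configured host string as-is, without any lookup.
            return addr.getHostString();
        }
        // getAddress() returns null for an unresolved socket address.
        if (addr.getAddress() == null) {
            throw new IllegalArgumentException(
                "Unable to canonicalize address " + addr + " because it's not resolvable");
        }
        return addr.getAddress().getCanonicalHostName();
    }

    public static void main(String[] args) {
        // createUnresolved never performs a lookup, so getAddress() is null.
        InetSocketAddress addr = InetSocketAddress.createUnresolved("zk-host.example", 2181);
        try {
            serverHost(addr);
        } catch (IllegalArgumentException e) {
            System.out.println("flag true: " + e.getMessage());
        }
        System.setProperty(PROP, "false");
        System.out.println("flag false: " + serverHost(addr));
    }
}
```

If the assumed property name is right, passing 
{{-Dzookeeper.sasl.client.canonicalize.hostname=false}} to the JVM (for 
example through {{env.java.opts}} in flink-conf.yaml) might skip the 
canonicalization step. That is untested here.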

 

Thanks,

 

Leonid

 

> Jobmanager cannot start in HA mode with Zookeeper
> -------------------------------------------------
>
>                 Key: FLINK-18733
>                 URL: https://issues.apache.org/jira/browse/FLINK-18733
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.11.1
>            Reporter: Leonid Ilyevsky
>            Priority: Major
>         Attachments: flink-conf.yaml, 
> flink-liquidnt-standalonesession-0-nj1dvloglab01.liquidnet.biz.log, 
> flink-liquidnt-taskexecutor-0-nj1dvloglab01.liquidnet.biz.log
>
>
> When configured in HA mode, the Jobmanager cannot start at all. First, it 
> issues warnings like this:
> {quote}{{2020-07-27 08:58:23,197 WARN 
> org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - 
> Session 0x0 for server *nj1dvloglab01.liquidnet.biz/<unresolved>:2181*, 
> unexpected error, closing socket connection and attempting reconnect}}
>  {{java.lang.IllegalArgumentException: *Unable to canonicalize address* 
> nj1dvloglab01.liquidnet.biz/<unresolved>:2181 because it's not resolvable}}
>  {{ at 
> org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.SaslServerPrincipal.getServerPrincipal(SaslServerPrincipal.java:65)
>  ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]}}
>  {{ at 
> org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.SaslServerPrincipal.getServerPrincipal(SaslServerPrincipal.java:41)
>  ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]}}
>  {{ at 
> org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:1001)
>  ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]}}
>  {{ at 
> org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1060)
>  [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]}}
> {quote}
> After a few attempts to connect to Zookeeper, it finally fails:
> {quote}2020-07-27 08:59:35,055 ERROR 
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Fatal error 
> occurred in the cluster entrypoint.
>  org.apache.flink.util.FlinkException: Unhandled error in 
> ZooKeeperLeaderElectionService: Ensure path threw exception
>  at 
> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService.unhandledError(ZooKeeperLeaderElectionService.java:430)
>  ~[flink-dist_2.12-1.11.1.jar:1.11.1]
> {quote}
>  
> The same HA configuration works fine for me in Flink 1.10.0.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
