Hi Jan,

Thanks for answering!
I'm pretty sure the reason is related to the problem that Solr tries to connect to
"0.0.0.0", because it reads that IP from the /zookeeper/config znode of the ZooKeeper
ensemble. The connection I'm talking about is the one where
ZookeeperStatusHandler.getZkRawResponse(String zkHostPort, String fourLetterWordCommand)
tries to open a Socket to "0.0.0.0:2181". The connect eventually fails, but as I said,
this takes a long time. I did not debug any deeper, as at that point it is already JDK
code.

The timings for the valid ZooKeeper addresses (i.e. those from the static configuration
string) are listed further down. What causes problems is the attempt to connect to
0.0.0.0:2181:

/opt/solr-9.1.0$ export ZK_HOST=0.0.0.0:2181
/opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST -cmd get /zookeeper/config
WARN - 2022-12-15 06:57:44.828; org.apache.solr.common.cloud.SolrZkClient; Using default ZkCredentialsInjector. ZkCredentialsInjector is not secure, it creates an empty list of credentials which leads to 'OPEN_ACL_UNSAFE' ACLs to Zookeeper nodes
INFO - 2022-12-15 06:57:44.852; org.apache.solr.common.cloud.ConnectionManager; Waiting up to 30000ms for client to connect to ZooKeeper
Exception in thread "main" org.apache.solr.common.SolrException: java.util.concurrent.TimeoutException: Could not connect to ZooKeeper 0.0.0.0:2181 within 30000 ms
        at org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:225)
        at org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:137)
        at org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:120)
        at org.apache.solr.cloud.ZkCLI.main(ZkCLI.java:260)
Caused by: java.util.concurrent.TimeoutException: Could not connect to ZooKeeper 0.0.0.0:2181 within 30000 ms
        at org.apache.solr.common.cloud.ConnectionManager.waitForConnected(ConnectionManager.java:297)
        at org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:216)
        ... 3 more

real    0m31.728s
user    0m3.284s
sys     0m0.226s

Of course this call is expected to fail, but it was not a problem before (Solr 8.11.1):
there the call also failed, but it returned fast.
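Just to make the blocking connect concrete, here is roughly what I believe happens per
server (a simplified sketch of my understanding, not the actual Solr source; the
hard-coded host and port, the "ruok" four-letter word and the 1 second timeout are only
illustrative):

import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class ZkFourLetterProbe {
  public static void main(String[] args) throws Exception {
    String host = "0.0.0.0"; // address as it appears in /zookeeper/config
    int port = 2181;
    long start = System.nanoTime();
    try (Socket socket = new Socket()) {
      // With no (or a very large) timeout, this connect is where the time is spent;
      // an explicit short timeout makes the failure visible immediately.
      socket.connect(new InetSocketAddress(host, port), 1000);
      try (OutputStream out = socket.getOutputStream();
           InputStream in = socket.getInputStream()) {
        out.write("ruok".getBytes(StandardCharsets.US_ASCII));
        out.flush();
        System.out.println(new String(in.readAllBytes(), StandardCharsets.US_ASCII));
      }
    } catch (Exception e) {
      System.out.println("connect failed: " + e);
    }
    System.out.println("elapsed ms: " + (System.nanoTime() - start) / 1_000_000);
  }
}

Run against one of the real addresses this returns immediately; run against "0.0.0.0"
(and without the short explicit timeout) it should show the long wait described above.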
Here are the timings you asked for, for each of my 3 ZooKeeper nodes (commands adjusted
to my setup). The interesting part is the result of fetching /zookeeper/config, as it
shows the server configurations that contain the "0.0.0.0" addresses:

/opt/solr-9.1.0$ export ZK_HOST=192.168.0.109:2181
/opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST -cmd get /zookeeper/config
server.1=0.0.0.0:2888:3888:participant;0.0.0.0:2181
server.2=192.168.0.126:2888:3888:participant;0.0.0.0:2181
server.3=192.168.0.2:2888:3888:participant;0.0.0.0:2181
version=0

real    0m0.810s
user    0m3.142s
sys     0m0.148s

/opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST -cmd ls /solr/live_nodes
/solr/live_nodes (2)
 /solr/live_nodes/192.168.0.222:8983_solr (0)
 /solr/live_nodes/192.168.0.223:8983_solr (0)

real    0m0.838s
user    0m3.166s
sys     0m0.210s

/opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST -cmd get /solr/configs/cms_20221214_142242/stopwords.txt
# Licensed to the Apache Software Foundation (ASF) under one or more
# ...

real    0m0.836s
user    0m3.121s
sys     0m0.173s

/opt/solr-9.1.0$ export ZK_HOST=192.168.0.126:2181
/opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST -cmd get /zookeeper/config
server.1=192.168.0.109:2888:3888:participant;0.0.0.0:2181
server.2=0.0.0.0:2888:3888:participant;0.0.0.0:2181
server.3=192.168.0.2:2888:3888:participant;0.0.0.0:2181
version=0

real    0m0.843s
user    0m3.300s
sys     0m0.183s

/opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST -cmd ls /solr/live_nodes
/solr/live_nodes (2)
 /solr/live_nodes/192.168.0.222:8983_solr (0)
 /solr/live_nodes/192.168.0.223:8983_solr (0)

real    0m0.807s
user    0m3.035s
sys     0m0.164s

/opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST -cmd get /solr/configs/cms_20221214_142242/stopwords.txt
# Licensed to the Apache Software Foundation (ASF) under one or more
# ...

real    0m0.859s
user    0m3.354s
sys     0m0.177s

/opt/solr-9.1.0$ export ZK_HOST=192.168.0.2:2181
/opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST -cmd get /zookeeper/config
server.1=192.168.0.109:2888:3888:participant;0.0.0.0:2181
server.2=192.168.0.126:2888:3888:participant;0.0.0.0:2181
server.3=0.0.0.0:2888:3888:participant;0.0.0.0:2181
version=0

real    0m0.790s
user    0m2.838s
sys     0m0.154s

/opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST -cmd ls /solr/live_nodes
/solr/live_nodes (2)
 /solr/live_nodes/192.168.0.222:8983_solr (0)
 /solr/live_nodes/192.168.0.223:8983_solr (0)

real    0m0.861s
user    0m3.201s
sys     0m0.169s

/opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST -cmd get /solr/configs/cms_20221214_142242/stopwords.txt
# Licensed to the Apache Software Foundation (ASF) under one or more
# ...

real    0m0.779s
user    0m3.081s
sys     0m0.184s
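In case it is useful for reproducing this outside of zkcli.sh: the dynamic
configuration can also be read directly with the plain ZooKeeper client (a minimal
sketch; the connect string is just one of my nodes, and for brevity it does not wait
for the connection event before issuing the read):

import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.ZooKeeper;

public class DumpZkDynamicConfig {
  public static void main(String[] args) throws Exception {
    // Any reachable ensemble member works; 192.168.0.109 is one of my nodes.
    ZooKeeper zk = new ZooKeeper("192.168.0.109:2181", 15000, event -> {});
    try {
      // The same znode the zkcli.sh calls above read. Note that it exists
      // even when the ensemble runs with reconfigEnabled=false.
      byte[] data = zk.getData("/zookeeper/config", false, null);
      System.out.println(new String(data, StandardCharsets.UTF_8));
    } finally {
      zk.close();
    }
  }
}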
Thanks,
Michael

On Wed, Dec 14, 2022 at 10:08 PM Jan Høydahl <jan....@cominvent.com> wrote:

> Hi,
>
> We always check how the zookeeper ensemble is configured, and this check
> does not depend on whether dynamic reconfiguration is possible or not;
> it is simply to detect the common mistake that a 3 node ensemble is
> addressed with only one of the hosts in the static config, or with wrong
> host names.
>
> Sounds like your problem is not with how Solr talks to ZK, but in how you
> have configured your network. You say
>
> > But this will cause the socket connect to block when resolving
> > "0.0.0.0" which makes everything very slow.
>
> Can you elaborate on exactly which connection you are talking about here,
> and why/where it is blocking? Can you perhaps attempt a few commands from
> the command line to illustrate your point?
>
> Assuming you are on Linux, and have the 'time' command available, try this:
>
> export ZK_HOST=my-zookeeper:2181
> time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST -cmd get /zookeeper/config
> time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST -cmd ls /live_nodes
> time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST -cmd get /configs/_default/stopwords.txt
>
> What kind of timings do you see?
>
> Jan
>
> > 14. des. 2022 kl. 13:23 skrev michael dürr <due...@gmail.com>:
> >
> > Hi,
> >
> > Since we have updated to Solr 9.1, the admin UI has become pretty slow.
> >
> > The problem is related to the fact that we run Solr and the ZooKeeper
> > ensemble dockerized. As we cannot bind ZooKeeper from Docker to its
> > host's external IP address, we have to use "0.0.0.0" as the server
> > address, which causes problems when Solr tries to get the ZooKeeper
> > status (via /solr/admin/zookeeper/status).
> >
> > Some debugging showed that ZookeeperStatusHandler.getZkStatus() always
> > tries to get the dynamic configuration from ZooKeeper in order to check
> > whether it contains all hosts of Solr's static ZooKeeper configuration
> > string. But this will cause the socket connect to block when resolving
> > "0.0.0.0" which makes everything very slow.
> >
> > The check whether ZooKeeper allows dynamic reconfiguration is based on
> > the existence of the znode /zookeeper/config, which does not seem to be
> > a good approach, as this znode exists even when the ZooKeeper ensemble
> > does not allow dynamic reconfiguration (reconfigEnabled=false).
> >
> > Can anybody suggest some simple way to avoid that blocking (i.e. the
> > dynamic configuration check) in order to get the status request to
> > return fast again?
> >
> > It would be nice to have a configuration parameter that disables this
> > check independent of the ZooKeeper ensemble status, especially as
> > reconfigEnabled=false is the default setting for ZooKeeper.
> >
> > Thanks,
> > Michael