Hi Jan, Thanks for answering!
> I don't know how you run these zk's dockerized, but I'd look for a workaround where you can configure the correct address in zk's configuration. Then Solr will be happy. There exists a workaround where you can assign certain address ranges to a docker host to run each of its containers with an external ip the container can bind to: https://solr.apache.org/guide/solr/latest/deployment-guide/docker-networking.html Anyways, this is pretty cumbersome and not really necessary as the communication between zookeeper nodes perfectly works by locally binding to "0.0.0.0". The problem only occurs when using the admin ui, because of the fact that (as mentioned in my former email) solr tries to connect to "0.0.0.0" as it reads that IP from the /zookeeper/config znode of the zookeeper ensemble. That causes solr to call ZookeeperStatusHandler.getZkRawResponse(String zkHostPort, String fourLetterWordCommand) with "0.0.0.0:2181". This does not happen for any other scenarios as solr won't use "0.0.0.0" (dynamic configuration is disabled by default) but only the IPs from the configured zkHost string (which does not mention "0.0.0.0" but only valid IPs). So I'm just wondering why solr ui tries to use IPs read from /zookeeper/config znode even in case the zookeeper config does explicitly disable dynamic reconfiguration (reconfigEnabled=false). I'd expect solr to respect the zookeeper config and not try to resolve addresses other than those configured in zkHost string. > Are you saying that in 8.11, the test with zkcli.sh to 0.0.0.0:2181 returns immediately instead of after 30s? Yes, I tested it now with 8.11.1 and the call immediately returns an exception (trace below). After 30 seconds the call returns due to a timeout. But in contrast to solr 9.1. the call for 8.11 immediately returns an exception. So the solr ui also gets immediately responsive. root@solr1:/opt/solr# export ZK_HOST=0.0.0.0:2181 root@solr1:/opt/solr# time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST -cmd get /zookeeper/config INFO - 2023-02-03 06:19:30.748; org.apache.solr.common.cloud.ConnectionManager; Waiting for client to connect to ZooKeeper WARN - 2023-02-03 06:19:30.759; org.apache.zookeeper.ClientCnxn; Session 0x0 for sever 0.0.0.0/0.0.0.0:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException. => java.net.ConnectException: Connection refused at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?] at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:777) ~[?:?] at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:344) ~[zookeeper-3.6.2.jar:3.6.2] at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1275) ~[zookeeper-3.6.2.jar:3.6.2] WARN - 2023-02-03 06:19:31.867; org.apache.zookeeper.ClientCnxn; Session 0x0 for sever 0.0.0.0/0.0.0.0:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException. => === same exception multiple times and then... === WARN - 2023-02-03 06:20:01.692; org.apache.zookeeper.ClientCnxn; An exception was thrown while closing send thread for session 0x0. => java.net.ConnectException: Connection refused at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?] at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:777) ~[?:?] at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:344) ~[zookeeper-3.6.2.jar:3.6.2] at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1275) ~[zookeeper-3.6.2.jar:3.6.2] Exception in thread "main" org.apache.solr.common.SolrException: java.util.concurrent.TimeoutException: Could not connect to ZooKeeper 0.0.0.0:2181 within 30000 ms at org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:195) at org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:119) at org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:109) at org.apache.solr.cloud.ZkCLI.main(ZkCLI.java:196) Caused by: java.util.concurrent.TimeoutException: Could not connect to ZooKeeper 0.0.0.0:2181 within 30000 ms at org.apache.solr.common.cloud.ConnectionManager.waitForConnected(ConnectionManager.java:251) at org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:186) ... 3 more real 0m31.693s user 0m2.375s sys 0m0.199s Thanks, Michael On Thu, Feb 2, 2023 at 12:32 PM Jan Høydahl <jan....@cominvent.com> wrote: > Hi, > > Following up on this. I'd still say that the issue here seems to be that > your zookeeper config lists 0.0.0.0 as ip address for client connections. > > >>> The problem is related to the fact that we run solr and the zookeeper > >>> ensemble dockerized. As we cannot bind zookeeper from docker to its > host's > >>> external ip address, we have to use "0.0.0.0" as the server address > > I don't know how you run these zk's dockerized, but I'd look for a > workaround where you can configure the correct address in zk's > configuration. Then Solr will be happy. > > Are you saying that in 8.11, the test with zkcli.sh to 0.0.0.0:2181 > returns immediately instead of after 30s? > > Jan > > > 15. des. 2022 kl. 07:10 skrev michael dürr <due...@gmail.com>: > > > > Hi Jan, > > > > Thanks for answering! > > > > I'm pretty sure the reason is related to the problem that solr tries to > > connect to "0.0.0.0" as it reads that IP from the /zookeeper/config znode > > of the zookeeper ensemble. > > The connection I'm talking about is when > > ZookeeperStatusHandler.getZkRawResponse(String zkHostPort, String > > fourLetterWordCommand) tries to open a Socket to "0.0.0.0:2181". > > After a while the connect fails but as said this takes a long time. I did > > not debug deeper as this already is jdk code then. > > > > The timings for the valid zookeeper addresses (i.e. those from the static > > configuration string) are listed later. What causes problems is the > attempt > > to connect to 0.0.0.0:2181: > > > > /opt/solr-9.1.0$ export ZK_HOST=0.0.0.0:2181 > > /opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST > > -cmd get /zookeeper/config > > WARN - 2022-12-15 06:57:44.828; > org.apache.solr.common.cloud.SolrZkClient; > > Using default ZkCredentialsInjector. ZkCredentialsInjector is not secure, > > it creates an empty list of credentials which leads to 'OPEN_ACL_UNSAFE' > > ACLs to Zookeeper nodes > > INFO - 2022-12-15 06:57:44.852; > > org.apache.solr.common.cloud.ConnectionManager; Waiting up to 30000ms for > > client to connect to ZooKeeper > > Exception in thread "main" org.apache.solr.common.SolrException: > > java.util.concurrent.TimeoutException: Could not connect to ZooKeeper > > 0.0.0.0:2181 within 30000 ms > > at > > org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:225) > > at > > org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:137) > > at > > org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:120) > > at org.apache.solr.cloud.ZkCLI.main(ZkCLI.java:260) > > Caused by: java.util.concurrent.TimeoutException: Could not connect to > > ZooKeeper 0.0.0.0:2181 within 30000 ms > > at > > > org.apache.solr.common.cloud.ConnectionManager.waitForConnected(ConnectionManager.java:297) > > at > > org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:216) > > ... 3 more > > > > real 0m31.728s > > user 0m3.284s > > sys 0m0.226s > > > > Of course this will fail but this was not a problem before (solr 8.11.1). > > The call also failed but returned fast. > > > > Here the timings you are interested in for each of my 3 zookeeper nodes > > (adjusted to my setup). The interesting part are the results from > fetching > > the /zookeeper/config as it shows the server configurations that include > > the "0.0.0.0" addresses: > > > > /opt/solr-9.1.0$ export ZK_HOST=192.168.0.109:2181 > > > > /opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST > > -cmd get /zookeeper/config > > server.1=0.0.0.0:2888:3888:participant;0.0.0.0:2181 > > server.2=192.168.0.126:2888:3888:participant;0.0.0.0:2181 > > server.3=192.168.0.2:2888:3888:participant;0.0.0.0:2181 > > version=0 > > > > real 0m0.810s > > user 0m3.142s > > sys 0m0.148s > > > > /opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST > > -cmd ls /solr/live_nodes > > /solr/live_nodes (2) > > /solr/live_nodes/192.168.0.222:8983_solr (0) > > /solr/live_nodes/192.168.0.223:8983_solr (0) > > > > real 0m0.838s > > user 0m3.166s > > sys 0m0.210s > > > > /opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST > > -cmd get /solr/configs/cms_20221214_142242/stopwords.txt > > # Licensed to the Apache Software Foundation (ASF) under one or more > > # ... > > > > real 0m0.836s > > user 0m3.121s > > sys 0m0.173s > > > > /opt/solr-9.1.0$ export ZK_HOST=192.168.0.126:2181 > > > > /opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST > > -cmd get /zookeeper/config > > server.1=192.168.0.109:2888:3888:participant;0.0.0.0:2181 > > server.2=0.0.0.0:2888:3888:participant;0.0.0.0:2181 > > server.3=192.168.0.2:2888:3888:participant;0.0.0.0:2181 > > version=0 > > > > real 0m0.843s > > user 0m3.300s > > sys 0m0.183s > > > > /opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST > > -cmd ls /solr/live_nodes > > /solr/live_nodes (2) > > /solr/live_nodes/192.168.0.222:8983_solr (0) > > /solr/live_nodes/192.168.0.223:8983_solr (0) > > > > real 0m0.807s > > user 0m3.035s > > sys 0m0.164s > > > > /opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST > > -cmd get /solr/configs/cms_20221214_142242/stopwords.txt > > # Licensed to the Apache Software Foundation (ASF) under one or more > > # ... > > > > real 0m0.859s > > user 0m3.354s > > sys 0m0.177s > > > > export ZK_HOST=192.168.0.2:2181 > > > > /opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST > > -cmd get /zookeeper/config > > server.1=192.168.0.109:2888:3888:participant;0.0.0.0:2181 > > server.2=192.168.0.126:2888:3888:participant;0.0.0.0:2181 > > server.3=0.0.0.0:2888:3888:participant;0.0.0.0:2181 > > version=0 > > > > real 0m0.790s > > user 0m2.838s > > sys 0m0.154s > > > > /opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST > > -cmd ls /solr/live_nodes > > /solr/live_nodes (2) > > /solr/live_nodes/192.168.0.222:8983_solr (0) > > /solr/live_nodes/192.168.0.223:8983_solr (0) > > > > real 0m0.861s > > user 0m3.201s > > sys 0m0.169s > > > > /opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST > > -cmd get /solr/configs/cms_20221214_142242/stopwords.txt > > # Licensed to the Apache Software Foundation (ASF) under one or more > > # ... > > > > real 0m0.779s > > user 0m3.081s > > sys 0m0.184s > > > > Thanks, > > Michael > > > > On Wed, Dec 14, 2022 at 10:08 PM Jan Høydahl <jan....@cominvent.com> > wrote: > > > >> Hi, > >> > >> We always check how the zookeeper ensemble is configured, and this > >> check does not depend on whether dynamic reconfiguration is possible or > >> not, > >> it is simply to detect the common mistake that a 3 node ensemble is > >> addressed > >> with only one of the hosts in the static config, or with wrong host > names. > >> > >> Sounds like your problem is not with how Solr talks to ZK, but in how > you > >> have configured your network. You say > >> > >>> But this will cause the socket connect to block when resolving > >>> "0.0.0.0" which makes everything very slow. > >> > >> Can you elaborate on exactly which connection you are talking about > >> here, and why/where it is blocking? Can you perhaps attempt a few > commands > >> from the command line to illustrate your point? > >> > >> Assuming you are on Linux, and have the 'time' command available, try > this > >> > >> export ZK_HOST=my-zookeeper:2181 > >> time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST -cmd get > >> /zookeeper/config > >> time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST -cmd ls > /live_nodes > >> time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST -cmd get > >> /configs/_default/stopwords.txt > >> > >> What kind of timings do you see? > >> > >> Jan > >> > >>> 14. des. 2022 kl. 13:23 skrev michael dürr <due...@gmail.com>: > >>> > >>> Hi, > >>> > >>> Since we have updated to Solr 9.1, the admin ui has become pretty slow. > >>> > >>> The problem is related to the fact that we run solr and the zookeeper > >>> ensemble dockerized. As we cannot bind zookeeper from docker to its > >> host's > >>> external ip address, we have to use "0.0.0.0" as the server address > which > >>> causes problems when solr tries to get the zookeeper status (via > >>> /solr/admin/zookeeper/status) > >>> > >>> Some debugging showed that ZookeeperStatusHandler.getZkStatus() always > >>> tries to get the dynamic configuration from zookeeper in order to check > >>> whether it contains all hosts of solr's static zookeeper configuration > >>> string. But this will cause the socket connect to block when resolving > >>> "0.0.0.0" which makes everything very slow. > >>> > >>> The approach to check whether zookeeper allows for dynamic > >> reconfiguration > >>> is based on the existence of the znode /zookeeper/config which seems > not > >> to > >>> be a good approach as this znode will exist even in case the zookeeper > >>> ensemble does not allow dynamic reconfiguration > (reconfigEnabled=false). > >>> > >>> Can anybody suggest some simple action to avoid that blocking (i.e. the > >>> dynamic configuration check) in order to get the status request return > >> fast > >>> again? > >>> > >>> It would be nice to have a configuration parameter that disables this > >> check > >>> independent of the zookeeper ensemble status. Especially as > >>> reconfigEnabled=false is the default setting for zookeeper. > >>> > >>> Thanks, > >>> Michael > >> > >> > >