Hi Jan,

Thanks for answering!

I'm pretty sure the cause is that Solr tries to connect to "0.0.0.0",
because it reads that address from the /zookeeper/config znode of the
ZooKeeper ensemble.
The connection I'm talking about is the one made when
ZookeeperStatusHandler.getZkRawResponse(String zkHostPort, String
fourLetterWordCommand) tries to open a Socket to "0.0.0.0:2181".
The connect eventually fails, but as I said, this takes a long time. I did
not debug any deeper, because from that point on it is JDK code.
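
To time such a connect attempt outside of Solr, something like the
following could be used. This is only a rough sketch in Python and not the
code Solr runs; the names probe and timeout_s are made up for this
illustration, and the 2 second default is an arbitrary choice.

```python
import socket
import time


def probe(host, port, timeout_s=2.0):
    """Try a plain TCP connect and return (connected, elapsed_seconds)."""
    start = time.monotonic()
    try:
        # create_connection applies timeout_s to the connect attempt itself
        with socket.create_connection((host, port), timeout=timeout_s):
            connected = True
    except OSError:
        # covers refused connections, unreachable hosts and timeouts
        connected = False
    return connected, time.monotonic() - start
```

Calling probe("0.0.0.0", 2181) from the Solr host and comparing the elapsed
time with probe("192.168.0.109", 2181) should show whether the wildcard
address is what blocks. The result will depend on the OS and on the docker
network setup.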

The timings for the valid zookeeper addresses (i.e. those from the static
configuration string) are listed later. What causes problems is the attempt
to connect to 0.0.0.0:2181:

/opt/solr-9.1.0$ export ZK_HOST=0.0.0.0:2181
/opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST -cmd get /zookeeper/config
WARN  - 2022-12-15 06:57:44.828; org.apache.solr.common.cloud.SolrZkClient; Using default ZkCredentialsInjector. ZkCredentialsInjector is not secure, it creates an empty list of credentials which leads to 'OPEN_ACL_UNSAFE' ACLs to Zookeeper nodes
INFO  - 2022-12-15 06:57:44.852; org.apache.solr.common.cloud.ConnectionManager; Waiting up to 30000ms for client to connect to ZooKeeper
Exception in thread "main" org.apache.solr.common.SolrException: java.util.concurrent.TimeoutException: Could not connect to ZooKeeper 0.0.0.0:2181 within 30000 ms
        at org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:225)
        at org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:137)
        at org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:120)
        at org.apache.solr.cloud.ZkCLI.main(ZkCLI.java:260)
Caused by: java.util.concurrent.TimeoutException: Could not connect to ZooKeeper 0.0.0.0:2181 within 30000 ms
        at org.apache.solr.common.cloud.ConnectionManager.waitForConnected(ConnectionManager.java:297)
        at org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:216)
        ... 3 more

real    0m31.728s
user    0m3.284s
sys     0m0.226s

Of course this is expected to fail, but it was not a problem before (Solr
8.11.1): there the call also failed, but it returned quickly.

Here are the timings you asked for, for each of my 3 ZooKeeper nodes
(paths adjusted to my setup). The interesting part is the result of
fetching /zookeeper/config, as it shows the server configurations that
include the "0.0.0.0" addresses:

/opt/solr-9.1.0$ export ZK_HOST=192.168.0.109:2181

/opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST -cmd get /zookeeper/config
server.1=0.0.0.0:2888:3888:participant;0.0.0.0:2181
server.2=192.168.0.126:2888:3888:participant;0.0.0.0:2181
server.3=192.168.0.2:2888:3888:participant;0.0.0.0:2181
version=0

real    0m0.810s
user    0m3.142s
sys     0m0.148s

/opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST -cmd ls /solr/live_nodes
/solr/live_nodes (2)
 /solr/live_nodes/192.168.0.222:8983_solr (0)
 /solr/live_nodes/192.168.0.223:8983_solr (0)

real    0m0.838s
user    0m3.166s
sys     0m0.210s

/opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST -cmd get /solr/configs/cms_20221214_142242/stopwords.txt
# Licensed to the Apache Software Foundation (ASF) under one or more
# ...

real    0m0.836s
user    0m3.121s
sys     0m0.173s

/opt/solr-9.1.0$ export ZK_HOST=192.168.0.126:2181

/opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST -cmd get /zookeeper/config
server.1=192.168.0.109:2888:3888:participant;0.0.0.0:2181
server.2=0.0.0.0:2888:3888:participant;0.0.0.0:2181
server.3=192.168.0.2:2888:3888:participant;0.0.0.0:2181
version=0

real    0m0.843s
user    0m3.300s
sys     0m0.183s

/opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST -cmd ls /solr/live_nodes
/solr/live_nodes (2)
 /solr/live_nodes/192.168.0.222:8983_solr (0)
 /solr/live_nodes/192.168.0.223:8983_solr (0)

real    0m0.807s
user    0m3.035s
sys     0m0.164s

/opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST -cmd get /solr/configs/cms_20221214_142242/stopwords.txt
# Licensed to the Apache Software Foundation (ASF) under one or more
# ...

real    0m0.859s
user    0m3.354s
sys     0m0.177s

/opt/solr-9.1.0$ export ZK_HOST=192.168.0.2:2181

/opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST -cmd get /zookeeper/config
server.1=192.168.0.109:2888:3888:participant;0.0.0.0:2181
server.2=192.168.0.126:2888:3888:participant;0.0.0.0:2181
server.3=0.0.0.0:2888:3888:participant;0.0.0.0:2181
version=0

real    0m0.790s
user    0m2.838s
sys     0m0.154s

/opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST -cmd ls /solr/live_nodes
/solr/live_nodes (2)
 /solr/live_nodes/192.168.0.222:8983_solr (0)
 /solr/live_nodes/192.168.0.223:8983_solr (0)

real    0m0.861s
user    0m3.201s
sys     0m0.169s

/opt/solr-9.1.0$ time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST -cmd get /solr/configs/cms_20221214_142242/stopwords.txt
# Licensed to the Apache Software Foundation (ASF) under one or more
# ...

real    0m0.779s
user    0m3.081s
sys     0m0.184s
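
In every /zookeeper/config output above, the part after the ";" on each
server.N line is the client address, and it is "0.0.0.0:2181" for all three
servers. A small helper like the following could pick out those entries.
Again this is only an illustrative Python sketch based on the output format
shown above, not anything taken from Solr or ZooKeeper; the function names
are hypothetical.

```python
def client_addresses(config_text):
    """Map server id -> client address taken from /zookeeper/config output."""
    result = {}
    for line in config_text.splitlines():
        line = line.strip()
        if not line.startswith("server."):
            continue  # skips lines such as "version=0"
        key, _, value = line.partition("=")
        # value looks like "<quorum part>;<client host>:<client port>"
        _, sep, client = value.partition(";")
        if sep:
            result[key[len("server."):]] = client
    return result


def wildcard_servers(config_text):
    """Return the ids of servers whose client address uses host 0.0.0.0."""
    return sorted(
        server_id
        for server_id, client in client_addresses(config_text).items()
        if client.rsplit(":", 1)[0] == "0.0.0.0"
    )
```

Fed with any of the three outputs above, wildcard_servers reports all three
server ids, which are exactly the addresses the status check then tries to
reach.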

Thanks,
Michael

On Wed, Dec 14, 2022 at 10:08 PM Jan Høydahl <jan....@cominvent.com> wrote:

> Hi,
>
> We always check how the zookeeper ensemble is configured, and this
> check does not depend on whether dynamic reconfiguration is possible or
> not, it is simply to detect the common mistake that a 3 node ensemble is
> addressed with only one of the hosts in the static config, or with wrong
> host names.
>
> Sounds like your problem is not with how Solr talks to ZK, but in how you
> have configured your network. You say
>
> > But this will cause the socket connect to block when resolving
> > "0.0.0.0" which makes everything very slow.
>
> Can you elaborate on exactly which connection you are talking about
> here, and why/where it is blocking? Can you perhaps attempt a few commands
> from the command line to illustrate your point?
>
> Assuming you are on Linux, and have the 'time' command available, try this
>
> export ZK_HOST=my-zookeeper:2181
> time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST -cmd get /zookeeper/config
> time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST -cmd ls /live_nodes
> time server/scripts/cloud-scripts/zkcli.sh -z $ZK_HOST -cmd get /configs/_default/stopwords.txt
>
> What kind of timings do you see?
>
> Jan
>
> > On 14 Dec 2022, at 13:23, michael dürr <due...@gmail.com> wrote:
> >
> > Hi,
> >
> > Since we have updated to Solr 9.1, the admin ui has become pretty slow.
> >
> > The problem is related to the fact that we run solr and the zookeeper
> > ensemble dockerized. As we cannot bind zookeeper from docker to its
> > host's external ip address, we have to use "0.0.0.0" as the server
> > address which causes problems when solr tries to get the zookeeper
> > status (via /solr/admin/zookeeper/status)
> >
> > Some debugging showed that ZookeeperStatusHandler.getZkStatus() always
> > tries to get the dynamic configuration from zookeeper in order to check
> > whether it contains all hosts of solr's static zookeeper configuration
> > string. But this will cause the socket connect to block when resolving
> > "0.0.0.0" which makes everything very slow.
> >
> > The approach to check whether zookeeper allows for dynamic
> > reconfiguration is based on the existence of the znode
> > /zookeeper/config which seems not to be a good approach as this znode
> > will exist even in case the zookeeper ensemble does not allow dynamic
> > reconfiguration (reconfigEnabled=false).
> >
> > Can anybody suggest some simple action to avoid that blocking (i.e. the
> > dynamic configuration check) in order to get the status request return
> > fast again?
> >
> > It would be nice to have a configuration parameter that disables this
> > check independent of the zookeeper ensemble status. Especially as
> > reconfigEnabled=false is the default setting for zookeeper.
> >
> > Thanks,
> > Michael
>
>
