> Wouldn't it make more sense to set this value to $zkClientTimeout x $numServers?
I don't think it makes much sense to calculate it that way. The time needed to connect to ZooKeeper does not scale linearly with the number of servers (at least not for reasonable numbers). But I do agree that it should be configurable. Interestingly enough, there was a Jira ticket on this ( https://issues.apache.org/jira/plugins/servlet/mobile#issue/SOLR-7561) from *6* years ago. In it they wrote:

> Note that unlike other configuration stored in ZK, this probably needs to be bootstrapped via a System property.

I'm not completely sure why they wrote that; if anyone knows, please let me know :). Apart from this detail, it should be pretty straightforward to implement.

On Mon, Jul 12, 2021, 5:30 PM Bram Van Dam <bram.van...@intix.eu> wrote:

> Howdy,
>
> Not sure whether to send this to dev@ or user@, so I'll try user@ first.
>
> We've had a couple of instances of Solr not starting because a ZK
> connection couldn't be made in time. "Could not connect to ZooKeeper
> within 30000ms".
>
> While debugging this, I noticed that there are two timeouts:
> zkClientTimeout and zkClientConnectTimeout.
>
> zkClientTimeout is passed to ZK and is used by ZK itself. This is fine
> and is configurable.
>
> zkClientConnectTimeout is used by Solr when creating a ZK connection: if
> no connection can be made within zkClientConnectTimeout, Solr considers
> ZK to be dead.
>
> Where things get fishy is that zkClientConnectTimeout is hard coded in
> ZkContainer.java. It is set to 30 seconds, *unless* you're running
> *embedded* ZK with multiple ZKs -- then it is set to 24 hours.
>
> This basically means that if you're using an external ensemble, you're
> screwed if the first couple of connection attempts fail.
>
> Wouldn't it make more sense to set this value to $zkClientTimeout x
> $numServers? Or to make it configurable outright?
>
> Thanks,
>
> - Bram
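
For what it's worth, here is a rough sketch of what "bootstrapped via a System property" could look like for the connect timeout. This is not the code in ZkContainer.java and not a proposed patch: the class name, the method name, and the property name `zkClientConnectTimeout` are all made up for illustration, and the only value taken from this thread is the 30-second default.

```java
public class ConnectTimeoutSketch {
    // Hypothetical property name, chosen for illustration only;
    // not confirmed to be an existing Solr option.
    static final String PROP = "zkClientConnectTimeout";

    // The 30-second default described in the quoted message.
    static final int DEFAULT_MS = 30_000;

    // Parses a raw property value, falling back to the default when the
    // value is missing, blank, not a number, or not positive.
    static int resolveConnectTimeoutMs(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return DEFAULT_MS;
        }
        try {
            int value = Integer.parseInt(raw.trim());
            return value > 0 ? value : DEFAULT_MS;
        } catch (NumberFormatException e) {
            return DEFAULT_MS;
        }
    }

    public static void main(String[] args) {
        // Read from a JVM system property, e.g. -DzkClientConnectTimeout=60000
        int timeoutMs = resolveConnectTimeoutMs(System.getProperty(PROP));
        System.out.println("ZK connect timeout: " + timeoutMs + " ms");
    }
}
```

Started with `java -DzkClientConnectTimeout=60000 ConnectTimeoutSketch`, this would pick up 60000; with no property set it falls back to 30000. Falling back silently on a bad value is just one choice; failing fast at startup would be equally defensible.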