Howdy,

Not sure whether to send this to dev@ or user@, so I'll try user@ first.

We've had a couple of instances of Solr not starting because a ZK
connection couldn't be made in time: "Could not connect to ZooKeeper
within 30000ms".

While debugging this, I noticed that there are two timeouts:
zkClientTimeout and zkClientConnectTimeout.

zkClientTimeout is passed to ZK and is used by ZK itself. This is fine
and is configurable.

zkClientConnectTimeout is used by Solr when creating a ZK connection: if
no connection can be made within zkClientConnectTimeout, Solr considers
ZK to be dead.

Where things get fishy is that zkClientConnectTimeout is hard-coded in
ZkContainer.java. It is set to 30 seconds, *unless* you're running
*embedded* ZK with multiple ZKs -- then it is set to 24 hours.

This basically means that if you're using an external ensemble, you're
screwed if the first couple of connection attempts fail.

Wouldn't it make more sense to set this value to $zkClientTimeout x
$numServers? Or to make it configurable outright?
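
To make the first option concrete, here is a rough sketch of what I
have in mind. This is not the actual Solr code: the class and method
names are made up for illustration, and the 15000 ms value is just an
example. It assumes the usual zkHost format of comma-separated
host:port pairs with an optional chroot at the end.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Solr: derives a connect timeout
// from zkClientTimeout and the number of servers listed in zkHost.
public class ZkConnectTimeoutSketch {

    // Counts the servers in a zkHost string such as
    // "zk1:2181,zk2:2181,zk3:2181/solr".
    static int countServers(String zkHost) {
        // Drop an optional chroot suffix ("/solr") before splitting.
        int slash = zkHost.indexOf('/');
        String hosts = slash >= 0 ? zkHost.substring(0, slash) : zkHost;
        return (int) Arrays.stream(hosts.split(","))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .count();
    }

    // Proposed value: zkClientTimeout x numServers, in milliseconds.
    // Falls back to a single server if none could be parsed.
    static long connectTimeoutMs(String zkHost, int zkClientTimeoutMs) {
        int numServers = Math.max(1, countServers(zkHost));
        return (long) zkClientTimeoutMs * numServers;
    }

    public static void main(String[] args) {
        // Three servers with an example 15000 ms zkClientTimeout.
        System.out.println(
            connectTimeoutMs("zk1:2181,zk2:2181,zk3:2181/solr", 15000));
    }
}
```

With a three-node ensemble this would give each server roughly one
full zkClientTimeout before Solr gives up, instead of a fixed 30
seconds for the whole ensemble.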

Thanks,

 - Bram
