Howdy,

Not sure whether to send this to dev@ or user@, so I'll try user@ first.

We've had a couple of instances of Solr not starting because a ZK connection couldn't be made in time. The error was "Could not connect to ZooKeeper within 30000ms".

While debugging this, I noticed that there are two timeouts: zkClientTimeout and zkClientConnectTimeout.

zkClientTimeout is passed to ZK and is used by ZK itself. This is fine and is configurable.

zkClientConnectTimeout is used by Solr when creating a ZK connection: if no connection can be made within zkClientConnectTimeout, Solr considers ZK to be dead.

Where things get fishy is that zkClientConnectTimeout is hard coded in ZkContainer.java. It is set to 30 seconds, *unless* you're running *embedded* ZK with multiple ZKs -- then it is set to 24 hours. This basically means that if you're using an external ensemble, you're screwed if the first couple of connection attempts fail.

Wouldn't it make more sense to set this value to $zkClientTimeout x $numServers? Or to make it configurable outright?

Thanks,

- Bram
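
P.S. To make the two suggestions concrete, here is a rough Java sketch of how the connect timeout could be derived. It is not taken from ZkContainer.java: the class name, the method names, and the optional override parameter are all made up for illustration. It also assumes zkHost is a standard ZK connect string (comma-separated host:port pairs with an optional chroot suffix).

```java
public class ZkConnectTimeoutSketch {

    /**
     * Counts the servers in a ZK connect string such as
     * "zk1:2181,zk2:2181,zk3:2181/solr". Anything from the first '/'
     * onward is treated as the chroot and ignored.
     */
    static int countServers(String zkHost) {
        int chroot = zkHost.indexOf('/');
        String hosts = chroot >= 0 ? zkHost.substring(0, chroot) : zkHost;
        int count = 0;
        for (String host : hosts.split(",")) {
            if (!host.trim().isEmpty()) {
                count++;
            }
        }
        return count;
    }

    /**
     * Hypothetical replacement for the hard-coded value: use an explicit
     * override if one is configured, otherwise zkClientTimeout x numServers.
     */
    static long connectTimeoutMs(String zkHost, int zkClientTimeoutMs, Long overrideMs) {
        if (overrideMs != null) {
            return overrideMs;
        }
        // Guard against an empty host list so the timeout is never zero.
        int servers = Math.max(1, countServers(zkHost));
        return (long) zkClientTimeoutMs * servers;
    }

    public static void main(String[] args) {
        String zkHost = "zk1:2181,zk2:2181,zk3:2181/solr";
        // No override configured: 15000 ms x 3 servers.
        System.out.println(connectTimeoutMs(zkHost, 15000, null));
        // Explicit override wins over the derived value.
        System.out.println(connectTimeoutMs(zkHost, 15000, 60000L));
    }
}
```

With a three-node ensemble and a 15 second zkClientTimeout, this gives Solr 45 seconds to get a connection, so it scales with the size of the ensemble. Whether one full zkClientTimeout per server is the right budget is open for debate.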