Thanks, Jun, for the clarification. It sounds like this KIP is complementary to ZOOKEEPER-2184 and can move forward without it. We should still push hard for ZOOKEEPER-2184 to go through (I saw you commented on it earlier).
LGTM!

On 2 Nov. 2017 12:34 pm, "Jun Rao" <j...@confluent.io> wrote:

> Hi, Stephane,
>
> 3) The difference is that currently, there is no retry when re-creating the
> Zookeeper object when a ZK session expires. So, if the re-creation of
> Zookeeper fails, the broker just logs the error and the Zookeeper object
> will never be created again. With this KIP, we will keep retrying the
> creation of Zookeeper until success.
>
> Thanks,
>
> Jun
>
> On Tue, Oct 31, 2017 at 3:28 PM, Stephane Maarek <
> steph...@simplemachines.com.au> wrote:
>
> > Hi Jun,
> >
> > Thanks for the reply.
> >
> > 1) The reason I'm asking about it is I wonder if it's not worth focusing
> > the development efforts on taking ownership of the existing PR (
> > https://github.com/apache/zookeeper/pull/150) to fix ZOOKEEPER-2184,
> > rebase it and have it merged into the ZK codebase shortly. I feel this KIP
> > might introduce a setting that could be deprecated shortly and confuse the
> > end user a bit further with one more knob to turn.
> >
> > 3) I'm not sure if I fully understand, sorry for the beginner's question:
> > if the default timeout is infinite, then it won't change anything to how
> > Kafka works from today, does it? (unless I'm missing something sorry). If
> > not set to infinite, then we introduce the risk of a whole cluster shutting
> > down at once?
> >
> > Thanks,
> > Stephane
> >
> > On 31/10/17, 1:00 pm, "Jun Rao" <j...@confluent.io> wrote:
> >
> > Hi, Stephane,
> >
> > Thanks for the reply.
> >
> > 1) Fixing the issue in ZK will be ideal. Not sure when it will happen
> > though. Once it's fixed, we can probably deprecate this config.
> >
> > 2) That could be useful. Is there a java api to do that at runtime? Also,
> > invalidating DNS cache doesn't always fix the issue of unresolved host. In
> > some of the cases, human intervention is needed.
> >
> > 3) The default timeout is infinite though.
> >
> > Jun
> >
> > On Sat, Oct 28, 2017 at 11:48 PM, Stephane Maarek <
> > steph...@simplemachines.com.au> wrote:
> >
> > > Hi Jun,
> > >
> > > I think this is very helpful. Restarting Kafka brokers in case of zookeeper
> > > host change is not a well known operation.
> > >
> > > Few questions:
> > > 1) would it not be worth fixing the problem at the source ? This has been
> > > stuck for a while though, maybe a little push would help :
> > > https://issues.apache.org/jira/plugins/servlet/mobile#issue/ZOOKEEPER-2184
> > >
> > > 2) upon recreating the zookeeper object , is it not possible to invalidate
> > > the DNS cache so that it resolves the new hostname?
> > >
> > > 3) could the cluster be down in this situation: one migrates an entire
> > > zookeeper cluster to new machines (one by one). The quorum is still alive
> > > without downtime, but now every broker in a cluster can't resolve zookeeper
> > > at the same time. They all shut down at the same time after the new
> > > time-out setting.
> > >
> > > Thanks !
> > > Stéphane
> > >
> > > On 28 Oct. 2017 9:42 am, "Jun Rao" <j...@confluent.io> wrote:
> > >
> > > > Hi, Everyone,
> > > >
> > > > We created "KIP-217: Expose a timeout to allow an expired ZK session to be
> > > > re-created".
> > > >
> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-217%3A+Expose+a+timeout+to+allow+an+expired+ZK+session+to+be+re-created
> > > >
> > > > Please take a look and provide your feedback.
> > > >
> > > > Thanks,
> > > >
> > > > Jun
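
For anyone skimming the thread, here is a rough sketch of the behaviour discussed in point 3: retry re-creating the session until it succeeds, with an optional timeout that defaults to infinite. This is not Kafka's or ZooKeeper's actual code. The function name recreate_with_retry, its parameters, and the use of OSError to stand in for a failed connection or unresolved host are all assumptions made for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def recreate_with_retry(
    create: Callable[[], T],
    timeout_s: Optional[float] = None,  # None models the "infinite" default from the thread
    backoff_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Call create() until it succeeds or the optional timeout elapses.

    Hypothetical helper: `create` stands in for whatever re-creates the
    session object; OSError stands in for a connection or DNS failure.
    """
    deadline = None if timeout_s is None else clock() + timeout_s
    attempts = 0
    while True:
        attempts += 1
        try:
            return create()
        except OSError as err:
            # With a finite timeout, give up once the deadline has passed;
            # the caller then decides what to do (e.g. shut down).
            if deadline is not None and clock() >= deadline:
                raise TimeoutError(f"gave up after {attempts} attempts") from err
            sleep(backoff_s)


if __name__ == "__main__":
    failures_left = [3]  # simulate three failed attempts before success

    def flaky_create() -> str:
        if failures_left[0] > 0:
            failures_left[0] -= 1
            raise OSError("unresolved host (simulated)")
        return "new-session"

    # No real sleeping in the demo; the default timeout retries forever.
    print(recreate_with_retry(flaky_create, sleep=lambda _s: None))
```

With timeout_s left as None the loop never gives up, which matches the "default timeout is infinite" remark. A finite value reintroduces the scenario from question 3, where every caller could hit the deadline at about the same time.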