Ok. Based on the discussion, it seems that doing infinite re-creation is better. I will cancel the KIP.
Thanks,

Jun

On Thu, Nov 2, 2017 at 6:14 PM, Jeff Widman <j...@jeffwidman.com> wrote:

> +1 for permanent retry under the covers (without an exposed/later deprecated config).
>
> That said, I understand the reality that sometimes we have to workaround an unfixed issue in another project, so if you think best to expose a config, then I have no objections. Mainly I wanted to make sure you'd tried to get upstream to fix as that is almost always a cleaner solution.
>
> > The above fact implies some reluctance from the zookeeper community to fully solve the issue (maybe due to technical issues).
>
> @Ted - I spent some time a few months ago poking through issues on the ZK issue tracker, and it looked like there wasn't much activity on the project lately. So my guess is that it's less about problems with this particular solution, and more that the solution has just enough moving parts that no one with commit rights has had the time to review it. As a volunteer maintainer on a number of projects, I certainly empathize with them, although it would be nice to get some more committers onto the Zookeeper project who have the time to review some of these semi-abandoned PRs and either accept or reject them.
>
> On Thu, Nov 2, 2017 at 3:00 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
> > Stephane:
> > bq. hasn't acted in over a year
> >
> > The above fact implies some reluctance from the zookeeper community to fully solve the issue (maybe due to technical issues).
> > Anyway, we should plan on not relying on the fix to go through in the near future.
> >
> > As for Jun's latest suggestion, I think we should add periodic logging indicating the retry.
> >
> > A KIP is not needed if we go that route.
> >
> > Cheers
> >
> > On Thu, Nov 2, 2017 at 2:54 PM, Stephane Maarek <steph...@simplemachines.com.au> wrote:
> >
> > > Hi Jun
> > >
> > > I think this is a better option. Would that change require a kip then as it's not a change in public API ?
> > >
> > > @ted it was marked as a blocked for 3.4.11 but they pushed it. It seems that the owner of the pr hasn't acted in over a year and I think someone needs to take ownership of that. Additionally, this would be a change in Kafka zookeeper client dependency, so no need to update your zookeeper quorum to benefit from the change
> > >
> > > Thanks
> > > Stéphane
> > >
> > > On 3 Nov. 2017 8:45 am, "Jun Rao" <j...@confluent.io> wrote:
> > >
> > > Stephane, Jeff,
> > >
> > > Another option is to not expose the reconnect timeout config and just retry the creation of Zookeeper forever. This is an improvement from the current situation and if zookeeper-2184 is fixed in the future, we don't need to deprecate the config.
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Thu, Nov 2, 2017 at 9:02 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> > >
> > > > ZOOKEEPER-2184 is scheduled for 3.4.12 whose release is unknown.
> > > >
> > > > I think adding the session recreation on Kafka side should benefit Kafka users, especially those who don't plan to move to 3.4.12+ in the near future.
> > > >
> > > > On Wed, Nov 1, 2017 at 6:34 PM, Jun Rao <j...@confluent.io> wrote:
> > > >
> > > > > Hi, Stephane,
> > > > >
> > > > > 3) The difference is that currently, there is no retry when re-creating the Zookeeper object when a ZK session expires. So, if the re-creation of Zookeeper fails, the broker just logs the error and the Zookeeper object will never be created again. With this KIP, we will keep retrying the creation of Zookeeper until success.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > > On Tue, Oct 31, 2017 at 3:28 PM, Stephane Maarek <steph...@simplemachines.com.au> wrote:
> > > > >
> > > > > > Hi Jun,
> > > > > >
> > > > > > Thanks for the reply.
> > > > > >
> > > > > > 1) The reason I'm asking about it is I wonder if it's not worth focusing the development efforts on taking ownership of the existing PR (https://github.com/apache/zookeeper/pull/150) to fix ZOOKEEPER-2184, rebase it and have it merged into the ZK codebase shortly. I feel this KIP might introduce a setting that could be deprecated shortly and confuse the end user a bit further with one more knob to turn.
> > > > > >
> > > > > > 3) I'm not sure if I fully understand, sorry for the beginner's question: if the default timeout is infinite, then it won't change anything to how Kafka works from today, does it? (unless I'm missing something sorry). If not set to infinite, then we introduce the risk of a whole cluster shutting down at once?
> > > > > >
> > > > > > Thanks,
> > > > > > Stephane
> > > > > >
> > > > > > On 31/10/17, 1:00 pm, "Jun Rao" <j...@confluent.io> wrote:
> > > > > >
> > > > > > Hi, Stephane,
> > > > > >
> > > > > > Thanks for the reply.
> > > > > >
> > > > > > 1) Fixing the issue in ZK will be ideal. Not sure when it will happen though. Once it's fixed, we can probably deprecate this config.
> > > > > >
> > > > > > 2) That could be useful. Is there a java api to do that at runtime? Also, invalidating DNS cache doesn't always fix the issue of unresolved host. In some of the cases, human intervention is needed.
> > > > > >
> > > > > > 3) The default timeout is infinite though.
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > > On Sat, Oct 28, 2017 at 11:48 PM, Stephane Maarek <steph...@simplemachines.com.au> wrote:
> > > > > >
> > > > > > > Hi Jun,
> > > > > > >
> > > > > > > I think this is very helpful. Restarting Kafka brokers in case of zookeeper host change is not a well known operation.
> > > > > > >
> > > > > > > Few questions:
> > > > > > > 1) would it not be worth fixing the problem at the source ? This has been stuck for a while though, maybe a little push would help :
> > > > > > > https://issues.apache.org/jira/plugins/servlet/mobile#issue/ZOOKEEPER-2184
> > > > > > >
> > > > > > > 2) upon recreating the zookeeper object , is it not possible to invalidate the DNS cache so that it resolves the new hostname?
> > > > > > >
> > > > > > > 3) could the cluster be down in this situation: one migrates an entire zookeeper cluster to new machines (one by one). The quorum is still alive without downtime, but now every broker in a cluster can't resolve zookeeper at the same time. They all shut down at the same time after the new time-out setting.
> > > > > > >
> > > > > > > Thanks !
> > > > > > > Stéphane
> > > > > > >
> > > > > > > On 28 Oct. 2017 9:42 am, "Jun Rao" <j...@confluent.io> wrote:
> > > > > > >
> > > > > > > > Hi, Everyone,
> > > > > > > >
> > > > > > > > We created "KIP-217: Expose a timeout to allow an expired ZK session to be re-created".
> > > > > > > >
> > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-217%3A+Expose+a+timeout+to+allow+an+expired+ZK+session+to+be+re-created
> > > > > > > >
> > > > > > > > Please take a look and provide your feedback.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > > Jun
>
> --
> *Jeff Widman*
> jeffwidman.com <http://www.jeffwidman.com/> | 740-WIDMAN-J (943-6265)
> <><
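
For illustration only, the behavior the thread converges on (retry the re-creation forever, with periodic logging of the retry and no exposed timeout config) could look roughly like the sketch below. This is not Kafka's actual code. `RetryForever`, `createWithRetry`, `backoffMs`, and `logEvery` are hypothetical names, and the `factory` argument stands in for whatever call re-creates the ZooKeeper object after a session expires.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper (not part of Kafka): retries a factory call until it succeeds. */
final class RetryForever {

    private RetryForever() {}

    /**
     * Calls {@code factory} until it returns without throwing.
     *
     * @param factory   stand-in for the call that re-creates the client object
     * @param backoffMs pause between failed attempts, in milliseconds
     * @param logEvery  log one line every N failed attempts (must be positive)
     */
    static <T> T createWithRetry(Callable<T> factory, long backoffMs, int logEvery)
            throws InterruptedException {
        if (logEvery <= 0) {
            throw new IllegalArgumentException("logEvery must be positive");
        }
        int failures = 0;
        while (true) {
            try {
                return factory.call();
            } catch (InterruptedException e) {
                // Let shutdown interrupt the loop instead of retrying forever.
                throw e;
            } catch (Exception e) {
                failures++;
                // Periodic logging so operators can see the broker is still retrying.
                if (failures % logEvery == 0) {
                    System.err.println("Re-creation still failing after "
                            + failures + " attempts: " + e);
                }
                Thread.sleep(backoffMs);
            }
        }
    }
}
```

A caller would pass a lambda that builds the client, for example `RetryForever.createWithRetry(() -> connect(), 1000L, 10)`, where `connect()` is again a placeholder. Propagating `InterruptedException` is a design choice in this sketch so that a broker shutdown can still break out of the otherwise unbounded loop.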