Hi, This KIP isn't implemented yet but that sounds good.
Thanks! P On Mon, Apr 29, 2024 at 3:17 AM Federico Valeri <fedeval...@gmail.com> wrote: > Hi Philip, thanks for this KIP. As other mentioned, I think it is > useful in dynamic environment like Kubernetes. > > I was wondering if we should also update the "examples" modules with > this new non-retriable exception. Wdyt? > > On Fri, Apr 21, 2023 at 11:13 PM Philip Nee <philip...@gmail.com> wrote: > > > > Hey Jason, > > > > Thanks again. I've updated the KIP. > > > > P > > > > On Fri, Apr 21, 2023 at 9:56 AM Jason Gustafson > <ja...@confluent.io.invalid> > > wrote: > > > > > Hey Philip, that sounds good to me. > > > > > > -Jason > > > > > > > > > > > > On Thu, Apr 20, 2023 at 1:20 PM Philip Nee <philip...@gmail.com> > wrote: > > > > > > > Hey Jason, > > > > > > > > Thanks for the response. I agree with your suggestions. Just to > clarify, > > > we > > > > want the timeout, bootstrap.resolve.timeout.ms to only bound the DNS > > > > resolution time, and we can throw a fatal, > BootstrapResolutionException > > > (so > > > > not connection exception anymore) afterward. > > > > > > > > I think that aligns with the goal of this KIP. > > > > > > > > P > > > > > > > > On Thu, Apr 20, 2023 at 9:23 AM Jason Gustafson > > > <ja...@confluent.io.invalid > > > > > > > > > wrote: > > > > > > > > > Hey Philip, > > > > > > > > > > Yeah, I see your point. Here is the challenge I'm considering. > Today, > > > the > > > > > client does not forward connection-related errors to the > application. > > > It > > > > is > > > > > debatable whether it should, but that is the behavior that > applications > > > > > expect today. The only case of a fatal error is DNS resolution and > the > > > > > application must handle it because they cannot even construct the > > > client. > > > > > Once we have this KIP, we will introduce a separate fatal failure > > > > mechanism > > > > > through the `BootstrapConnectionException`. And we will use it not > only > > > > for > > > > > DNS failures, but general bootstrap connection failures . The > problem > > > is > > > > > that existing applications will not know that this should be > treated > > > as a > > > > > fatal error. So they may continue to retry. Does it help in this > > > > situation > > > > > to poison the client so that it cannot be used? I'm not sure. > Perhaps > > > it > > > > > would reduce the risk a bit if we change ` > > > > bootstrap.connection.timeout.ms` > > > > > to `bootstrap.resolve.timeout.ms` or something like that? Then we > > > retain > > > > > the current behavior if DNS succeeds, but connections fail. I know > I'm > > > > the > > > > > one that suggested generalizing the configuration, but I feel some > > > > > hesitation after thinking about it more. > > > > > > > > > > Thanks, > > > > > Jason > > > > > > > > > > On Wed, Apr 19, 2023 at 8:10 PM Philip Nee <philip...@gmail.com> > > > wrote: > > > > > > > > > > > Hey Jason, > > > > > > > > > > > > Thanks for your review. I think if we make it a retriable error, > > > does > > > > it > > > > > > make sense to have a configurable timeout still? as we expect the > > > user > > > > to > > > > > > continue to retry anyway. > > > > > > > > > > > > I'm considering the case of bad configuration. If the user > retries > > > the > > > > > > error, then we rely on the error/warning to alert the user. In > this > > > > > case, > > > > > > maybe we continue using the proposed behavior, i.e. warn on each > poll > > > > > after > > > > > > the timeout period. > > > > > > > > > > > > If you agree that a configuration is needed, maybe we can call > this > > > > > > *bootstrap.auto.retry.ms > > > > > > <http://bootstrap.auto.retry.ms> *instead, to indicate a > > > configurable > > > > > > period of automatic retry. What do you think? > > > > > > > > > > > > Cheers, > > > > > > P > > > > > > > > > > > > On Wed, Apr 19, 2023 at 7:17 PM Jason Gustafson > > > > > <ja...@confluent.io.invalid > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > Hey Phillip, > > > > > > > > > > > > > > The KIP looks good. 5 minutes seems like a reasonable > tradeoff. I > > > do > > > > > > wonder > > > > > > > if it is necessary to treat bootstrap timeout as a fatal error > > > > though. > > > > > It > > > > > > > seems possible that the exception might be caught by handlers > in > > > > > existing > > > > > > > applications which may not expect that the client needs to be > > > > > restarted. > > > > > > > Perhaps it would be safer to make it retriable? As long as the > > > > > > application > > > > > > > continues trying to use the client, we could continue trying to > > > reach > > > > > the > > > > > > > bootstrap servers perhaps? That would be closer to behavior > today > > > > which > > > > > > > only treats DNS resolution failures as fatal. What do you > think? > > > > > > > > > > > > > > Best, > > > > > > > Jason > > > > > > > > > > > > > > On Mon, Apr 10, 2023 at 1:53 PM Philip Nee < > philip...@gmail.com> > > > > > wrote: > > > > > > > > > > > > > > > Thanks, everyone: I'm starting a vote today. Here's the > recap > > > for > > > > > some > > > > > > > of > > > > > > > > the questions: > > > > > > > > > > > > > > > > John: I changed the proposal to throw a non-retriable > exception > > > > after > > > > > > the > > > > > > > > timeout elapses. I feel it might be necessary to poison the > > > client > > > > > > after > > > > > > > > retry expires, as it might indicate a real issue. > > > > > > > > Ismael: The proposal is to add a configuration for the retry > and > > > it > > > > > > will > > > > > > > > throw a non-retriable exception after the time expires. > > > > > > > > Chris: Addressed some unclarity that you mentioned, and a > new API > > > > > won't > > > > > > > be > > > > > > > > introduced in this KIP. Maybe up for future discussion. > > > > > > > > Jason: I'm proposing adding a timeout config and a bootstrap > > > > > exception > > > > > > > per > > > > > > > > your suggestion. > > > > > > > > Kirk: I'm proposing throwing a non-retriable exception in the > > > > network > > > > > > > > client. See previous comment. > > > > > > > > > > > > > > > > > > > > > > > > On Mon, Feb 27, 2023 at 9:36 AM Chris Egerton > > > > > <chr...@aiven.io.invalid > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > Hi Philip, > > > > > > > > > > > > > > > > > > Yeah, "DNS resolution should occur..." seems like a better > > > fit. > > > > 👍 > > > > > > > > > > > > > > > > > > One other question I have is whether we should expose some > kind > > > > of > > > > > > > public > > > > > > > > > API for performing preflight validation of the bootstrap > URLs. > > > If > > > > > we > > > > > > > > change > > > > > > > > > the behavior of a client configured with a silly typo > (e.g., > > > > > > > > > "loclahost instead of localhost") from failing in the > > > constructor > > > > > to > > > > > > > > > failing with a retriable exception, this might lead some > client > > > > > > > > > applications to handle that failure by, well, retrying. For > > > > > > reference, > > > > > > > > this > > > > > > > > > is exactly what we do in Kafka Connect right now; see [1] > and > > > > [2]. > > > > > > IMO > > > > > > > > it'd > > > > > > > > > be nice to be able to opt into keeping the current > behavior so > > > > that > > > > > > > > > projects like Connect could still do preflight checks of > the > > > > > > > > > bootstrap.servers property for connectors before starting > them, > > > > and > > > > > > > > report > > > > > > > > > any issues by failing fast instead of continuously writing > > > > > > > warning/error > > > > > > > > > messages to their logs. > > > > > > > > > > > > > > > > > > I'm not sure about where this new API could go, but a few > > > options > > > > > > might > > > > > > > > be: > > > > > > > > > > > > > > > > > > - Expose a public variant of the existing ClientUtils class > > > > > > > > > - Add static methods to the ConsumerConfig, > ProducerConfig, and > > > > > > > > > AdminClientConfig classes > > > > > > > > > - Add those same static methods to the KafkaConsumer, > > > > > KafkaProducer, > > > > > > > and > > > > > > > > > KafkaAdminClient classes > > > > > > > > > > > > > > > > > > If this seems reasonable, we should probably also specify > in > > > the > > > > > KIP > > > > > > > that > > > > > > > > > Kafka Connect will leverage this preflight validation logic > > > > before > > > > > > > > > instantiating any Kafka clients for use by connectors or > tasks, > > > > and > > > > > > > > > continue to fail fast if there are typos in the > > > bootstrap.servers > > > > > > > > property, > > > > > > > > > or if temporary DNS resolution issues come up. > > > > > > > > > > > > > > > > > > [1] - > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://github.com/apache/kafka/blob/5f9d01668cae64b2cacd7872d82964fa78862aaf/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L606 > > > > > > > > > [2] - > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://github.com/apache/kafka/blob/trunk/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L439 > > > > > > > > > > > > > > > > > > Cheers, > > > > > > > > > > > > > > > > > > Chris > > > > > > > > > > > > > > > > > > On Fri, Feb 24, 2023 at 4:59 PM Philip Nee < > > > philip...@gmail.com> > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > Hey Chris, > > > > > > > > > > > > > > > > > > > > Thanks for the quick response, and I apologize for the > > > unclear > > > > > > > wording > > > > > > > > > > there, I guess "DNS lookup" would be a more appropriate > > > wording > > > > > > here. > > > > > > > > So > > > > > > > > > > what I meant there was, to delegate the DNS lookup in the > > > > > > constructor > > > > > > > > to > > > > > > > > > > the network client poll, and it will happen on the very > first > > > > > poll. > > > > > > > I > > > > > > > > > > guess the logic could look like this: > > > > > > > > > > > > > > > > > > > > - if the client has been bootstrapped, do nothing. > > > > > > > > > > - Otherwise, perform DNS lookup, and acquire the > bootstrap > > > > server > > > > > > > > > address. > > > > > > > > > > > > > > > > > > > > Thanks for the comment there, I'll change up the wording. > > > > Maybe > > > > > > > revise > > > > > > > > > it > > > > > > > > > > as "DNS resolution should occur in the poll...." ? > > > > > > > > > > > > > > > > > > > > P > > > > > > > > > > > > > > > > > > > > On Fri, Feb 24, 2023 at 1:47 PM Chris Egerton > > > > > > > <chr...@aiven.io.invalid > > > > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > Hi Philip, > > > > > > > > > > > > > > > > > > > > > > Thanks for the KIP! > > > > > > > > > > > > > > > > > > > > > > QQ: In the "Proposed Changes" section, the KIP states > that > > > > > > > > > "Bootstrapping > > > > > > > > > > > should now occur in the poll method before attempting > to > > > > update > > > > > > the > > > > > > > > > > > metadata. This includes resolving the addresses and > > > > > bootstrapping > > > > > > > the > > > > > > > > > > > metadata.". By "bootstrapping the metadata" do we mean > > > > actually > > > > > > > > > > contacting > > > > > > > > > > > the bootstrap servers, or just setting some internal > state > > > > > > related > > > > > > > to > > > > > > > > > the > > > > > > > > > > > current set of servers that can be contacted for > metadata? > > > I > > > > > ask > > > > > > > > > because > > > > > > > > > > it > > > > > > > > > > > seems like the language here implies the former, but if > > > > that's > > > > > > the > > > > > > > > > case, > > > > > > > > > > > this is already happening in poll (or at least, the > first > > > > > > > invocation > > > > > > > > of > > > > > > > > > > > it), and if it's the latter, it's probably not > necessary to > > > > > > mention > > > > > > > > in > > > > > > > > > > the > > > > > > > > > > > KIP since it doesn't really impact user-facing > behavior. It > > > > > also > > > > > > > > seems > > > > > > > > > > like > > > > > > > > > > > that detail might impact how intertwined this and > KIP-899 > > > > are, > > > > > > > though > > > > > > > > > the > > > > > > > > > > > similarity could still be superficial either way. > > > > > > > > > > > > > > > > > > > > > > Cheers, > > > > > > > > > > > > > > > > > > > > > > Chris > > > > > > > > > > > > > > > > > > > > > > On Thu, Feb 23, 2023 at 9:21 PM Philip Nee < > > > > > philip...@gmail.com> > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > Hey Ismael, > > > > > > > > > > > > > > > > > > > > > > > > Thanks for the feedback! The proposal is not to retry > > > > > > > automatically > > > > > > > > > but > > > > > > > > > > > > relies on the user polling the NetworkClient > (basically, > > > > > > > > > consumer.poll) > > > > > > > > > > > to > > > > > > > > > > > > reattempt the bootstrap. If bootstrapping fails, a > > > > > > > NetworkException > > > > > > > > > > > > (retriable) will be thrown. > > > > > > > > > > > > > > > > > > > > > > > > Thanks! > > > > > > > > > > > > P > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Feb 23, 2023 at 1:34 PM Ismael Juma < > > > > > ism...@juma.me.uk > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > Thanks for the KIP. Not sure if I missed it, but > how > > > long > > > > > > will > > > > > > > we > > > > > > > > > > retry > > > > > > > > > > > > for > > > > > > > > > > > > > and when do we give up and propagate the failure > to the > > > > > user? > > > > > > > > > > > > > > > > > > > > > > > > > > Ismael > > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Feb 23, 2023 at 9:30 AM Philip Nee < > > > > > > > philip...@gmail.com> > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > Hi all! > > > > > > > > > > > > > > > > > > > > > > > > > > > > I want to start a discussion thread about how we > can > > > > > handle > > > > > > > > > client > > > > > > > > > > > > > > bootstrap failure due DNS lookup. This requires > a > > > bit > > > > of > > > > > > > > > > behavioral > > > > > > > > > > > > > > change, so a KIP is proposed and attached to this > > > > email. > > > > > > Let > > > > > > > me > > > > > > > > > > know > > > > > > > > > > > > what > > > > > > > > > > > > > > you think! > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > *A small remark here*: *As the title of this KIP > > > might > > > > > > sound > > > > > > > > > > > > > > familiar/similar to KIP-899, it is not the same.* > > > > > > > > > > > > > > > > > > > > > > > > > > > > *In Summary:* I want to propose a KIP to change > the > > > > > > existing > > > > > > > > > > > bootstrap > > > > > > > > > > > > > > (upon instantiation) strategy because it is > > > reasonable > > > > to > > > > > > > allow > > > > > > > > > > > clients > > > > > > > > > > > > > to > > > > > > > > > > > > > > retry > > > > > > > > > > > > > > > > > > > > > > > > > > > > *KIP: * > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-909%3A+Allow+Clients+to+Rebootstrap+Upon+Failed+DNS+Resolution > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks! > > > > > > > > > > > > > > Philip > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >