Hey Philip, that sounds good to me. -Jason
On Thu, Apr 20, 2023 at 1:20 PM Philip Nee <philip...@gmail.com> wrote: > Hey Jason, > > Thanks for the response. I agree with your suggestions. Just to clarify, we > want the timeout, bootstrap.resolve.timeout.ms to only bound the DNS > resolution time, and we can throw a fatal, BootstrapResolutionException (so > not connection exception anymore) afterward. > > I think that aligns with the goal of this KIP. > > P > > On Thu, Apr 20, 2023 at 9:23 AM Jason Gustafson <ja...@confluent.io.invalid > > > wrote: > > > Hey Philip, > > > > Yeah, I see your point. Here is the challenge I'm considering. Today, the > > client does not forward connection-related errors to the application. It > is > > debatable whether it should, but that is the behavior that applications > > expect today. The only case of a fatal error is DNS resolution and the > > application must handle it because they cannot even construct the client. > > Once we have this KIP, we will introduce a separate fatal failure > mechanism > > through the `BootstrapConnectionException`. And we will use it not only > for > > DNS failures, but general bootstrap connection failures . The problem is > > that existing applications will not know that this should be treated as a > > fatal error. So they may continue to retry. Does it help in this > situation > > to poison the client so that it cannot be used? I'm not sure. Perhaps it > > would reduce the risk a bit if we change ` > bootstrap.connection.timeout.ms` > > to `bootstrap.resolve.timeout.ms` or something like that? Then we retain > > the current behavior if DNS succeeds, but connections fail. I know I'm > the > > one that suggested generalizing the configuration, but I feel some > > hesitation after thinking about it more. > > > > Thanks, > > Jason > > > > On Wed, Apr 19, 2023 at 8:10 PM Philip Nee <philip...@gmail.com> wrote: > > > > > Hey Jason, > > > > > > Thanks for your review. I think if we make it a retriable error, does > it > > > make sense to have a configurable timeout still? as we expect the user > to > > > continue to retry anyway. > > > > > > I'm considering the case of bad configuration. If the user retries the > > > error, then we rely on the error/warning to alert the user. In this > > case, > > > maybe we continue using the proposed behavior, i.e. warn on each poll > > after > > > the timeout period. > > > > > > If you agree that a configuration is needed, maybe we can call this > > > *bootstrap.auto.retry.ms > > > <http://bootstrap.auto.retry.ms> *instead, to indicate a configurable > > > period of automatic retry. What do you think? > > > > > > Cheers, > > > P > > > > > > On Wed, Apr 19, 2023 at 7:17 PM Jason Gustafson > > <ja...@confluent.io.invalid > > > > > > > wrote: > > > > > > > Hey Phillip, > > > > > > > > The KIP looks good. 5 minutes seems like a reasonable tradeoff. I do > > > wonder > > > > if it is necessary to treat bootstrap timeout as a fatal error > though. > > It > > > > seems possible that the exception might be caught by handlers in > > existing > > > > applications which may not expect that the client needs to be > > restarted. > > > > Perhaps it would be safer to make it retriable? As long as the > > > application > > > > continues trying to use the client, we could continue trying to reach > > the > > > > bootstrap servers perhaps? That would be closer to behavior today > which > > > > only treats DNS resolution failures as fatal. What do you think? > > > > > > > > Best, > > > > Jason > > > > > > > > On Mon, Apr 10, 2023 at 1:53 PM Philip Nee <philip...@gmail.com> > > wrote: > > > > > > > > > Thanks, everyone: I'm starting a vote today. Here's the recap for > > some > > > > of > > > > > the questions: > > > > > > > > > > John: I changed the proposal to throw a non-retriable exception > after > > > the > > > > > timeout elapses. I feel it might be necessary to poison the client > > > after > > > > > retry expires, as it might indicate a real issue. > > > > > Ismael: The proposal is to add a configuration for the retry and it > > > will > > > > > throw a non-retriable exception after the time expires. > > > > > Chris: Addressed some unclarity that you mentioned, and a new API > > won't > > > > be > > > > > introduced in this KIP. Maybe up for future discussion. > > > > > Jason: I'm proposing adding a timeout config and a bootstrap > > exception > > > > per > > > > > your suggestion. > > > > > Kirk: I'm proposing throwing a non-retriable exception in the > network > > > > > client. See previous comment. > > > > > > > > > > > > > > > On Mon, Feb 27, 2023 at 9:36 AM Chris Egerton > > <chr...@aiven.io.invalid > > > > > > > > > wrote: > > > > > > > > > > > Hi Philip, > > > > > > > > > > > > Yeah, "DNS resolution should occur..." seems like a better fit. > 👍 > > > > > > > > > > > > One other question I have is whether we should expose some kind > of > > > > public > > > > > > API for performing preflight validation of the bootstrap URLs. If > > we > > > > > change > > > > > > the behavior of a client configured with a silly typo (e.g., > > > > > > "loclahost instead of localhost") from failing in the constructor > > to > > > > > > failing with a retriable exception, this might lead some client > > > > > > applications to handle that failure by, well, retrying. For > > > reference, > > > > > this > > > > > > is exactly what we do in Kafka Connect right now; see [1] and > [2]. > > > IMO > > > > > it'd > > > > > > be nice to be able to opt into keeping the current behavior so > that > > > > > > projects like Connect could still do preflight checks of the > > > > > > bootstrap.servers property for connectors before starting them, > and > > > > > report > > > > > > any issues by failing fast instead of continuously writing > > > > warning/error > > > > > > messages to their logs. > > > > > > > > > > > > I'm not sure about where this new API could go, but a few options > > > might > > > > > be: > > > > > > > > > > > > - Expose a public variant of the existing ClientUtils class > > > > > > - Add static methods to the ConsumerConfig, ProducerConfig, and > > > > > > AdminClientConfig classes > > > > > > - Add those same static methods to the KafkaConsumer, > > KafkaProducer, > > > > and > > > > > > KafkaAdminClient classes > > > > > > > > > > > > If this seems reasonable, we should probably also specify in the > > KIP > > > > that > > > > > > Kafka Connect will leverage this preflight validation logic > before > > > > > > instantiating any Kafka clients for use by connectors or tasks, > and > > > > > > continue to fail fast if there are typos in the bootstrap.servers > > > > > property, > > > > > > or if temporary DNS resolution issues come up. > > > > > > > > > > > > [1] - > > > > > > > > > > > > > > > > > > > > > > > > > > > https://github.com/apache/kafka/blob/5f9d01668cae64b2cacd7872d82964fa78862aaf/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L606 > > > > > > [2] - > > > > > > > > > > > > > > > > > > > > > > > > > > > https://github.com/apache/kafka/blob/trunk/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L439 > > > > > > > > > > > > Cheers, > > > > > > > > > > > > Chris > > > > > > > > > > > > On Fri, Feb 24, 2023 at 4:59 PM Philip Nee <philip...@gmail.com> > > > > wrote: > > > > > > > > > > > > > Hey Chris, > > > > > > > > > > > > > > Thanks for the quick response, and I apologize for the unclear > > > > wording > > > > > > > there, I guess "DNS lookup" would be a more appropriate wording > > > here. > > > > > So > > > > > > > what I meant there was, to delegate the DNS lookup in the > > > constructor > > > > > to > > > > > > > the network client poll, and it will happen on the very first > > poll. > > > > I > > > > > > > guess the logic could look like this: > > > > > > > > > > > > > > - if the client has been bootstrapped, do nothing. > > > > > > > - Otherwise, perform DNS lookup, and acquire the bootstrap > server > > > > > > address. > > > > > > > > > > > > > > Thanks for the comment there, I'll change up the wording. > Maybe > > > > revise > > > > > > it > > > > > > > as "DNS resolution should occur in the poll...." ? > > > > > > > > > > > > > > P > > > > > > > > > > > > > > On Fri, Feb 24, 2023 at 1:47 PM Chris Egerton > > > > <chr...@aiven.io.invalid > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > Hi Philip, > > > > > > > > > > > > > > > > Thanks for the KIP! > > > > > > > > > > > > > > > > QQ: In the "Proposed Changes" section, the KIP states that > > > > > > "Bootstrapping > > > > > > > > should now occur in the poll method before attempting to > update > > > the > > > > > > > > metadata. This includes resolving the addresses and > > bootstrapping > > > > the > > > > > > > > metadata.". By "bootstrapping the metadata" do we mean > actually > > > > > > > contacting > > > > > > > > the bootstrap servers, or just setting some internal state > > > related > > > > to > > > > > > the > > > > > > > > current set of servers that can be contacted for metadata? I > > ask > > > > > > because > > > > > > > it > > > > > > > > seems like the language here implies the former, but if > that's > > > the > > > > > > case, > > > > > > > > this is already happening in poll (or at least, the first > > > > invocation > > > > > of > > > > > > > > it), and if it's the latter, it's probably not necessary to > > > mention > > > > > in > > > > > > > the > > > > > > > > KIP since it doesn't really impact user-facing behavior. It > > also > > > > > seems > > > > > > > like > > > > > > > > that detail might impact how intertwined this and KIP-899 > are, > > > > though > > > > > > the > > > > > > > > similarity could still be superficial either way. > > > > > > > > > > > > > > > > Cheers, > > > > > > > > > > > > > > > > Chris > > > > > > > > > > > > > > > > On Thu, Feb 23, 2023 at 9:21 PM Philip Nee < > > philip...@gmail.com> > > > > > > wrote: > > > > > > > > > > > > > > > > > Hey Ismael, > > > > > > > > > > > > > > > > > > Thanks for the feedback! The proposal is not to retry > > > > automatically > > > > > > but > > > > > > > > > relies on the user polling the NetworkClient (basically, > > > > > > consumer.poll) > > > > > > > > to > > > > > > > > > reattempt the bootstrap. If bootstrapping fails, a > > > > NetworkException > > > > > > > > > (retriable) will be thrown. > > > > > > > > > > > > > > > > > > Thanks! > > > > > > > > > P > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Feb 23, 2023 at 1:34 PM Ismael Juma < > > ism...@juma.me.uk > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > Thanks for the KIP. Not sure if I missed it, but how long > > > will > > > > we > > > > > > > retry > > > > > > > > > for > > > > > > > > > > and when do we give up and propagate the failure to the > > user? > > > > > > > > > > > > > > > > > > > > Ismael > > > > > > > > > > > > > > > > > > > > On Thu, Feb 23, 2023 at 9:30 AM Philip Nee < > > > > philip...@gmail.com> > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > Hi all! > > > > > > > > > > > > > > > > > > > > > > I want to start a discussion thread about how we can > > handle > > > > > > client > > > > > > > > > > > bootstrap failure due DNS lookup. This requires a bit > of > > > > > > > behavioral > > > > > > > > > > > change, so a KIP is proposed and attached to this > email. > > > Let > > > > me > > > > > > > know > > > > > > > > > what > > > > > > > > > > > you think! > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > *A small remark here*: *As the title of this KIP might > > > sound > > > > > > > > > > > familiar/similar to KIP-899, it is not the same.* > > > > > > > > > > > > > > > > > > > > > > *In Summary:* I want to propose a KIP to change the > > > existing > > > > > > > > bootstrap > > > > > > > > > > > (upon instantiation) strategy because it is reasonable > to > > > > allow > > > > > > > > clients > > > > > > > > > > to > > > > > > > > > > > retry > > > > > > > > > > > > > > > > > > > > > > *KIP: * > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-909%3A+Allow+Clients+to+Rebootstrap+Upon+Failed+DNS+Resolution > > > > > > > > > > > > > > > > > > > > > > Thanks! > > > > > > > > > > > Philip > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >