Sounds good. I think it's ready to call a vote. Thanks Philip!
On Wed, Apr 5, 2023, at 11:24 AM, Philip Nee wrote: > Hi all, > > The KIP has been around for some time, and I've updated the document > according to the previous comments. Here is the outline of the proposed > changes: > 1. Adding a timeout configuration. > 2. Adding a new exception type > 3. Adding a WARN level log message > > The obvious changes on the client side are: > 1. They won't attempt to resolve for DNS at the constructor level > 2. They will try to bootstrap once network client poll is invoked. > > In terms of client-side behavior: > *Consumer*: Either the poll timer runs out and returns an empty record, or > BootstrapConnectionException thrown > *Producer*: API calls will be blocked on waitOnMetadata until the timer > runs out, either the max block ms, or the bootstrap timer. > *Admin*: The API calls timeout if API timeout expires first; otherwise, it > throws a BootstrapConnectionException. > > Let me know your thoughts: I would like to start voting in a week or so. > > Link: > https://cwiki.apache.org/confluence/display/KAFKA/KIP-909%3A+DNS+Resolution+Failure+Should+Not+Fail+the+Clients > > Thanks, > P > > On Thu, Mar 23, 2023 at 12:59 PM Philip Nee <philip...@gmail.com> wrote: > > > Hey Kirk, > > > > Sorry about omitting your response; it slipped through the cracks... > > > > *I’m not sure if producers and consumers likewise do DNS resolution in > > their constructors?* > > Yes, both producer and consumer bootstrap in the constructor. > > > > *I agree that moving the DNS resolution to poll(), it would be hard to > > distinguish hard failures (host name resolution) from transient network > > issues. After all, the DNS resolution issue we saw was technically a > > transient issue.* > > I'm amending a timeout configuration in my KIP because that should resolve > > any transient issues if it doesn't persist for too long. If it is a hard > > failure (like, misconfiguration of the host name), it should be logged and > > errored out after the expiration. > > > > P > > > > On Wed, Mar 8, 2023 at 8:12 AM Kirk True <k...@kirktrue.pro> wrote: > > > >> Hi Philip, > >> > >> I’m understanding the options proposed as consisting of these questions: > >> > >> Should we throw an Exception or not? > >> Where should the DNS resolution/bootstrapping occur—in the constructor, > >> poll, or somewhere else? > >> Should there be a timeout, and if so, what configuration drives it? > >> > >> We were seeing instances of this issue when constructing KafkaAdminClient > >> instances. The constructor performs the DNS lookup in that client. I’m not > >> sure if producers and consumers likewise do DNS resolution in their > >> constructors? > >> > >> I agree that moving the DNS resolution to poll(), it would be hard to > >> distinguish hard failures (host name resolution) from transient network > >> issues. After all, the DNS resolution issue we saw was technically a > >> transient issue. > >> > >> Is this an accurate summary of the current thinking: > >> > >> Per Jason's suggestion, introduce a new BootstrapConnectionException > >> Per Chris’ suggestion, introduce a new > >> API—performInitialDnsResolution()—that an application developer can call to > >> perform a fast-fail check, throwing BootstrapConnectionException > >> Otherwise, move the DNS resolution to poll() > >> Handle BootstrapConnectionException failures in poll() similar to how we > >> handle NetworkException today, with retries, timeouts, etc., i.e. we don’t > >> introduce any new configuration > >> Improve logging to distinguish the DNS resolution case > >> > >> Thanks, > >> Kirk > >> > >> > On Mar 6, 2023, at 9:15 AM, Philip Nee <philip...@gmail.com> wrote: > >> > > >> > Cheers Kafka Community, > >> > > >> > I just wanna give this thread bump, as it has been a bit quiet for the > >> past > >> > week. I have not updated the KIP based on Chris and Jason's feedback, > >> as I > >> > would also like to know more about what do people think. > >> > > >> > Jason - Thanks for the suggestion, I think your suggestion makes a lot > >> of > >> > sense. > >> > > >> > Thanks! > >> > P > >> > > >> > On Tue, Feb 28, 2023 at 2:45 PM Jason Gustafson > >> <ja...@confluent.io.invalid> > >> > wrote: > >> > > >> >> Hi Philip, > >> >> > >> >>> Having an overall timeout also seems reasonable, but I wonder what > >> should > >> >> the client do after running out of the time? Should we throw a > >> >> non-retriable exception (instead of TimeoutExceptoin to stop the client > >> >> from retrying) and alert the user to examine the config and the DNS > >> server? > >> >> > >> >> Yeah, not sure exactly. I'd probably suggest a > >> >> `BootstrapConnectionException` or something like that with a clear > >> message > >> >> indicating the problem. What the user does with it is up to them, but > >> at > >> >> least it gives them the option to fail their application if that is > >> what > >> >> they prefer to do in this case. If they catch it and ignore it, I would > >> >> expect the client to just continue retrying. Logging for bootstrap > >> >> dns/connection failures will be helpful in any case. > >> >> > >> >> -Jason > >> >> > >> >> > >> >> > >> >> > >> >> > >> >> > >> >> On Tue, Feb 28, 2023 at 11:47 AM Philip Nee <philip...@gmail.com> > >> wrote: > >> >> > >> >>> Jason: > >> >>> Thanks for the feedback. Now giving it a second thought, I think your > >> >>> suggestion of logging the error might make sense, as I could imagine > >> most > >> >>> users would just continue to retry, so it might not be necessary to > >> throw > >> >>> an exception anyway. > >> >>> Having an overall timeout also seems reasonable, but I wonder what > >> should > >> >>> the client do after running out of the time? Should we throw a > >> >>> non-retriable exception (instead of TimeoutExceptoin to stop the > >> client > >> >>> from retrying) and alert the user to examine the config and the DNS > >> >> server? > >> >>> > >> >>> Chris: > >> >>> I feel I still haven't answered your question about the pre-flight > >> check, > >> >>> as it seems exposing an API might be harder to push through. > >> >>> > >> >>> Thanks! > >> >>> P > >> >>> > >> >>> On Tue, Feb 28, 2023 at 10:53 AM Jason Gustafson > >> >>> <ja...@confluent.io.invalid> > >> >>> wrote: > >> >>> > >> >>>> One more random thought I had just as I pushed send. We're currently > >> >>>> treating this problem somewhat narrowly by focusing only on the DNS > >> >>>> resolution of the bootstrap servers. Even if the servers resolve, > >> >> there's > >> >>>> no guarantee that they are reachable by the client. Would it make > >> sense > >> >>> to > >> >>>> have a timeout which bounds the total time that the client should > >> wait > >> >> to > >> >>>> connect to the bootstrap servers? Something like ` > >> >>>> bootstrap.servers.connection.timeout.ms`. > >> >>>> > >> >>>> -Jason > >> >>>> > >> >>>> On Tue, Feb 28, 2023 at 10:44 AM Jason Gustafson <ja...@confluent.io > >> > > >> >>>> wrote: > >> >>>> > >> >>>>> Hi Philip, > >> >>>>> > >> >>>>> An alternative is not to fail at all. Every other network error is > >> >>> caught > >> >>>>> and handled internally in the client. We see this case as different > >> >>>> because > >> >>>>> a DNS resolution error may imply misconfiguration. Could it also > >> >> imply > >> >>>> that > >> >>>>> the DNS server is unavailable? I'm not sure why that case should be > >> >>>> handled > >> >>>>> differently than if the bootstrap servers themselves are > >> unavailable. > >> >>>> Would > >> >>>>> it be enough to log a clear warning in the logs if the bootstrap > >> >>> servers > >> >>>>> could not resolve? > >> >>>>> > >> >>>>> On the whole, the current fail-fast approach probably does more good > >> >>> than > >> >>>>> bad, but it does seem somewhat inconsistent overall and my guess is > >> >>> that > >> >>>>> dynamic environments will become increasingly common. It would be > >> >> nice > >> >>> to > >> >>>>> have a reasonable default behavior which could handle these cases > >> >>>>> gracefully without any additional logic. In any case, it would be > >> >> nice > >> >>> to > >> >>>>> see this option in the rejected alternatives at least if we do not > >> >> take > >> >>>> it. > >> >>>>> > >> >>>>> If we want to take the route of throwing an exception, then I think > >> >>> we're > >> >>>>> probably going to need a new configuration since I can't see what a > >> >>>>> reasonable timeout we would use as a default. The benefit of a > >> >>>>> configuration is that it would let us retain the current default > >> >>> behavior > >> >>>>> with timeout effectively set to 0 and it would also let users > >> >>> effectively > >> >>>>> disable the timeout by using a very large value. Otherwise, it seems > >> >>>> like a > >> >>>>> potential compatibility break to have a new exception type thrown at > >> >>> some > >> >>>>> arbitrary time without giving the user any control over it. > >> >>>>> > >> >>>>> Thanks, > >> >>>>> Jason > >> >>>>> > >> >>>>> > >> >>>>> On Tue, Feb 28, 2023 at 8:08 AM Chris Egerton > >> >> <chr...@aiven.io.invalid > >> >>>> > >> >>>>> wrote: > >> >>>>> > >> >>>>>> Hi Philip, > >> >>>>>> > >> >>>>>> Yeah, it's basically DNS resolution we're talking about, though > >> >>> there's > >> >>>>>> some additional subtlety there with the logic introduced by KIP-235 > >> >>> [1]. > >> >>>>>> Essentially it should cover any scenario that causes a client > >> >>>> constructor > >> >>>>>> to fail with the current logic but would not after this KIP is > >> >>> released. > >> >>>>>> > >> >>>>>> We can generalize the Connect use case like this: a client > >> >> application > >> >>>>>> that > >> >>>>>> may connect to different Kafka clusters with a public-facing, > >> >>>> easy-to-use > >> >>>>>> API for restarting failed tasks and automatic handling of retriable > >> >>>>>> exceptions. The ease with which failed tasks can be restarted is > >> >>>>>> significant because it reduces the cost of failing on non-retriable > >> >>>>>> exceptions and makes fail-fast behavior easier to work with. And, > >> in > >> >>>> cases > >> >>>>>> like this where we can't really know whether the error we're > >> dealing > >> >>>> with > >> >>>>>> is retriable or not, it's better IMO to continue to allow > >> >> applications > >> >>>>>> like > >> >>>>>> these to fail fast. I do agree that it'd be nice to get input from > >> >> the > >> >>>>>> community, though. > >> >>>>>> > >> >>>>>> I was toying with the idea of a NetworkException subclass too. It's > >> >> a > >> >>>>>> simpler API, but it doesn't allow for preflight validation, which > >> >> can > >> >>> be > >> >>>>>> useful in scenarios where submitting new configurations for client > >> >>>>>> applications is expensive in terms of time or resources. Then > >> >> again, I > >> >>>>>> don't see why the two are mutually exclusive, and we might opt to > >> >> use > >> >>>> the > >> >>>>>> NetworkException subclass in this KIP and pursue an opt-in > >> >> validation > >> >>>> API > >> >>>>>> later on. Thoughts? > >> >>>>>> > >> >>>>>> [1] - > >> >>>>>> > >> >>>>>> > >> >>>> > >> >>> > >> >> > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-235%3A+Add+DNS+alias+support+for+secured+connection > >> >>>>>> > >> >>>>>> Cheers, > >> >>>>>> > >> >>>>>> Chris > >> >>>>>> > >> >>>>>> On Mon, Feb 27, 2023 at 7:06 PM Philip Nee <philip...@gmail.com> > >> >>> wrote: > >> >>>>>> > >> >>>>>>> Hey Chris, > >> >>>>>>> > >> >>>>>>> Thanks again for the feedback! > >> >>>>>>> > >> >>>>>>> > >> >>>>>>> For the preflight DNS check (are we basically trying to resolve > >> >> the > >> >>>> DNS > >> >>>>>>> there?): Maybe it makes more sense to add it to the Config > >> >> modules? > >> >>> I > >> >>>>>> would > >> >>>>>>> like to hear what the community says as I'm not familiar with the > >> >>>>>> Connect > >> >>>>>>> use case. > >> >>>>>>> > >> >>>>>>> A "slower failing" alternative - I wonder if it makes sense for us > >> >>> to > >> >>>>>>> extend the NetworkException so that clients can be smarter at > >> >>> handling > >> >>>>>>> these exceptions. Of course, it is still retriable and requires > >> >>>> polling > >> >>>>>> the > >> >>>>>>> consumer, but then we can distinguish the DNS resolution error > >> >> from > >> >>>>>> other > >> >>>>>>> network errors. > >> >>>>>>> > >> >>>>>>> Thanks! > >> >>>>>>> P > >> >>>>>>> > >> >>>>>>> > >> >>>>>>> > >> >>>>>>> > >> >>>>>>> > >> >>>>>>> On Mon, Feb 27, 2023 at 9:36 AM Chris Egerton > >> >>> <chr...@aiven.io.invalid > >> >>>>> > >> >>>>>>> wrote: > >> >>>>>>> > >> >>>>>>>> Hi Philip, > >> >>>>>>>> > >> >>>>>>>> Yeah, "DNS resolution should occur..." seems like a better fit. > >> >>> 👍 > >> >>>>>>>> > >> >>>>>>>> One other question I have is whether we should expose some kind > >> >> of > >> >>>>>> public > >> >>>>>>>> API for performing preflight validation of the bootstrap URLs. > >> >> If > >> >>> we > >> >>>>>>> change > >> >>>>>>>> the behavior of a client configured with a silly typo (e.g., > >> >>>>>>>> "loclahost instead of localhost") from failing in the > >> >> constructor > >> >>> to > >> >>>>>>>> failing with a retriable exception, this might lead some client > >> >>>>>>>> applications to handle that failure by, well, retrying. For > >> >>>> reference, > >> >>>>>>> this > >> >>>>>>>> is exactly what we do in Kafka Connect right now; see [1] and > >> >> [2]. > >> >>>> IMO > >> >>>>>>> it'd > >> >>>>>>>> be nice to be able to opt into keeping the current behavior so > >> >>> that > >> >>>>>>>> projects like Connect could still do preflight checks of the > >> >>>>>>>> bootstrap.servers property for connectors before starting them, > >> >>> and > >> >>>>>>> report > >> >>>>>>>> any issues by failing fast instead of continuously writing > >> >>>>>> warning/error > >> >>>>>>>> messages to their logs. > >> >>>>>>>> > >> >>>>>>>> I'm not sure about where this new API could go, but a few > >> >> options > >> >>>>>> might > >> >>>>>>> be: > >> >>>>>>>> > >> >>>>>>>> - Expose a public variant of the existing ClientUtils class > >> >>>>>>>> - Add static methods to the ConsumerConfig, ProducerConfig, and > >> >>>>>>>> AdminClientConfig classes > >> >>>>>>>> - Add those same static methods to the KafkaConsumer, > >> >>> KafkaProducer, > >> >>>>>> and > >> >>>>>>>> KafkaAdminClient classes > >> >>>>>>>> > >> >>>>>>>> If this seems reasonable, we should probably also specify in the > >> >>> KIP > >> >>>>>> that > >> >>>>>>>> Kafka Connect will leverage this preflight validation logic > >> >> before > >> >>>>>>>> instantiating any Kafka clients for use by connectors or tasks, > >> >>> and > >> >>>>>>>> continue to fail fast if there are typos in the > >> >> bootstrap.servers > >> >>>>>>> property, > >> >>>>>>>> or if temporary DNS resolution issues come up. > >> >>>>>>>> > >> >>>>>>>> [1] - > >> >>>>>>>> > >> >>>>>>>> > >> >>>>>>> > >> >>>>>> > >> >>>> > >> >>> > >> >> > >> https://github.com/apache/kafka/blob/5f9d01668cae64b2cacd7872d82964fa78862aaf/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L606 > >> >>>>>>>> [2] - > >> >>>>>>>> > >> >>>>>>>> > >> >>>>>>> > >> >>>>>> > >> >>>> > >> >>> > >> >> > >> https://github.com/apache/kafka/blob/trunk/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L439 > >> >>>>>>>> > >> >>>>>>>> Cheers, > >> >>>>>>>> > >> >>>>>>>> Chris > >> >>>>>>>> > >> >>>>>>>> On Fri, Feb 24, 2023 at 4:59 PM Philip Nee <philip...@gmail.com > >> >>> > >> >>>>>> wrote: > >> >>>>>>>> > >> >>>>>>>>> Hey Chris, > >> >>>>>>>>> > >> >>>>>>>>> Thanks for the quick response, and I apologize for the unclear > >> >>>>>> wording > >> >>>>>>>>> there, I guess "DNS lookup" would be a more appropriate > >> >> wording > >> >>>>>> here. > >> >>>>>>> So > >> >>>>>>>>> what I meant there was, to delegate the DNS lookup in the > >> >>>>>> constructor > >> >>>>>>> to > >> >>>>>>>>> the network client poll, and it will happen on the very first > >> >>>>>> poll. I > >> >>>>>>>>> guess the logic could look like this: > >> >>>>>>>>> > >> >>>>>>>>> - if the client has been bootstrapped, do nothing. > >> >>>>>>>>> - Otherwise, perform DNS lookup, and acquire the bootstrap > >> >>> server > >> >>>>>>>> address. > >> >>>>>>>>> > >> >>>>>>>>> Thanks for the comment there, I'll change up the wording. > >> >> Maybe > >> >>>>>> revise > >> >>>>>>>> it > >> >>>>>>>>> as "DNS resolution should occur in the poll...." ? > >> >>>>>>>>> > >> >>>>>>>>> P > >> >>>>>>>>> > >> >>>>>>>>> On Fri, Feb 24, 2023 at 1:47 PM Chris Egerton > >> >>>>>> <chr...@aiven.io.invalid > >> >>>>>>>> > >> >>>>>>>>> wrote: > >> >>>>>>>>> > >> >>>>>>>>>> Hi Philip, > >> >>>>>>>>>> > >> >>>>>>>>>> Thanks for the KIP! > >> >>>>>>>>>> > >> >>>>>>>>>> QQ: In the "Proposed Changes" section, the KIP states that > >> >>>>>>>> "Bootstrapping > >> >>>>>>>>>> should now occur in the poll method before attempting to > >> >>> update > >> >>>>>> the > >> >>>>>>>>>> metadata. This includes resolving the addresses and > >> >>>> bootstrapping > >> >>>>>> the > >> >>>>>>>>>> metadata.". By "bootstrapping the metadata" do we mean > >> >>> actually > >> >>>>>>>>> contacting > >> >>>>>>>>>> the bootstrap servers, or just setting some internal state > >> >>>>>> related to > >> >>>>>>>> the > >> >>>>>>>>>> current set of servers that can be contacted for metadata? I > >> >>> ask > >> >>>>>>>> because > >> >>>>>>>>> it > >> >>>>>>>>>> seems like the language here implies the former, but if > >> >> that's > >> >>>> the > >> >>>>>>>> case, > >> >>>>>>>>>> this is already happening in poll (or at least, the first > >> >>>>>> invocation > >> >>>>>>> of > >> >>>>>>>>>> it), and if it's the latter, it's probably not necessary to > >> >>>>>> mention > >> >>>>>>> in > >> >>>>>>>>> the > >> >>>>>>>>>> KIP since it doesn't really impact user-facing behavior. It > >> >>> also > >> >>>>>>> seems > >> >>>>>>>>> like > >> >>>>>>>>>> that detail might impact how intertwined this and KIP-899 > >> >> are, > >> >>>>>> though > >> >>>>>>>> the > >> >>>>>>>>>> similarity could still be superficial either way. > >> >>>>>>>>>> > >> >>>>>>>>>> Cheers, > >> >>>>>>>>>> > >> >>>>>>>>>> Chris > >> >>>>>>>>>> > >> >>>>>>>>>> On Thu, Feb 23, 2023 at 9:21 PM Philip Nee < > >> >>> philip...@gmail.com > >> >>>>> > >> >>>>>>>> wrote: > >> >>>>>>>>>> > >> >>>>>>>>>>> Hey Ismael, > >> >>>>>>>>>>> > >> >>>>>>>>>>> Thanks for the feedback! The proposal is not to retry > >> >>>>>> automatically > >> >>>>>>>> but > >> >>>>>>>>>>> relies on the user polling the NetworkClient (basically, > >> >>>>>>>> consumer.poll) > >> >>>>>>>>>> to > >> >>>>>>>>>>> reattempt the bootstrap. If bootstrapping fails, a > >> >>>>>> NetworkException > >> >>>>>>>>>>> (retriable) will be thrown. > >> >>>>>>>>>>> > >> >>>>>>>>>>> Thanks! > >> >>>>>>>>>>> P > >> >>>>>>>>>>> > >> >>>>>>>>>>> > >> >>>>>>>>>>> > >> >>>>>>>>>>> On Thu, Feb 23, 2023 at 1:34 PM Ismael Juma < > >> >>>> ism...@juma.me.uk> > >> >>>>>>>> wrote: > >> >>>>>>>>>>> > >> >>>>>>>>>>>> Thanks for the KIP. Not sure if I missed it, but how > >> >> long > >> >>>>>> will we > >> >>>>>>>>> retry > >> >>>>>>>>>>> for > >> >>>>>>>>>>>> and when do we give up and propagate the failure to the > >> >>>> user? > >> >>>>>>>>>>>> > >> >>>>>>>>>>>> Ismael > >> >>>>>>>>>>>> > >> >>>>>>>>>>>> On Thu, Feb 23, 2023 at 9:30 AM Philip Nee < > >> >>>>>> philip...@gmail.com> > >> >>>>>>>>>> wrote: > >> >>>>>>>>>>>> > >> >>>>>>>>>>>>> Hi all! > >> >>>>>>>>>>>>> > >> >>>>>>>>>>>>> I want to start a discussion thread about how we can > >> >>>> handle > >> >>>>>>>> client > >> >>>>>>>>>>>>> bootstrap failure due DNS lookup. This requires a bit > >> >>> of > >> >>>>>>>>> behavioral > >> >>>>>>>>>>>>> change, so a KIP is proposed and attached to this > >> >> email. > >> >>>>>> Let me > >> >>>>>>>>> know > >> >>>>>>>>>>> what > >> >>>>>>>>>>>>> you think! > >> >>>>>>>>>>>>> > >> >>>>>>>>>>>>> > >> >>>>>>>>>>>>> *A small remark here*: *As the title of this KIP might > >> >>>> sound > >> >>>>>>>>>>>>> familiar/similar to KIP-899, it is not the same.* > >> >>>>>>>>>>>>> > >> >>>>>>>>>>>>> *In Summary:* I want to propose a KIP to change the > >> >>>> existing > >> >>>>>>>>>> bootstrap > >> >>>>>>>>>>>>> (upon instantiation) strategy because it is reasonable > >> >>> to > >> >>>>>> allow > >> >>>>>>>>>> clients > >> >>>>>>>>>>>> to > >> >>>>>>>>>>>>> retry > >> >>>>>>>>>>>>> > >> >>>>>>>>>>>>> *KIP: * > >> >>>>>>>>>>>>> > >> >>>>>>>>>>>>> > >> >>>>>>>>>>>> > >> >>>>>>>>>>> > >> >>>>>>>>>> > >> >>>>>>>>> > >> >>>>>>>> > >> >>>>>>> > >> >>>>>> > >> >>>> > >> >>> > >> >> > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-909%3A+Allow+Clients+to+Rebootstrap+Upon+Failed+DNS+Resolution > >> >>>>>>>>>>>>> > >> >>>>>>>>>>>>> Thanks! > >> >>>>>>>>>>>>> Philip > >> >>>>>>>>>>>>> > >> >>>>>>>>>>>> > >> >>>>>>>>>>> > >> >>>>>>>>>> > >> >>>>>>>>> > >> >>>>>>>> > >> >>>>>>> > >> >>>>>> > >> >>>>> > >> >>>> > >> >>> > >> >> > >> > >> >