Cheers Kafka Community, I just wanna give this thread bump, as it has been a bit quiet for the past week. I have not updated the KIP based on Chris and Jason's feedback, as I would also like to know more about what do people think.
Jason - Thanks for the suggestion, I think your suggestion makes a lot of sense. Thanks! P On Tue, Feb 28, 2023 at 2:45 PM Jason Gustafson <ja...@confluent.io.invalid> wrote: > Hi Philip, > > > Having an overall timeout also seems reasonable, but I wonder what should > the client do after running out of the time? Should we throw a > non-retriable exception (instead of TimeoutExceptoin to stop the client > from retrying) and alert the user to examine the config and the DNS server? > > Yeah, not sure exactly. I'd probably suggest a > `BootstrapConnectionException` or something like that with a clear message > indicating the problem. What the user does with it is up to them, but at > least it gives them the option to fail their application if that is what > they prefer to do in this case. If they catch it and ignore it, I would > expect the client to just continue retrying. Logging for bootstrap > dns/connection failures will be helpful in any case. > > -Jason > > > > > > > On Tue, Feb 28, 2023 at 11:47 AM Philip Nee <philip...@gmail.com> wrote: > > > Jason: > > Thanks for the feedback. Now giving it a second thought, I think your > > suggestion of logging the error might make sense, as I could imagine most > > users would just continue to retry, so it might not be necessary to throw > > an exception anyway. > > Having an overall timeout also seems reasonable, but I wonder what should > > the client do after running out of the time? Should we throw a > > non-retriable exception (instead of TimeoutExceptoin to stop the client > > from retrying) and alert the user to examine the config and the DNS > server? > > > > Chris: > > I feel I still haven't answered your question about the pre-flight check, > > as it seems exposing an API might be harder to push through. > > > > Thanks! > > P > > > > On Tue, Feb 28, 2023 at 10:53 AM Jason Gustafson > > <ja...@confluent.io.invalid> > > wrote: > > > > > One more random thought I had just as I pushed send. We're currently > > > treating this problem somewhat narrowly by focusing only on the DNS > > > resolution of the bootstrap servers. Even if the servers resolve, > there's > > > no guarantee that they are reachable by the client. Would it make sense > > to > > > have a timeout which bounds the total time that the client should wait > to > > > connect to the bootstrap servers? Something like ` > > > bootstrap.servers.connection.timeout.ms`. > > > > > > -Jason > > > > > > On Tue, Feb 28, 2023 at 10:44 AM Jason Gustafson <ja...@confluent.io> > > > wrote: > > > > > > > Hi Philip, > > > > > > > > An alternative is not to fail at all. Every other network error is > > caught > > > > and handled internally in the client. We see this case as different > > > because > > > > a DNS resolution error may imply misconfiguration. Could it also > imply > > > that > > > > the DNS server is unavailable? I'm not sure why that case should be > > > handled > > > > differently than if the bootstrap servers themselves are unavailable. > > > Would > > > > it be enough to log a clear warning in the logs if the bootstrap > > servers > > > > could not resolve? > > > > > > > > On the whole, the current fail-fast approach probably does more good > > than > > > > bad, but it does seem somewhat inconsistent overall and my guess is > > that > > > > dynamic environments will become increasingly common. It would be > nice > > to > > > > have a reasonable default behavior which could handle these cases > > > > gracefully without any additional logic. In any case, it would be > nice > > to > > > > see this option in the rejected alternatives at least if we do not > take > > > it. > > > > > > > > If we want to take the route of throwing an exception, then I think > > we're > > > > probably going to need a new configuration since I can't see what a > > > > reasonable timeout we would use as a default. The benefit of a > > > > configuration is that it would let us retain the current default > > behavior > > > > with timeout effectively set to 0 and it would also let users > > effectively > > > > disable the timeout by using a very large value. Otherwise, it seems > > > like a > > > > potential compatibility break to have a new exception type thrown at > > some > > > > arbitrary time without giving the user any control over it. > > > > > > > > Thanks, > > > > Jason > > > > > > > > > > > > On Tue, Feb 28, 2023 at 8:08 AM Chris Egerton > <chr...@aiven.io.invalid > > > > > > > wrote: > > > > > > > >> Hi Philip, > > > >> > > > >> Yeah, it's basically DNS resolution we're talking about, though > > there's > > > >> some additional subtlety there with the logic introduced by KIP-235 > > [1]. > > > >> Essentially it should cover any scenario that causes a client > > > constructor > > > >> to fail with the current logic but would not after this KIP is > > released. > > > >> > > > >> We can generalize the Connect use case like this: a client > application > > > >> that > > > >> may connect to different Kafka clusters with a public-facing, > > > easy-to-use > > > >> API for restarting failed tasks and automatic handling of retriable > > > >> exceptions. The ease with which failed tasks can be restarted is > > > >> significant because it reduces the cost of failing on non-retriable > > > >> exceptions and makes fail-fast behavior easier to work with. And, in > > > cases > > > >> like this where we can't really know whether the error we're dealing > > > with > > > >> is retriable or not, it's better IMO to continue to allow > applications > > > >> like > > > >> these to fail fast. I do agree that it'd be nice to get input from > the > > > >> community, though. > > > >> > > > >> I was toying with the idea of a NetworkException subclass too. It's > a > > > >> simpler API, but it doesn't allow for preflight validation, which > can > > be > > > >> useful in scenarios where submitting new configurations for client > > > >> applications is expensive in terms of time or resources. Then > again, I > > > >> don't see why the two are mutually exclusive, and we might opt to > use > > > the > > > >> NetworkException subclass in this KIP and pursue an opt-in > validation > > > API > > > >> later on. Thoughts? > > > >> > > > >> [1] - > > > >> > > > >> > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-235%3A+Add+DNS+alias+support+for+secured+connection > > > >> > > > >> Cheers, > > > >> > > > >> Chris > > > >> > > > >> On Mon, Feb 27, 2023 at 7:06 PM Philip Nee <philip...@gmail.com> > > wrote: > > > >> > > > >> > Hey Chris, > > > >> > > > > >> > Thanks again for the feedback! > > > >> > > > > >> > > > > >> > For the preflight DNS check (are we basically trying to resolve > the > > > DNS > > > >> > there?): Maybe it makes more sense to add it to the Config > modules? > > I > > > >> would > > > >> > like to hear what the community says as I'm not familiar with the > > > >> Connect > > > >> > use case. > > > >> > > > > >> > A "slower failing" alternative - I wonder if it makes sense for us > > to > > > >> > extend the NetworkException so that clients can be smarter at > > handling > > > >> > these exceptions. Of course, it is still retriable and requires > > > polling > > > >> the > > > >> > consumer, but then we can distinguish the DNS resolution error > from > > > >> other > > > >> > network errors. > > > >> > > > > >> > Thanks! > > > >> > P > > > >> > > > > >> > > > > >> > > > > >> > > > > >> > > > > >> > On Mon, Feb 27, 2023 at 9:36 AM Chris Egerton > > <chr...@aiven.io.invalid > > > > > > > >> > wrote: > > > >> > > > > >> > > Hi Philip, > > > >> > > > > > >> > > Yeah, "DNS resolution should occur..." seems like a better fit. > > 👍 > > > >> > > > > > >> > > One other question I have is whether we should expose some kind > of > > > >> public > > > >> > > API for performing preflight validation of the bootstrap URLs. > If > > we > > > >> > change > > > >> > > the behavior of a client configured with a silly typo (e.g., > > > >> > > "loclahost instead of localhost") from failing in the > constructor > > to > > > >> > > failing with a retriable exception, this might lead some client > > > >> > > applications to handle that failure by, well, retrying. For > > > reference, > > > >> > this > > > >> > > is exactly what we do in Kafka Connect right now; see [1] and > [2]. > > > IMO > > > >> > it'd > > > >> > > be nice to be able to opt into keeping the current behavior so > > that > > > >> > > projects like Connect could still do preflight checks of the > > > >> > > bootstrap.servers property for connectors before starting them, > > and > > > >> > report > > > >> > > any issues by failing fast instead of continuously writing > > > >> warning/error > > > >> > > messages to their logs. > > > >> > > > > > >> > > I'm not sure about where this new API could go, but a few > options > > > >> might > > > >> > be: > > > >> > > > > > >> > > - Expose a public variant of the existing ClientUtils class > > > >> > > - Add static methods to the ConsumerConfig, ProducerConfig, and > > > >> > > AdminClientConfig classes > > > >> > > - Add those same static methods to the KafkaConsumer, > > KafkaProducer, > > > >> and > > > >> > > KafkaAdminClient classes > > > >> > > > > > >> > > If this seems reasonable, we should probably also specify in the > > KIP > > > >> that > > > >> > > Kafka Connect will leverage this preflight validation logic > before > > > >> > > instantiating any Kafka clients for use by connectors or tasks, > > and > > > >> > > continue to fail fast if there are typos in the > bootstrap.servers > > > >> > property, > > > >> > > or if temporary DNS resolution issues come up. > > > >> > > > > > >> > > [1] - > > > >> > > > > > >> > > > > > >> > > > > >> > > > > > > https://github.com/apache/kafka/blob/5f9d01668cae64b2cacd7872d82964fa78862aaf/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L606 > > > >> > > [2] - > > > >> > > > > > >> > > > > > >> > > > > >> > > > > > > https://github.com/apache/kafka/blob/trunk/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L439 > > > >> > > > > > >> > > Cheers, > > > >> > > > > > >> > > Chris > > > >> > > > > > >> > > On Fri, Feb 24, 2023 at 4:59 PM Philip Nee <philip...@gmail.com > > > > > >> wrote: > > > >> > > > > > >> > > > Hey Chris, > > > >> > > > > > > >> > > > Thanks for the quick response, and I apologize for the unclear > > > >> wording > > > >> > > > there, I guess "DNS lookup" would be a more appropriate > wording > > > >> here. > > > >> > So > > > >> > > > what I meant there was, to delegate the DNS lookup in the > > > >> constructor > > > >> > to > > > >> > > > the network client poll, and it will happen on the very first > > > >> poll. I > > > >> > > > guess the logic could look like this: > > > >> > > > > > > >> > > > - if the client has been bootstrapped, do nothing. > > > >> > > > - Otherwise, perform DNS lookup, and acquire the bootstrap > > server > > > >> > > address. > > > >> > > > > > > >> > > > Thanks for the comment there, I'll change up the wording. > Maybe > > > >> revise > > > >> > > it > > > >> > > > as "DNS resolution should occur in the poll...." ? > > > >> > > > > > > >> > > > P > > > >> > > > > > > >> > > > On Fri, Feb 24, 2023 at 1:47 PM Chris Egerton > > > >> <chr...@aiven.io.invalid > > > >> > > > > > >> > > > wrote: > > > >> > > > > > > >> > > > > Hi Philip, > > > >> > > > > > > > >> > > > > Thanks for the KIP! > > > >> > > > > > > > >> > > > > QQ: In the "Proposed Changes" section, the KIP states that > > > >> > > "Bootstrapping > > > >> > > > > should now occur in the poll method before attempting to > > update > > > >> the > > > >> > > > > metadata. This includes resolving the addresses and > > > bootstrapping > > > >> the > > > >> > > > > metadata.". By "bootstrapping the metadata" do we mean > > actually > > > >> > > > contacting > > > >> > > > > the bootstrap servers, or just setting some internal state > > > >> related to > > > >> > > the > > > >> > > > > current set of servers that can be contacted for metadata? I > > ask > > > >> > > because > > > >> > > > it > > > >> > > > > seems like the language here implies the former, but if > that's > > > the > > > >> > > case, > > > >> > > > > this is already happening in poll (or at least, the first > > > >> invocation > > > >> > of > > > >> > > > > it), and if it's the latter, it's probably not necessary to > > > >> mention > > > >> > in > > > >> > > > the > > > >> > > > > KIP since it doesn't really impact user-facing behavior. It > > also > > > >> > seems > > > >> > > > like > > > >> > > > > that detail might impact how intertwined this and KIP-899 > are, > > > >> though > > > >> > > the > > > >> > > > > similarity could still be superficial either way. > > > >> > > > > > > > >> > > > > Cheers, > > > >> > > > > > > > >> > > > > Chris > > > >> > > > > > > > >> > > > > On Thu, Feb 23, 2023 at 9:21 PM Philip Nee < > > philip...@gmail.com > > > > > > > >> > > wrote: > > > >> > > > > > > > >> > > > > > Hey Ismael, > > > >> > > > > > > > > >> > > > > > Thanks for the feedback! The proposal is not to retry > > > >> automatically > > > >> > > but > > > >> > > > > > relies on the user polling the NetworkClient (basically, > > > >> > > consumer.poll) > > > >> > > > > to > > > >> > > > > > reattempt the bootstrap. If bootstrapping fails, a > > > >> NetworkException > > > >> > > > > > (retriable) will be thrown. > > > >> > > > > > > > > >> > > > > > Thanks! > > > >> > > > > > P > > > >> > > > > > > > > >> > > > > > > > > >> > > > > > > > > >> > > > > > On Thu, Feb 23, 2023 at 1:34 PM Ismael Juma < > > > ism...@juma.me.uk> > > > >> > > wrote: > > > >> > > > > > > > > >> > > > > > > Thanks for the KIP. Not sure if I missed it, but how > long > > > >> will we > > > >> > > > retry > > > >> > > > > > for > > > >> > > > > > > and when do we give up and propagate the failure to the > > > user? > > > >> > > > > > > > > > >> > > > > > > Ismael > > > >> > > > > > > > > > >> > > > > > > On Thu, Feb 23, 2023 at 9:30 AM Philip Nee < > > > >> philip...@gmail.com> > > > >> > > > > wrote: > > > >> > > > > > > > > > >> > > > > > > > Hi all! > > > >> > > > > > > > > > > >> > > > > > > > I want to start a discussion thread about how we can > > > handle > > > >> > > client > > > >> > > > > > > > bootstrap failure due DNS lookup. This requires a bit > > of > > > >> > > > behavioral > > > >> > > > > > > > change, so a KIP is proposed and attached to this > email. > > > >> Let me > > > >> > > > know > > > >> > > > > > what > > > >> > > > > > > > you think! > > > >> > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > *A small remark here*: *As the title of this KIP might > > > sound > > > >> > > > > > > > familiar/similar to KIP-899, it is not the same.* > > > >> > > > > > > > > > > >> > > > > > > > *In Summary:* I want to propose a KIP to change the > > > existing > > > >> > > > > bootstrap > > > >> > > > > > > > (upon instantiation) strategy because it is reasonable > > to > > > >> allow > > > >> > > > > clients > > > >> > > > > > > to > > > >> > > > > > > > retry > > > >> > > > > > > > > > > >> > > > > > > > *KIP: * > > > >> > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-909%3A+Allow+Clients+to+Rebootstrap+Upon+Failed+DNS+Resolution > > > >> > > > > > > > > > > >> > > > > > > > Thanks! > > > >> > > > > > > > Philip > > > >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > > > > > > > >