Hi Rajini.  Agreed, we will bump the SaslHandshakeRequest API version number (no wire format change, of course).
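[Editor's note: a minimal sketch of the version gate implied by this bump; the class and method names are hypothetical, not the actual Kafka client code.]

```java
// Hypothetical sketch: the client gates re-authentication on the broker's
// advertised maximum SaslHandshakeRequest version (learned via ApiVersions).
class SaslHandshakeVersionGate {

    // Version at which brokers understand re-authentication, per this thread.
    static final short REAUTH_MIN_VERSION = 2;

    // A 2.x client talking to a 2.0 broker (max version 1) silently skips
    // re-authentication instead of sending a request the broker would reject.
    static boolean shouldAttemptReauthentication(short brokerMaxHandshakeVersion) {
        return brokerMaxHandshakeVersion >= REAUTH_MIN_VERSION;
    }
}
```

Since there is no wire format change, only the advertised version range tells the client whether the broker supports re-authentication.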
Agreed, responding to re-authentication will always be enabled on the broker since it is initiated by the client.

Agreed, connection termination by the broker will be opt-in. I'm thinking *sasl.server.disconnect.expired.credential.connections.enable*, prefixed with the listener and SASL mechanism name in lower case so it can be opted into at the granularity of a SASL mechanism; for example, "listener.name.sasl_ssl.oauthbearer.sasl.server.disconnect.expired.credential.connections.enable=true". It is long, so if you have a preference for a shorter name let me know and we'll go with it instead.

Agreed, additional metrics will be helpful. I'll add them to the KIP.

<<< need to work out exactly how this works with all SASL mechanisms
<<< not sure we have the data required on the broker or a way of
<<< extending the [GSSAPI] mechanism

Not sure I agree here. The paragraph I added to the KIP describes how I think this can be done. Given that description, do you still feel Kerberos will not be possible? If Kerberos presents a significant hurdle then I don't think that should prevent us from doing it for other mechanisms -- I would rather state that it isn't supported with GSSAPI due to limitations in Kerberos itself than not have it for OAUTHBEARER, for example. And right now I don't think we need more than a high-level description of how this could be done. In other words, I think we have this covered, unless there is a fundamental problem due to Kerberos not making data (i.e. ticket expiration) available on the server.

I want to submit this KIP for a vote early next week in the hope of having it accepted by the Monday, 9/24 deadline for the 2.1.0 release, and if that happens I believe I can get a PR into really good shape soon thereafter (I'm working on it now).

Ron

On Fri, Sep 14, 2018 at 5:57 AM Rajini Sivaram <rajinisiva...@gmail.com> wrote:

Hi Ron,

1) Yes, we should update the version of `SaslHandshakeRequest`.
Clients would attempt to re-authenticate only if broker's SaslHandshakeRequest version >= 2.

2) I think we should enable broker's re-authentication and connection termination code regardless of client version. Re-authentication could be always enabled since it is triggered only by clients and broker could just handle it. Connection termination should be opt-in, but should work with clients of any version. If turned on and clients are of an older version, this would simply force reconnection, which should be ok. Additional metrics to monitor expired connections may be useful anyway.

We also need to work out exactly how this works with all SASL mechanisms. In particular, we have to ensure that we can make it work with Kerberos since we use the implementation from the JRE and I am not sure we have the data required on the broker or a way of extending the mechanism.

Thoughts?

On Thu, Sep 13, 2018 at 6:56 PM, Ron Dagostino <rndg...@gmail.com> wrote:

Hi Rajini. I'm thinking about how we deal with migration. For example, let's say we have an existing 2.0.0 cluster using SASL/OAUTHBEARER and we want to use this feature. The desired end state is to have all clients and brokers migrated to a version that supports the feature (call it 2.x) with the feature turned on. We need to document the supported path(s) to this end state.

Here are a couple of scenarios with implications:

1) *Migrate client to 2.x and turn the feature on but broker is still at 2.0.* In this case the client is going to get an error when it sends the SaslHandshakeRequest. We want to avoid this.
It seems to me the client needs to know if the broker has been upgraded to the necessary version before trying to re-authenticate, which makes me believe we need to bump the API version number in 2.x and the client has to check to see if the max version the broker supports meets a minimum standard before trying to re-authenticate. Do you agree?

2) *Migrate broker to 2.x and turn the feature on but client is still at 2.0*. In this case the broker is going to end up killing connections periodically. Again, we want to avoid this. It's tempting to say "don't do this" but I wonder if we can require installations to upgrade all clients before turning the feature on at the brokers. Certainly we can say "don't do this" for inter-broker clients -- in other words, we can say that all brokers in a cluster should be upgraded before the feature is turned on for any one of them -- but I don't know about our ability to say it for non-broker clients.

Now we consider the cases where both the brokers and the clients have been upgraded to 2.x. When and where should the feature be turned on? The inter-broker case is interesting because I don't think we can require installations to restart every broker with a new config where the feature is turned on at the same time. Do you agree that we cannot require this and therefore must support installations restarting brokers with a new config at whatever pace they need -- which may be quite slow relative to token lifetimes? Assuming you do agree, then there is going to be the case where some brokers are going to have the feature turned on and some clients (definitely inter-broker clients at a minimum) are going to have the feature turned off.
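[Editor's note: a toy sketch of the broker-side accounting implied by scenario 2 and the proposed monitoring -- all names are illustrative, not the PR's actual implementation: a count of "active, expired" connections that operators could watch, plus the opt-in kill step.]

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Toy model of broker-side handling of expired credentials; names are
// illustrative only.
class ExpiredConnectionTracker {

    static final class Connection {
        final String id;
        final long credentialExpiryMs;
        boolean open = true;

        Connection(String id, long credentialExpiryMs) {
            this.id = id;
            this.credentialExpiryMs = credentialExpiryMs;
        }
    }

    // The proposed metric: connections still open even though their
    // credential has expired (i.e. clients that never re-authenticated).
    static int activeExpiredCount(Collection<Connection> connections, long nowMs) {
        int count = 0;
        for (Connection c : connections) {
            if (c.open && nowMs >= c.credentialExpiryMs) {
                count++;
            }
        }
        return count;
    }

    // The opt-in kill step: close every active, expired connection and
    // return the ids that were closed.
    static List<String> killExpired(Collection<Connection> connections, long nowMs) {
        List<String> closed = new ArrayList<>();
        for (Connection c : connections) {
            if (c.open && nowMs >= c.credentialExpiryMs) {
                c.open = false;
                closed.add(c.id);
            }
        }
        return closed;
    }
}
```

An operator would wait for the "active, expired" gauge to read zero before enabling the kill step on a broker.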
The above discussion assumes a single config on the broker side that turns on both the inter-broker client re-authentication feature as well as the server-side expired-connection-kill feature, but now I'm thinking we need the ability to turn these features on independently, plus maybe we need a way to monitor the number of "active, expired" connections to the server so that operators can be sure that all clients have been upgraded/enabled before turning on the server-side expired-connection-kill feature.

So the migration plan would be as follows:

1) Upgrade all brokers to 2.x.
2) After all brokers are upgraded, turn on re-authentication for all brokers at whatever rate is desired -- just eventually, at some point, get the client-side feature turned on for all brokers so that inter-broker connections are re-authenticating.
3) In parallel with (1) and (2) above, upgrade clients to 2.x and turn their re-authentication feature on. Clients will check the API version and only re-authenticate to a broker that has also been upgraded (note that the ability of a broker to respond to a re-authentication cannot be turned off -- it is always on beginning with version 2.x, and it just sits there doing nothing if it isn't exercised by an enabled client)
4) After (1), (2), and (3) are complete, check the 2.x broker metrics to confirm that there are no "active, expired" connections. Once you are satisfied that (1), (2), and (3) are indeed complete you can enable the server-side expired-connection-kill feature on each broker via a restart at whatever pace is desired.

Comments?

Ron

On Wed, Sep 12, 2018 at 4:48 PM Ron Dagostino <rndg...@gmail.com> wrote:

Ok, I am tempted to just say we go with the low-level approach since it is the quickest and seems to meet the clear requirements.
We can always adjust later if we get to clarity on other requirements or we decide we need to approach it differently for whatever reason. But in the meantime, before fully committing to this decision, I would appreciate another perspective if someone has one.

Ron

On Wed, Sep 12, 2018 at 3:15 PM Rajini Sivaram <rajinisiva...@gmail.com> wrote:

Hi Ron,

Yes, I would leave out retries from this KIP for now. In the future, if there is a requirement for supporting retries, we can consider it. I think we can support retries with either approach if we needed to, but it would be better to do it along with other changes required to support authentication servers that are not highly available.

For maintainability, I am biased, so it will be good to get the perspective of others in the community :-)

On Wed, Sep 12, 2018 at 7:47 PM, Ron Dagostino <rndg...@gmail.com> wrote:

Hi Rajini. Here is some feedback/some things I thought of.

First, I just realized from a timing perspective that I am not sure retry is going to be necessary. The background login refresh thread triggers re-authentication when it refreshes the client's credential. The OAuth infrastructure has to be available in order for this refresh to succeed (the background thread repeatedly retries if it can't refresh the credential, and that retry loop handles any temporary outage). If clients are told to re-authenticate when the credential is refreshed **and they actually re-authenticate at that point** then it is highly unlikely that the OAuth infrastructure would fail within those intervening milliseconds.
So we don't need re-authentication retry in this KIP as long as retry starts immediately. The low-level prototype as currently coded doesn't actually start re-authentication until the connection is subsequently used, and that could take a while. But then again, if the connection isn't used for some period of time, then the lost value of having to abandon the connection is lessened anyway. Plus, as you pointed out, OAuth infrastructure is assumed to be highly available.

Does this make sense, and would you be willing to say that retry isn't a necessary requirement? I'm tempted but would like to hear your perspective first.

Interleaving/latency and maintainability (more than lines of code) are the two remaining areas of comparison. I did not add the maintainability issue to the KIP yet, but before I do I thought I would address it here first to see if we can come to consensus on it because I'm not sure I see the high-level approach as being hard to maintain (yet -- I could be convinced/convince myself; we'll see). I do want to make sure we are on the same page about what is required to add re-authentication support in the high-level case. Granted, the amount is zero in the low-level case, and it doesn't get any better than that, but the amount in the high-level case is very low -- just a few lines of code.
For example:

KafkaAdminClient:
https://github.com/apache/kafka/pull/5582/commits/4fa70f38b9d33428ff98b64a3a2bd668f5f28c38#diff-6869b8fccf6b098cbcb0676e8ceb26a7
It is the same few lines of code for KafkaConsumer, KafkaProducer, WorkerGroupMember, and TransactionMarkerChannelManager

The two synchronous I/O use cases are ControllerChannelManager and ReplicaFetcherBlockingSend (via ReplicaFetcherThread), and they require a little bit more -- but not much.

Thoughts?

Ron

On Wed, Sep 12, 2018 at 1:57 PM Ron Dagostino <rndg...@gmail.com> wrote:

Thanks, Rajini. Before I digest/respond to that, here's an update that I just completed.

I added a commit to the PR (https://github.com/apache/kafka/pull/5582/) that implements server-side kill of expired OAUTHBEARER connections. No tests yet since we still haven't settled on a final approach (low-level vs. high-level).
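[Editor's note: a toy model of the timing argument made above -- that retry may be unnecessary because re-authentication is triggered the moment the background login thread successfully refreshes the credential. Names are invented for illustration, not code from either PR.]

```java
// Toy model: re-authentication triggered at refresh time leaves only a tiny
// window in which the OAuth infrastructure could fail, whereas deferring
// until the connection is next used leaves a much larger one.
class ReauthTimingModel {

    // The window during which the authentication server must stay up is the
    // gap between the (just successful) credential refresh and the moment
    // the client actually starts re-authenticating.
    static long vulnerabilityWindowMs(long refreshCompletedMs, long reauthStartMs) {
        return Math.max(0, reauthStartMs - refreshCompletedMs);
    }
}
```

With immediate re-authentication the window is milliseconds; with the deferred, on-next-use style it can be minutes for an idle connection.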
I also updated the KIP to reflect the latest discussion and PR as follows:

- Include support for brokers killing connections as part of this KIP (rather than deferring it to a future KIP as was originally mentioned; the PR now includes it as mentioned -- it was very easy to add)
- Added metrics (they will mirror existing ones related to authentications; I have not added those to the PR)
- Updated the implementation description to reflect the current state of the PR, which is a high-level, one-size-fits-all approach (as opposed to my initial, even-higher-level approach)
- Added a "Rejected Alternative" for the first version of the PR, which injected requests directly into synchronous I/O clients' queues
- Added a "Rejected Alternative" for the low-level approach as suggested, but of course we have not formally decided to reject this approach or adopt the current PR implementation.

I'll think about where we stand some more before responding again. Thanks for the above reply.

Ron

On Wed, Sep 12, 2018 at 1:36 PM Rajini Sivaram <rajinisiva...@gmail.com> wrote:

Hi Ron,

Thank you for summarising, I think it covers the differences between the two approaches well.

A few minor points to answer the questions in there:

1) When re-authentication is initiated in the Selector during poll(), we can move an idle channel to re-authentication state. It is similar to injecting requests, but achieved by changing channel back to authenticating state.
3) To clarify why I think re-authentication should fit in with our authentication design: My point was not about a specific connection being usable or not usable. It was about what happens at the client API level. Our client API (producer/consumer/admin client etc.) currently assumes that a single broker authentication failure is a fatal error that is never retried because we assume that broker only ever fails an authentication request if credentials are invalid. If we ever decide to support cases where broker occasionally fails an authentication request due to a transient failure, we need to do more around how we handle authentication failures in clients. We may decide that it is ok to close the connection for authentication and not for re-authentication as you mentioned, but we need to change the way this disconnection is handled by clients. So IMO, we should either add support for transient retriable authentication failures properly or not retry for any scenario. Personally, I don't think we would want to retry all authentication failures even if it is a re-authentication, I think we could (at some point in future), allow brokers to return an error code that indicates that it is a transient broker-side failure rather than invalid credentials and handle the error differently. I see no reason at that point why we wouldn't handle authentication and re-authentication in the same way.

4) As you said, the high-level approach would be bigger than the low-level approach in terms of LOC. But I wouldn't be too concerned about lines of code. My bigger concern was about modularity. Our security code is already complex, protocols like Kerberos and SSL that we use from the JRE make problem diagnosis hard. Async I/O makes the networking code complex. You need to understand the networking layer to work with the security layer, but the rest of the code base doesn't rely on knowledge of network/security layers. My main concern about the high-level approach is that it spans these boundaries, making it harder to maintain in the long run.

On Wed, Sep 12, 2018 at 10:23 AM, Stanislav Kozlovski <stanis...@confluent.io> wrote:

Hi Ron, Rajini

Thanks for summarizing the discussion so far, Ron!

1) How often do we have such long-lived connection idleness (e.g. 5-10 minutes) in practice?

3) I agree that retries for re-authentication are useful.

4) The interleaving of requests sounds like a great feature to have, but the tradeoff against code complexity is a tough one. I would personally go with the simpler approach since you could always add interleaving on top if the community decides the latency should be better.

Best,
Stanislav

On Tue, Sep 11, 2018 at 5:00 AM Ron Dagostino <rndg...@gmail.com> wrote:

Hi everyone. I've updated the PR to reflect the latest conclusions from this ongoing discussion.
The KIP still needs the suggested updates; I will do that later this week. I agree with Rajini that some additional feedback from the community at large would be very helpful at this point in time.

Here's where we stand.

We have 2 separate implementations for re-authenticating existing connections -- a so-called "low-level" approach and a (relatively speaking) "high-level" approach -- and we agree on how they should be compared. Specifically, Rajini has provided a mechanism that works at a relatively low level in the stack by intercepting network sends and queueing them up while re-authentication happens; then the queued sends are sent after re-authentication succeeds, or the connection is closed if re-authentication fails. See the prototype commit at https://github.com/rajinisivaram/kafka/commit/b9d711907ad843c11d17e80d6743bfb1d4e3f3fd.

I have an implementation that works higher up in the stack; it injects authentication requests into the KafkaClient via a new method added to that interface, and the implementation (i.e. NetworkClient) sends them when poll() is called. The updated PR is available at https://github.com/apache/kafka/pull/5582/.

Here is how we think these two approaches should be compared.

1) *When re-authentication starts*.
The low-level approach initiates re-authentication only if/when the connection is actually used, so it may start after the existing credential expires; the current PR implementation injects re-authentication requests into the existing flow, and re-authentication starts immediately regardless of whether or not the connection is being used for something else. Rajini believes the low-level approach can be adjusted to re-authenticate idle connections (Rajini, I don't see how this can be done without injecting requests, which is what the high-level implementation is doing, but I am probably missing something -- no need to go into it unless it is easy to explain.)

2) *When requests not related to re-authentication are able to use the re-authenticating connection*. The low-level approach finishes re-authentication completely before allowing anything else to traverse the connection, and we agree that this is how the low-level implementation must work without significant work/changes. The current PR implementation interleaves re-authentication requests with the existing flow in the asynchronous I/O use cases (Consumer, Producer, Admin Client, etc.), and it allows the two synchronous use cases (ControllerChannelManager and ReplicaFetcherBlockingSend/ReplicaFetcherThread) to decide which style they want.
I (somewhat arbitrarily, to prove out the flexibility of the high-level approach) coded ControllerChannelManager to do complete, non-interleaved re-authentication and ReplicaFetcherBlockingSend/ReplicaFetcherThread to interleave requests. The approach has impacts on the maximum size of latency spikes: interleaving requests can decrease the latency. The benefit of interleaving is tough to evaluate because it isn't clear what the latency requirement really is. Note that re-authentication can entail several network round-trips between the client and the broker. Comments in this area would be especially appreciated.

3) *What happens if re-authentication fails (i.e. retry capability)*. I think this is where we have not yet settled on what the requirement is (even more so than the issue of potential latency mitigation requirements). Rajini, you had mentioned that re-authentication should work the same way as authentication, but I see the two situations as being asymmetric. In the authentication case, if authentication fails, the connection was never fully established and could not be used, and it is closed -- TLS negotiation may have taken place, so that effort is lost, but beyond that nothing else is lost. In the re-authentication case the connection is fully established, it is in use, and in fact it can remain in use for some time going forward as long as the existing credential remains unexpired; abandoning it at this point due to a re-authentication failure (which can occur due to no fault of the client -- i.e. a remote directory or OAuth server being temporarily down) abandons a fully functioning connection with remaining lifetime. Am I thinking about this reasonably? I still tend to believe that retries of failed re-authentications are a requirement.

4) *Size*. The low-level approach is smaller in terms of code. It is a prototype, so it doesn't have everything the high-level approach has, but I have split the PR for the high-level approach into two separate commits to facilitate a more apples-to-apples comparison: the first commit contains infrastructure that would have to be added to the low-level prototype if we went with that, and the second commit has everything specific to the high-level approach. So comparing the low-level prototype commit (~350 changes) to the second commit in the high-level PR (~1400 changes) is reasonable.
The high-level approach at this point is about 4 times as big, though it does have lots of error/sanity checking and functionality (such as retry) that would to at least some degree have to be added to the low-level implementation. Still, the high-level approach is (and will likely remain) somewhat bigger than the low-level approach.

I hope I presented this in a balanced and accurate way; Rajini, please correct me if I got anything wrong or left out anything important. Apologies in advance if that is the case -- it is unintended.

Again, comments/discussion from the wider community at this juncture would be especially appreciated.

Ron

On Mon, Sep 10, 2018 at 7:42 AM Rajini Sivaram <rajinisiva...@gmail.com> wrote:

Hi Ron,

Thank you for the analysis. Yes, I agree with the differences you have pointed out between the two approaches.

1. Re-authenticating idle connections (or connections nearing idle timeout): With the lower-level approach, there is nothing stopping us from re-authenticating idle connections. We could either consider idle connections for re-authentication or we could implement code in the broker to kill connections that deals with delays introduced by idleness.
Since we expect token lifetimes to be much higher than connection expiry time, either would work. And the second approach is probably better.

2. Interleaving authentication requests and application requests to reduce impact on latency: It is not impossible to interleave with the low-level approach, but it adds complexity and reduces the value of this approach. So it will be good to understand the requirements in terms of latency. In terms of comparing the two approaches, it is fair to compare non-interleaved low-level with the naturally interleaved high-level approach.

3. Retries: I don't think the requirement for retries is different between authentication and re-authentication from a client's point of view. We should either support retries for both or not at all. We currently process authentication failures as fatal errors. We expect that if you are using external authentication servers, those servers will be highly available. Our built-in authentication mechanisms avoid accessing external servers in the authentication path (during authentication, verification is performed using private/public keys or cache lookup). This is necessary in our design since authentications are performed synchronously on the network thread.

One more point with the lower-level mechanism: If one day we decided that we want to extend this for SSL, it would be possible to change just the transport layer and make it work for SSL. We may never want to do this, but thought I would point it out anyway.

I think the next step would be to add the low-level approach to the KIP under `Rejected Alternatives` with details on the differences. And more detail on requirements covering latency and retries in the body of the KIP. For "Rejected Alternatives", can we add sub-sections (I think there are two approaches there now, but you need to read through to figure that out, so sub-titles would help).

Once the KIP is updated, it will be good to get more feedback from the community to decide on the approach to adopt.

On Sat, Sep 8, 2018 at 1:27 PM, Ron Dagostino <rndg...@gmail.com> wrote:

Hi yet again, Rajini.
If we decide that interleaving is impossible with the low-level approach but we still want to at least support the possibility given that it reduces the size of latency spikes, then I do have a suggestion. I documented in the KIP how I rejected a single, one-size-fits-all queue approach because we don't know when poll() will be called in the synchronous I/O use case. An advantage of such an approach -- if it would have worked, which it didn't -- is that it would have been transparent to the owner of the NetworkClient instance (this is one of the big advantages of the low-level approach, after all). But we could make the one-size-fits-all queue approach work if we are willing to impose some requirements on synchronous I/O users of NetworkClient. Specifically, we could add a method to the org.apache.kafka.clients.KafkaClient interface (which NetworkClient implements) as follows:

    /**
     * Return true if any node has re-authentication requests waiting to be
     * sent. A call to {@link #poll(long, long)} is required to send such
     * requests. An owner of this instance that does not implement a run loop
     * to repeatedly call {@link #poll(long, long)} but instead only sends
     * requests synchronously on-demand must call this method periodically --
     * and invoke {@link #poll(long, long)} if the return value is {@code true}
     * -- to ensure that any re-authentication requests that have been injected
     * are sent in a timely fashion.
     *
     * @return true if any node has re-authentication requests waiting to be sent
     */
    default boolean hasReauthenticationRequests() {
        return false;
    }

If we are willing to add an additional method like this and make the minor adjustments to the code associated with the few synchronous use cases, then I think the high-level approach will work. We would then have the possibility of further parameterizing the various synchronous I/O use cases.
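To make the proposed contract concrete, here is a minimal, self-contained sketch of how a synchronous on-demand owner might honor such a method. The names MiniKafkaClient and QueueingClient are hypothetical stand-ins, not actual Kafka classes; this is only an illustration of the calling pattern, not the real implementation.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical stand-in for org.apache.kafka.clients.KafkaClient.
interface MiniKafkaClient {
    void poll(long timeoutMs);

    // The proposed addition: defaulting to false keeps existing
    // implementations source-compatible.
    default boolean hasReauthenticationRequests() {
        return false;
    }
}

// Toy implementation that queues injected re-authentication requests
// and only sends them when poll() is invoked.
class QueueingClient implements MiniKafkaClient {
    final Queue<String> reauthRequests = new ArrayDeque<>();
    int reauthRequestsSent = 0;

    @Override
    public void poll(long timeoutMs) {
        // poll() is what actually drains injected re-auth requests.
        while (!reauthRequests.isEmpty()) {
            reauthRequests.poll();
            reauthRequestsSent++;
        }
    }

    @Override
    public boolean hasReauthenticationRequests() {
        return !reauthRequests.isEmpty();
    }
}

public class SyncOwnerSketch {
    public static void main(String[] args) {
        QueueingClient client = new QueueingClient();
        client.reauthRequests.add("SaslAuthenticateRequest");

        // A synchronous owner has no run loop, so between its own
        // on-demand sends it must periodically check for injected
        // re-authentication requests and poll() to flush them.
        if (client.hasReauthenticationRequests()) {
            client.poll(0L);
        }
        System.out.println("sent=" + client.reauthRequestsSent); // prints sent=1
    }
}
```

The key design point this illustrates: the burden falls on the few synchronous owners (one periodic check plus a conditional poll), while run-loop-based users need no changes at all.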
For example, there are currently 3:

    ControllerChannelManager
    KafkaProducer, via Sender
    ReplicaFetcherBlockingSend, via ReplicaFetcherThread

Is it conceivable that one of these use cases might be more latency-sensitive than others and might desire interleaving? A high-level approach would give us the option of configuring each use case accordingly.

Ron

On Fri, Sep 7, 2018 at 10:50 PM Ron Dagostino <rndg...@gmail.com> wrote:

Hi again, Rajini. It occurs to me that from a *behavior* perspective there are really 3 fundamental differences between the low-level approach you provided and the high-level approach as it currently exists in the PR:

1) *When re-authentication starts*. The low-level approach initiates re-authentication only if/when the connection is actually used, so it may start after the existing credential expires; the current PR implementation injects re-authentication requests into the existing flow, and re-authentication starts immediately regardless of whether or not the connection is being used for something else.
2) *When requests not related to re-authentication can use the re-authenticating connection*. The low-level approach finishes re-authentication completely before allowing anything else to traverse the connection; the current PR implementation interleaves re-authentication requests with existing flow requests.

3) *What happens if re-authentication fails*. The low-level approach results in a closed connection and does not support retries -- at least as currently implemented; the current PR implementation supports retries.

Do you agree that these are the (only?) behavioral differences?

For these facets of behavior, I wonder what the requirements are for this feature. I think they are as follows:

1) *When re-authentication starts*: I don't think we have a hard requirement here when considered in isolation -- whether re-authentication starts immediately or is delayed until the connection is used probably doesn't matter.

2) *When requests not related to re-authentication can use the re-authenticating connection*: there is a tradeoff here between latency and ability to debug re-authentication problems.
Delaying use of the connection until re-authentication finishes results in bigger latency spikes but keeps re-authentication requests somewhat together; interleaving minimizes the size of individual latency spikes but adds some separation between the requests.

3) *What happens if re-authentication fails*: I believe we have a clear requirement for retry capability given what I have written previously. Do you agree? Note that while the current low-level implementation does not support retry, I have been thinking about how that can be done, and I am pretty sure it can be. We can keep the old authenticators on the client and server sides and put them back into place if the re-authentication fails; we would also have to make sure the server side doesn't delay any failed re-authentication result and also doesn't close the connection upon re-authentication failure. I think I see how to do all of that.

There are some interactions between the above requirements.
If re-authentication can't start immediately and has to wait for the connection to be used, then that precludes interleaving, because we can't be sure that the credential will be active by the time it is used -- if it isn't active, and we allow interleaving, then requests not related to re-authentication will fail if server-side connection-close-due-to-expired-credential functionality is in place. Also note that any such server-side connection-close-due-to-expired-credential functionality would likely have to avoid closing a connection until it is used for anything other than re-authentication -- it must allow re-authentication requests through when the credential is expired.

Given all of the above, it feels to me like the low-level solution is viable only under the following conditions:

1) We must accept a delay from when re-authentication becomes possible to when the connection is actually used
2) We must accept the bigger latency spikes associated with not interleaving requests

Does this sound right to you, and if so, do you find these conditions acceptable?
Or have I missed something and/or made incorrect assumptions somewhere?

Ron

On Fri, Sep 7, 2018 at 5:19 PM Ron Dagostino <rndg...@gmail.com> wrote:

Hi Rajini. The code really helped me to understand what you were thinking -- thanks. I'm still digesting, but here are some quick observations:

Your approach (I'll call it the "low-level" approach, as compared to the existing PR, which works at a higher level of abstraction) is certainly intriguing. The smaller code size is welcome, of course, as is the fact that re-authentication simply works for everyone regardless of the style of use (async vs. sync I/O).

I did notice that re-authentication of the connection starts only if/when the client uses the connection. For example, if I run a console producer, re-authentication does not happen unless I try to produce something.
On the one hand this is potentially good -- if the client isn't using the connection then the connection stays "silent" and could be closed on the server side if it stays idle long enough -- whereas if we start injecting re-authentication requests (as is done in the high-level approach) then the connection never goes completely silent and could (?) potentially avoid being closed due to idleness.

However, if we implement server-side kill of connections using expired credentials (which we agree is needed), then we might end up with the broker killing connections that are sitting idle for only a short period of time. For example, if we refresh the token on the client side and tell the connection that it is eligible to be re-authenticated, then it is conceivable that the connection might be sitting idle at that point and might not be used until after the token it is currently using expires. The server might kill the connection, and that would force the client to re-connect with a new connection (requiring TLS negotiation).
The probability of this happening increases as the token lifetime decreases, of course, and it can be offset by decreasing the window factor (i.e. make the connection eligible for re-authentication at 50% of the lifetime rather than 80%, for example -- it would have to sit idle for longer before the server tried to kill it). We haven't implemented server-side kill yet, so maybe we could make it intelligent and only kill the connection if it is used (for anything except re-authentication) after the expiration time...

I also wonder about the ability to add retry into the low-level approach. Do you think it would be possible? It doesn't seem like it to me -- at least not without some big changes that would eliminate some of the advantage of being able to reuse the existing authentication code. The reason I ask is because I think retry is necessary. It is part of how refreshes work for both GSSAPI and OAUTHBEARER -- they refresh based on some window factor (i.e. 80%) and implement periodic retry upon failure so that they can maximize the chances of having a new credential available for any new connection attempt. Without retry we could end up in the situation where the connection still has some lifetime left (10%, 20%, or whatever) but it tries to re-authenticate and cannot through no fault of its own (i.e. token endpoint down, some Kerberos failure, etc.) -- the connection is closed at that point, and it is then unable to reconnect because of the same temporary problem. We could end up with an especially ill-timed, temporary outage in some non-Kafka system (related to OAuth or Kerberos, or some LDAP directory) causing all clients to be kicked off the cluster. Retry capability seems to me to be the way to mitigate this risk.

Anyway, that's it for now. I really like the approach you outlined -- at least at this point based on my current understanding. I will continue to dig in, and I may send more comments/questions.
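The window-factor arithmetic described above can be sketched as follows. The class and method names here are hypothetical, not the actual Kafka refresh code; the sketch just shows how an 80% window factor leaves the remaining 20% of the credential lifetime as a budget for retries when the token endpoint or KDC is temporarily down.

```java
public class RefreshScheduleSketch {
    // Time at which the credential becomes eligible for re-authentication:
    // issue time plus windowFactor (e.g. 0.8) of the total lifetime.
    static long reauthEligibleAtMs(long issuedAtMs, long lifetimeMs, double windowFactor) {
        return issuedAtMs + (long) (lifetimeMs * windowFactor);
    }

    // Next retry after a failed attempt, capped at credential expiry --
    // retrying past expiry is pointless since the connection would be closed.
    static long nextRetryMs(long failedAtMs, long retryBackoffMs, long expiresAtMs) {
        return Math.min(failedAtMs + retryBackoffMs, expiresAtMs);
    }

    public static void main(String[] args) {
        long issued = 0L;
        long lifetime = 3_600_000L; // a 1-hour token

        // 80% window factor: eligible at the 48-minute mark, leaving
        // 12 minutes of lifetime to absorb transient failures.
        long eligible = reauthEligibleAtMs(issued, lifetime, 0.8);
        System.out.println("eligible at " + eligible + " ms"); // 2880000 ms

        // First attempt fails (token endpoint down); retry a minute later.
        long retry = nextRetryMs(eligible, 60_000L, issued + lifetime);
        System.out.println("retry at " + retry + " ms"); // 2940000 ms
    }
}
```

Lowering the window factor to 0.5, as suggested above, moves eligibility to the 30-minute mark and widens the retry budget, at the cost of re-authenticating more often.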
But for now, I think the lack of retry -- and my definitely-could-be-wrong sense that it can't easily be added -- is my biggest concern with this low-level approach.

Ron

On Thu, Sep 6, 2018 at 4:57 PM Rajini Sivaram <rajinisiva...@gmail.com> wrote:

Hi Ron,

The commit https://github.com/rajinisivaram/kafka/commit/b9d711907ad843c11d17e80d6743bfb1d4e3f3fd shows the kind of flow I was thinking of. It is just a prototype with a fixed re-authentication period to explore the possibility of implementing re-authentication similar to authentication. There will be edge cases to cover and errors to handle, but hopefully the code makes the approach clearer than my earlier explanation!

So the differences in the two implementations are as you have already mentioned earlier:

1. Re-authentication sequences are not interleaved with Kafka requests. As you said, this has a higher impact on latency. IMO, this also makes it easier to debug, especially with complex protocols like Kerberos which are notoriously difficult to diagnose.
2. Re-authentication failures will not be retried; they will be treated as fatal errors similar to authentication failures. IMO, since we rely on brokers never rejecting valid authentication requests (clients treat authentication failures as fatal errors), it is ok to fail on re-authentication failure as well.
3. Re-authentication is performed on the network thread on brokers, similar to authentication (in your implementation, I think it would be on the request thread). IMO, it is better to do all authentications using the same thread pool.
4. Code complexity: I don't think actual code complexity would be very different between the two implementations. But I think there is value in keeping re-authentication code within existing network/security layers.
The number of classes modified will be smaller and the number of packages modified even smaller.

Anyway, let me know what you think:

- Will this approach work for your scenarios?
- Do we really need to handle re-authentication differently from authentication?

On Thu, Sep 6, 2018 at 1:40 PM, Ron Dagostino <rndg...@gmail.com> wrote:

Yeah, regarding ControllerChannelManager, it is one of the synchronous I/O use cases (along with 2 others: KafkaProducer, via Sender; and ReplicaFetcherBlockingSend, via ReplicaFetcherThread) where the assumption is complete ownership of the connection. The PR's approach to dealing with that assumption is to inject the re-authentication requests into the owner's existing flow so that they are simply sent with a callback that NetworkClient executes.
The alternative approach, which is what I think you are investigating, is to allow the connection to be temporarily taken offline and have any attempt by ControllerChannelManager (or another synchronous use case owner) to use the connection while it is in this "offline" state result in some kind of pause until the connection comes back online. One issue with this approach might be the length of time that the connection is unavailable; will it be offline for all authentication requests and responses (ApiVersionsRequest/Response, SaslHandshakeRequest/Response, and SaslAuthenticateRequest/Response)? Note the last one could actually be invoked multiple times, so there could be 4 or more round-trips before the authentication "dance" is finished.
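To make the round-trip count concrete, here is a small hedged sketch (a hypothetical helper, not Kafka code) enumerating the request sequence described above; the SaslAuthenticateRequest round count is a parameter because mechanisms such as GSSAPI need multiple challenge/response exchanges.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SaslDanceSketch {
    // The requests in the authentication "dance", in order. Each entry
    // is one client round-trip (request plus its response).
    static List<String> danceRequests(int saslAuthenticateRounds) {
        List<String> requests = new ArrayList<>();
        requests.add("ApiVersionsRequest");
        requests.add("SaslHandshakeRequest");
        // SaslAuthenticateRequest may repeat, e.g. for multi-round
        // GSSAPI challenge/response exchanges.
        requests.addAll(Collections.nCopies(saslAuthenticateRounds,
                "SaslAuthenticateRequest"));
        return requests;
    }

    public static void main(String[] args) {
        // With two SaslAuthenticate rounds, the connection is tied up
        // for four round-trips before normal traffic can resume.
        List<String> dance = danceRequests(2);
        System.out.println(dance.size() + " round-trips: " + dance);
    }
}
```

This is why the "offline the entire time" variant can stall a synchronous owner for several network round-trips, while releasing the connection between each pair shrinks each stall but forces the authentication process to re-acquire ownership repeatedly.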
Will the connection be "offline" the entire time, or will it be placed back "online" in between each request/response pair to allow the owner of the connection to use it -- in which case the authentication process would have to wait to get ownership again? The approach I took interleaves the authentication requests/responses with whatever the owner is doing, so it is conceivable that use of the connection jumps back and forth between the two purposes. Such jumping back and forth minimizes any added latency due to the re-authentication, of course.

Anyway, I'll look forward to finding out what you are able to conclude. Good luck :-)

Ron

On Thu, Sep 6, 2018 at 5:17 AM Rajini Sivaram <rajinisiva...@gmail.com> wrote:

Hi Ron,

Disconnections on the broker-side: I think we should do disconnections as a separate KIP and PR as you originally intended.
But that one could be done separately without requiring KIP-368 as a pre-req. As a simpler implementation and one that can be used without KIP-368 in some cases, we could commit that first since this one may take longer.

I wasn't suggesting that we move re-authentication to NetworkClient; I was thinking of the lower layers, handling authentication and re-authentication at the same layer in a similar way. Let me look into the code and come up with a more detailed explanation to avoid confusion.

I am not too concerned about the imports in KafkaChannel. As you said, we can improve that if we need to. KafkaChannels are aware of network/authentication states, and if that becomes a bit more complex, I don't think it would matter so much. My concern is about changes like https://github.com/apache/kafka/pull/5582/files#diff-987fef43991384a3ebec5fb55e53b577 in ControllerChannelManager.
Those classes shouldn't have to deal with SASL or re-authentication. Anyway, I will get back with more detail on what I had in mind since that may not be viable at all.

On Thu, Sep 6, 2018 at 1:44 AM, Ron Dagostino <rndg...@gmail.com> wrote:

I just thought of another alternative if the imports are the concern. KafkaChannel could expose the fact that it can create an additional Authenticator instance on the side (what I referred to as notYetAuthenticatedAuthenticator in the PR) and it could let kafka.server.KafkaApis drive the whole thing -- create the instance on the side, clean it up if it fails, move it into place and close the old one if it succeeds, etc. Then Kaf