Hi Rajini.

Agreed, we will bump the SaslHandshakeRequest API number (no wire format
change, of course).
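
To make that gate concrete, the client-side check could look something like
this sketch (not code from the PR; the threshold is whatever version number
we end up assigning -- 2, per your suggestion):

    import org.apache.kafka.clients.NodeApiVersions;
    import org.apache.kafka.common.protocol.ApiKeys;

    // Sketch: attempt re-authentication only if the broker advertises
    // the bumped SaslHandshakeRequest version.
    private boolean brokerSupportsReauthentication(NodeApiVersions versions) {
        return versions.latestUsableVersion(ApiKeys.SASL_HANDSHAKE) >= 2;
    }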
Agreed, responding to re-authentication will always be enabled on the
broker since it is initiated by the client.

Agreed, connection termination by the broker will be opt-in.  I'm thinking
*sasl.server.disconnect.expired.credential.connections.enable*, prefixed
with listener and SASL mechanism name in lower-case so it can be opt-in at
the granularity of a SASL mechanism; for example, "
listener.name.sasl_ssl.oauthbearer.sasl.server.disconnect.expired.credential.connections.enable=true
".  It is long, so if you have a preference for a shorter name let me know
and we'll go with it instead.
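
For example, an installation could then opt in for OAUTHBEARER while leaving
GSSAPI off on the same listener with something like:

    listener.name.sasl_ssl.oauthbearer.sasl.server.disconnect.expired.credential.connections.enable=true
    listener.name.sasl_ssl.gssapi.sasl.server.disconnect.expired.credential.connections.enable=false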

Agreed, additional metrics will be helpful.  I'll add them to the KIP.

<<< need to work out exactly how this works with all SASL mechanisms
<<< not sure we have the data required on the broker or a way of
<<< extending the [GSSAPI] mechanism
Not sure I agree here.  The paragraph I added to the KIP describes how I
think this can be done.  Given that description, do you still feel Kerberos
will not be possible?  If Kerberos presents a significant hurdle then I
don't think that should prevent us from doing it for other mechanisms -- I
would rather state that it isn't supported with GSSAPI due to limitations
in Kerberos itself than not have it for OAUTHBEARER, for example.  And
right now I don't think we need more than a high-level description of how
this could be done.  In other words, I think we have this covered, unless
there is a fundamental problem due to Kerberos not making data (i.e. ticket
expiration) available on the server.  I want to submit this KIP for a vote
early next week in the hope of having it accepted by the Monday, 9/24
deadline for the 2.1.0 release, and if that happens I believe I can get a
PR into really good shape soon thereafter (I'm working on it now).
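
To be concrete about where the broker-side expiration data could come from,
here is a rough sketch (mine, not from the KIP or PR) of computing the
remaining session lifetime per mechanism; the GSSAPI half is exactly the
part that is in question:

    import org.apache.kafka.common.security.oauthbearer.OAuthBearerToken;
    import org.ietf.jgss.GSSContext;

    // OAUTHBEARER: the validated token carries its own expiration time.
    long oauthRemainingMs(OAuthBearerToken token, long nowMs) {
        return Math.max(0L, token.lifetimeMs() - nowMs);
    }

    // GSSAPI: JGSS exposes the established context's remaining lifetime
    // (in seconds); whether that reliably reflects ticket expiration on
    // the broker side is the open question.
    long gssapiRemainingMs(GSSContext context) {
        return context.getLifetime() * 1000L;
    }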

Ron

On Fri, Sep 14, 2018 at 5:57 AM Rajini Sivaram <rajinisiva...@gmail.com>
wrote:

> Hi Ron,
>
> 1) Yes, we should update the version of `SaslHandshakeRequest`. Clients
> would attempt to re-authenticate only if broker's SaslHandshakeRequest
> version >=2.
> 2) I think we should enable broker's re-authentication and connection
> termination code regardless of client version. Re-authentication could be
> always enabled since it is triggered only by clients and broker could just
> handle it. Connection termination should be opt-in, but should work with
> clients of any version. If turned on and clients are of an older version,
> this would simply force reconnection, which should be ok. Additional
> metrics to monitor expired connections may be useful anyway.
>
> We also need to work out exactly how this works with all SASL mechanisms.
> In particular, we have to ensure that we can make it work with Kerberos since
> we use the implementation from the JRE and I am not sure we have the data
> required on the broker or a way of extending the mechanism.
>
> Thoughts?
>
> On Thu, Sep 13, 2018 at 6:56 PM, Ron Dagostino <rndg...@gmail.com> wrote:
>
> > Hi Rajini.  I'm thinking about how we deal with migration.   For example,
> > let's say we have an existing 2.0.0 cluster using SASL/OAUTHBEARER and we
> > want to use this feature.  The desired end state is to have all clients
> and
> > brokers migrated to a version that supports the feature (call it 2.x)
> with
> > the feature turned on.  We need to document the supported path(s) to this
> > end state.
> >
> > Here are a couple of scenarios with implications:
> >
> > 1) *Migrate client to 2.x and turn the feature on but broker is still at
> > 2.0.*  In this case the client is going to get an error when it sends the
> > SaslHandshakeRequest.  We want to avoid this.  It seems to me the client
> > needs to know if the broker has been upgraded to the necessary version
> > before trying to re-authenticate, which makes me believe we need to bump
> > the API version number in 2.x and the client has to check to see if the
> max
> > version the broker supports meets a minimum standard before trying to
> > re-authenticate.  Do you agree?
> >
> > 2) *Migrate broker to 2.x and turn the feature on but client is still at
> > 2.0*.  In this case the broker is going to end up killing connections
> > periodically.   Again, we want to avoid this.  It's tempting to say
> "don't
> > do this" but I wonder if we can require installations to upgrade all
> > clients before turning the feature on at the brokers.  Certainly we can
> say
> > "don't do this" for inter-broker clients -- in other words, we can say
> that
> > all brokers in a cluster should be upgraded before the feature is turned
> on
> > for any one of them -- but I don't know about our ability to say it for
> > non-broker clients.
> >
> > Now we consider the cases where both the brokers and the clients have
> been
> > upgraded to 2.x.  When and where should the feature be turned on?  The
> > inter-broker case is interesting because I don't think we can
> require
> > installations to restart every broker with a new config where the feature
> > is turned on at the same time.  Do you agree that we cannot require this
> > and therefore must support installations restarting brokers with a new
> > config at whatever pace they need -- which may be quite slow relative to
> > token lifetimes?  Assuming you do agree, then there is going to be the
> case
> > where some brokers are going to have the feature turned on and some
> clients
> > (definitely inter-broker clients at a minimum) are going to have the
> > feature turned off.
> >
> > The above discussion assumes a single config on the broker side that
> turns
> > on both the inter-broker client re-authentication feature as well as the
> > server-side expired-connection-kill feature, but now I'm thinking we need
> > the ability to turn these features on independently, plus maybe we need a
> > way to monitor the number of "active, expired" connections to the server
> so
> > that operators can be sure that all clients have been upgraded/enabled
> > before turning on the server-side expired-connection-kill feature.
> >
> > So the migration plan would be as follows:
> >
> > 1) Upgrade all brokers to 2.x.
> > 2) After all brokers are upgraded, turn on re-authentication for all
> > brokers at whatever rate is desired -- just eventually, at some point,
> get
> > the client-side feature turned on for all brokers so that inter-broker
> > connections are re-authenticating.
> > 3) In parallel with (1) and (2) above, upgrade clients to 2.x and turn
> > their re-authentication feature on.  Clients will check the API version
> and
> > only re-authenticate to a broker that has also been upgraded (note that
> the
> > ability of a broker to respond to a re-authentication cannot be turned
> off
> > -- it is always on beginning with version 2.x, and it just sits there
> doing
> > nothing if it isn't exercised by an enabled client)
> > 4) After (1), (2), and (3) are complete, check the 2.x broker metrics to
> > confirm that there are no "active, expired" connections.  Once you are
> > satisfied that (1), (2), and (3) are indeed complete you can enable the
> > server-side expired-connection-kill feature on each broker via a restart
> at
> > whatever pace is desired.
> >
> > Comments?
> >
> > Ron
> >
> >
> > On Wed, Sep 12, 2018 at 4:48 PM Ron Dagostino <rndg...@gmail.com> wrote:
> >
> > > Ok, I am tempted to just say we go with the low-level approach since it
> > is
> > > the quickest and seems to meet the clear requirements. We can always
> > adjust
> > > later if we get to clarity on other requirements or we decide we need
> to
> > > approach it differently for whatever reason.  But in the meantime,
> before
> > > fully committing to this decision, I would appreciate another
> perspective
> > > if someone has one.
> > >
> > > Ron
> > >
> > > On Wed, Sep 12, 2018 at 3:15 PM Rajini Sivaram <
> rajinisiva...@gmail.com>
> > > wrote:
> > >
> > >> Hi Ron,
> > >>
> > >> Yes, I would leave out retries from this KIP for now. In the future,
> if
> > >> there is a requirement for supporting retries, we can consider it. I
> > think
> > >> we can support retries with either approach if we needed to, but it
> > would
> > >> be better to do it along with other changes required to support
> > >> authentication servers that are not highly available.
> > >>
> > >> For maintainability, I am biased, so it will be good to get the
> > >> perspective
> > >> of others in the community :-)
> > >>
> > >> On Wed, Sep 12, 2018 at 7:47 PM, Ron Dagostino <rndg...@gmail.com>
> > wrote:
> > >>
> > >> > Hi Rajini.  Here is some feedback/some things I thought of.
> > >> >
> > >> > First, I just realized that from a timing perspective I am not
> > sure
> > >> > retry is going to be necessary.  The background login refresh thread
> > >> > triggers re-authentication when it refreshes the client's
> credential.
> > >> The
> > >> > OAuth infrastructure has to be available in order for this refresh
> to
> > >> > succeed (the background thread repeatedly retries if it can't
> refresh
> > >> the
> > >> > credential, and that retry loop handles any temporary outage).  If
> > >> clients
> > >> > are told to re-authenticate when the credential is refreshed **and
> > they
> > >> > actually re-authenticate at that point** then it is highly unlikely
> > that
> > >> > the OAuth infrastructure would fail within those intervening
> > >> milliseconds.
> > >> > So we don't need re-authentication retry in this KIP as long as
> retry
> > >> > starts immediately.  The low-level prototype as currently coded
> > doesn't
> > >> > actually start re-authentication until the connection is
> subsequently
> > >> used,
> > >> > and that could take a while.  But then again, if the connection
> isn't
> > >> used
> > >> > for some period of time, then the lost value of having to abandon
> the
> > >> > connection is lessened anyway.  Plus, as you pointed out, OAuth
> > >> > infrastructure is assumed to be highly available.
> > >> >
> > >> > Does this make sense, and would you be willing to say that retry
> > isn't
> > >> a
> > >> > necessary requirement?  I'm tempted but would like to hear your
> > >> perspective
> > >> > first.
> > >> >
> > >> > Interleaving/latency and maintainability (more than lines of code)
> are
> > >> the
> > >> > two remaining areas of comparison.  I did not add the
> maintainability
> > >> issue
> > >> > to the KIP yet, but before I do I thought I would address it here
> > first
> > >> to
> > >> > see if we can come to consensus on it because I'm not sure I see the
> > >> > high-level approach as being hard to maintain (yet -- I could be
> > >> > convinced/convince myself; we'll see).  I do want to make sure we
> are
> > on
> > >> > the same page about what is required to add re-authentication
> support
> > in
> > >> > the high-level case.  Granted, the amount is zero in the low-level
> > case,
> > >> > and it doesn't get any better than that, but the amount in the
> > >> high-level
> > >> > case is very low -- just a few lines of code.  For example:
> > >> >
> > >> > KafkaAdminClient:
> > >> > https://github.com/apache/kafka/pull/5582/commits/4fa70f38b9
> > >> > d33428ff98b64a3a2bd668f5f28c38#diff-6869b8fccf6b098cbcb0676e8ceb26a7
> > >> > It is the same few lines of code for KafkaConsumer, KafkaProducer,
> > >> > WorkerGroupMember, and TransactionMarkerChannelManager
> > >> >
> > >> > The two synchronous I/O use cases are ControllerChannelManager and
> > >> > ReplicaFetcherBlockingSend (via ReplicaFetcherThread), and they
> > require
> > >> a
> > >> > little bit more -- but not much.
> > >> >
> > >> > Thoughts?
> > >> >
> > >> > Ron
> > >> >
> > >> > On Wed, Sep 12, 2018 at 1:57 PM Ron Dagostino <rndg...@gmail.com>
> > >> wrote:
> > >> >
> > >> > > Thanks, Rajini.  Before I digest/respond to that, here's an update
> > >> that I
> > >> > > just completed.
> > >> > >
> > >> > > I added a commit to the PR (
> > >> https://github.com/apache/kafka/pull/5582/)
> > >> > > that implements server-side kill of expired OAUTHBEARER
> connections.
> > >> No
> > >> > > tests yet since we still haven't settled on a final approach
> > >> (low-level
> > >> > vs.
> > >> > > high-level).
> > >> > >
> > >> > > I also updated the KIP to reflect the latest discussion and PR as
> > >> > follows:
> > >> > >
> > >> > >    - Include support for brokers killing connections as part of
> this
> > >> KIP
> > >> > >    (rather than deferring it to a future KIP as was originally
> > >> > mentioned; the
> > >> > >    PR now includes it as mentioned -- it was very easy to add)
> > >> > >    - Added metrics (they will mirror existing ones related to
> > >> > >    authentications; I have not added those to the PR)
> > >> > >    - Updated the implementation description to reflect the current
> > >> state
> > >> > >    of the PR, which is a high-level, one-size-fits-all approach
> (as
> > >> > opposed to
> > >> > >    my initial, even-higher-level approach)
> > >> > >    - Added a "Rejected Alternative" for the first version of the
> PR,
> > >> > >    which injected requests directly into synchronous I/O clients'
> > >> queues
> > >> > >    - Added a "Rejected Alternative" for the low-level approach as
> > >> > >    suggested, but of course we have not formally decided to reject
> > >> this
> > >> > >    approach or adopt the current PR implementation.
> > >> > >
> > >> > > I'll think about where we stand some more before responding again.
> > >> > Thanks
> > >> > > for the above reply.
> > >> > >
> > >> > > Ron
> > >> > >
> > >> > > On Wed, Sep 12, 2018 at 1:36 PM Rajini Sivaram <
> > >> rajinisiva...@gmail.com>
> > >> > > wrote:
> > >> > >
> > >> > >> Hi Ron,
> > >> > >>
> > >> > >> Thank you for summarising, I think it covers the differences
> > between
> > >> the
> > >> > >> two approaches well.
> > >> > >>
> > >> > >> A few minor points to answer the questions in there:
> > >> > >>
> > >> > >> 1) When re-authentication is initiated in the Selector during
> > poll(),
> > >> we
> > >> > >> can
> > >> > >> move an idle channel to re-authentication state. It is similar to
> > >> > >> injecting
> > >> > >> requests, but achieved by changing channel back to authenticating
> > >> state.
> > >> > >>
> > >> > >> 3) To clarify why I think re-authentication should fit in with
> our
> > >> > >> authentication design: My point was not about a specific
> connection
> > >> > being
> > >> > >> usable or not usable. It was about what happens at the client API
> > >> level.
> > >> > >> Our client API (producer/consumer/admin client etc.) currently
> > assumes
> > >> > that
> > >> > >> a single broker authentication failure is a fatal error that is
> > never
> > >> > >> retried because we assume that broker only ever fails an
> > >> authentication
> > >> > >> request if credentials are invalid. If we ever decide to support
> > >> cases
> > >> > >> where broker occasionally fails an authentication request due to
> a
> > >> > >> transient failure, we need to do more around how we handle
> > >> > authentication
> > >> > >> failures in clients. We may decide that it is ok to close the
> > >> connection
> > >> > >> for authentication and not for re-authentication as you
> mentioned,
> > >> but
> > >> > we
> > >> > >> need to change the way this disconnection is handled by clients.
> So
> > >> IMO,
> > >> > >> we
> > >> > >> should either add support for transient retriable authentication
> > >> > failures
> > >> > >> properly or not retry for any scenario. Personally, I don't think
> > we
> > >> > would
> > >> > >> want to retry all authentication failures even if it is a
> > >> > >> re-authentication, I think we could (at some point in future),
> > allow
> > >> > >> brokers to return an error code that indicates that it is a
> > transient
> > >> > >> broker-side failure rather than invalid credentials and handle
> the
> > >> error
> > >> > >> differently. I see no reason at that point why we wouldn't handle
> > >> > >> authentication and re-authentication in the same way.
> > >> > >>
> > >> > >> 4) As you said, the high-level approach would be bigger than the
> > >> > low-level
> > >> > >> approach in terms of LOC. But I wouldn't be too concerned about
> > >> lines of
> > >> > >> code. My bigger concern was about modularity. Our security code
> is
> > >> > already
> > >> > >> complex, protocols like Kerberos and SSL that we use from the JRE
> > >> make
> > >> > >> problem diagnosis hard. Async I/O makes the networking code
> > complex.
> > >> You
> > >> > >> need to understand the networking layer to work with the security
> > layer,
> > >> but
> > >> > >> the rest of the code base doesn't rely on knowledge of
> > >> network/security
> > >> > >> layers. My main concern about the high-level approach is that it
> > >> spans
> > >> > >> these boundaries, making it harder to maintain in the long run.
> > >> > >>
> > >> > >>
> > >> > >> On Wed, Sep 12, 2018 at 10:23 AM, Stanislav Kozlovski <
> > >> > >> stanis...@confluent.io> wrote:
> > >> > >>
> > >> > >> > Hi Ron, Rajini
> > >> > >> >
> > >> > >> > Thanks for summarizing the discussion so far, Ron!
> > >> > >> >
> > >> > >> > 1) How often do we have such long-lived connection idleness
> (e.g
> > >> 5-10
> > >> > >> > minutes) in practice?
> > >> > >> >
> > >> > >> > 3) I agree that retries for re-authentication are useful.
> > >> > >> >
> > >> > >> > 4) The interleaving of requests sounds like a great feature to
> > >> have,
> > >> > but
> > >> > >> > the tradeoff against code complexity is a tough one. I would
> > >> > personally
> > >> > >> go
> > >> > >> > with the simpler approach since you could always add
> interleaving
> > >> on
> > >> > >> top if
> > >> > >> > the community decides the latency should be better.
> > >> > >> >
> > >> > >> > Best,
> > >> > >> > Stanislav
> > >> > >> >
> > >> > >> > On Tue, Sep 11, 2018 at 5:00 AM Ron Dagostino <
> rndg...@gmail.com
> > >
> > >> > >> wrote:
> > >> > >> >
> > >> > >> > > Hi everyone.  I've updated the PR to reflect the latest
> > >> conclusions
> > >> > >> from
> > >> > >> > > this ongoing discussion.  The KIP still needs the suggested
> > >> > updates; I
> > >> > >> > will
> > >> > > do that later this week.  I agree with Rajini that some
> > >> additional
> > >> > >> > > feedback from the community at large would be very helpful at
> > >> this
> > >> > >> point
> > >> > >> > in
> > >> > >> > > time.
> > >> > >> > >
> > >> > >> > > Here's where we stand.
> > >> > >> > >
> > >> > >> > > We have 2 separate implementations for re-authenticating
> > existing
> > >> > >> > > connections -- a so-called "low-level" approach and a
> > (relatively
> > >> > >> > speaking)
> > >> > >> > > "high-level" approach -- and we agree on how they should be
> > >> > compared.
> > >> > >> > > Specifically, Rajini has provided a mechanism that works at a
> > >> > >> relatively
> > >> > >> > > low level in the stack by intercepting network sends and
> > queueing
> > >> > >> them up
> > >> > >> > > while re-authentication happens; then the queued sends are
> sent
> > >> > after
> > >> > >> > > re-authentication succeeds, or the connection is closed if
> > >> > >> > > re-authentication fails.  See the prototype commit at
> > >> > >> > >
> > >> > >> > > https://github.com/rajinisivaram/kafka/commit/b9d711907ad843
> > >> > >> > c11d17e80d6743bfb1d4e3f3fd
> > >> > >> > > .
> > >> > >> > >
> > >> > >> > > I have an implementation that works higher up in the stack;
> it
> > >> > injects
> > >> > >> > > authentication requests into the KafkaClient via a new method
> > >> added
> > >> > to
> > >> > >> > that
> > >> > >> > > interface, and the implementation (i.e. NetworkClient) sends
> > them
> > >> > when
> > >> > >> > > poll() is called.  The updated PR is available at
> > >> > >> > > https://github.com/apache/kafka/pull/5582/.
> > >> > >> > >
> > >> > >> > > Here is how we think these two approaches should be compared.
> > >> > >> > >
> > >> > >> > > 1) *When re-authentication starts*.  The low-level approach
> > >> > initiates
> > >> > >> > > re-authentication only if/when the connection is actually
> used,
> > >> so
> > >> > it
> > >> > >> may
> > >> > >> > > start after the existing credential expires; the current PR
> > >> > >> > implementation
> > >> > >> > > injects re-authentication requests into the existing flow,
> and
> > >> > >> > > re-authentication starts immediately regardless of whether or
> > not
> > >> > the
> > >> > >> > > connection is being used for something else.  Rajini believes
> > the
> > >> > >> > low-level
> > >> > >> > > approach can be adjusted to re-authenticate idle connections
> > >> > (Rajini,
> > >> > >> I
> > >> > >> > > don't see how this can be done without injecting requests,
> > which
> > >> is
> > >> > >> what
> > >> > >> > > the high-level connection is doing, but I am probably missing
> > >> > >> something
> > >> > >> > --
> > >> > >> > > no need to go into it unless it is easy to explain.)
> > >> > >> > >
> > >> > >> > > 2) *When requests not related to re-authentication are able
> to
> > >> use
> > >> > the
> > >> > >> > > re-authenticating connection*.  The low-level approach
> finishes
> > >> > >> > > re-authentication completely before allowing anything else to
> > >> > traverse
> > >> > >> > the
> > >> > >> > > connection, and we agree that this is how the low-level
> > >> > implementation
> > >> > >> > must
> > >> > >> > > work without significant work/changes.  The current PR
> > >> > implementation
> > >> > >> > > interleaves re-authentication requests with the existing flow
> > in
> > >> the
> > >> > > asynchronous I/O use cases (Consumer, Producer, Admin
> Client,
> > >> > etc.),
> > >> > >> and
> > >> > >> > > it allows the two synchronous use cases
> > (ControllerChannelManager
> > >> > and
> > >> > >> > > ReplicaFetcherBlockingSend/ReplicaFetcherThread) to decide
> > which
> > >> > style
> > >> > >> > they
> > >> > >> > > want.  I (somewhat arbitrarily, to prove out the flexibility
> of
> > >> the
> > >> > >> > > high-level approach) coded ControllerChannelManager to do
> > >> complete,
> > >> > >> > > non-interleaved re-authentication and
> > ReplicaFetcherBlockingSend/
> > >> > >> > > ReplicaFetcherThread to interleave requests.  The approach
> has
> > >> > >> impacts on
> > >> > >> > > the maximum size of latency spikes: interleaving requests can
> > >> > decrease
> > >> > >> > the
> > >> > >> > > latency.  The benefit of interleaving is tough to evaluate
> > >> because
> > >> > it
> > >> > >> > isn't
> > >> > >> > > clear what the latency requirement really is.  Note that
> > >> > >> > re-authentication
> > >> > >> > > can entail several network round-trips between the client and
> > the
> > >> > >> broker.
> > >> > >> > > Comments in this area would be especially appreciated.
> > >> > >> > >
> > >> > >> > > 3) *What happens if re-authentication fails (i.e. retry
> > >> > >> capability)*.  I
> > >> > >> > > think this is where we have not yet settled on what the
> > >> requirement
> > >> > is
> > >> > >> > > (even more so than the issue of potential latency mitigation
> > >> > >> > > requirements).  Rajini, you had mentioned that
> > re-authentication
> > >> > >> should
> > >> > >> > > work the same way as authentication, but I see the two
> > >> situations as
> > >> > >> > being
> > >> > >> > > asymmetric.  In the authentication case, if authentication
> > fails,
> > >> > the
> > >> > >> > > connection was never fully established and could not be used,
> > >> and it
> > >> > >> is
> > >> > >> > > closed -- TLS negotiation may have taken place, so that
> effort
> > is
> > >> > >> lost,
> > >> > >> > but
> > >> > >> > > beyond that nothing else is lost.  In the re-authentication
> > case
> > >> the
> > >> > >> > > connection is fully established, it is in use, and in fact it
> > can
> > >> > >> remain
> > >> > >> > in
> > >> > >> > > use for some time going forward as long as the existing
> > >> credential
> > >> > >> > remains
> > >> > >> > > unexpired; abandoning it at this point due to a
> > re-authentication
> > >> > >> failure
> > >> > >> > > (which can occur due to no fault of the client -- i.e. a
> remote
> > >> > >> directory
> > >> > >> > > or OAuth server being temporarily down) abandons a fully
> > >> functioning
> > >> > >> > > connection with remaining lifetime.  Am I thinking about this
> > >> > >> reasonably?
> > >> > >> > > I still tend to believe that retries of failed
> > re-authentications
> > >> > are
> > >> > >> a
> > >> > >> > > requirement.
> > >> > >> > >
> > >> > >> > > 4) *Size*. The low-level approach is smaller in terms of
> code.
> > >> It
> > >> > is
> > >> > >> a
> > >> > >> > > prototype, so it doesn't have everything the high-level
> > approach
> > >> > has,
> > >> > >> > but I
> > >> > >> > > have split the PR for the high-level approach into two
> separate
> > >> > >> commits
> > >> > >> > to
> > >> > >> > > facilitate a more apples-to-apples comparison: the first
> commit
> > >> > >> contains
> > >> > >> > > infrastructure that would have to be added to the low-level
> > >> > prototype
> > >> > >> if
> > >> > >> > we
> > >> > >> > > went with that, and the second commit has everything specific
> > to
> > >> the
> > >> > >> > > high-level approach.  So comparing the low-level prototype
> > commit
> > >> > >> (~350
> > >> > >> > > changes) to the second commit in the high-level PR (~1400
> > >> changes)
> > >> > is
> > >> > >> > > reasonable. The high-level approach at this point is about 4
> > >> times
> > >> > as
> > >> > >> > big,
> > >> > >> > > though it does have lots of error/sanity checking and
> > >> functionality
> > >> > >> (such
> > >> > >> > > as retry) that would to at least some degree have to be added
> > to
> > >> the
> > >> > >> > > low-level implementation.  Still, the high-level approach is
> > (and
> > >> > will
> > >> > >> > > likely remain) somewhat bigger than the low-level approach.
> > >> > >> > >
> > >> > >> > > I hope I presented this in a balanced and accurate way;
> Rajini,
> > >> > please
> > >> > >> > > correct me if I got anything wrong or left out anything
> > >> important.
> > >> > >> > > Apologies in advance if that is the case -- it is unintended.
> > >> > >> > >
> > >> > >> > > Again, comments/discussion from the wider community at this
> > >> juncture
> > >> > >> > would
> > >> > >> > > be especially appreciated.
> > >> > >> > >
> > >> > >> > > Ron
> > >> > >> > >
> > >> > >> > >
> > >> > >> > >
> > >> > >> > >
> > >> > >> > > On Mon, Sep 10, 2018 at 7:42 AM Rajini Sivaram <
> > >> > >> rajinisiva...@gmail.com>
> > >> > >> > > wrote:
> > >> > >> > >
> > >> > >> > > > Hi Ron,
> > >> > >> > > >
> > >> > >> > > > Thank you for the analysis. Yes, I agree with the
> differences
> > >> you
> > >> > >> have
> > >> > >> > > > pointed out between the two approaches.
> > >> > >> > > >
> > >> > >> > > >    1. Re-authenticating idle connections (or connections
> > >> nearing
> > >> > >> idle
> > >> > >> > > >    timeout): With the lower-level approach, there is
> nothing
> > >> > >> stopping
> > >> > >> > us
> > >> > >> > > > from
> > >> > >> > > >    re-authenticating idle connections. We could either
> > consider
> > >> > idle
> > >> > >> > > >    connections for re-authentication or we could implement
> > >> code in
> > >> > >> > broker
> > >> > >> > > > to
> > >> > >> > > >    kill connections that deals with delays introduced by
> > >> idleness.
> > >> > >> > Since
> > >> > >> > > we
> > >> > >> > > >    expect token lifetimes to be much higher than connection
> > >> expiry
> > >> > >> > time,
> > >> > >> > > >    either would work. And the second approach is probably
> > >> better.
> > >> > >> > > >    2. Interleaving authentication requests and application
> > >> > requests
> > >> > >> to
> > >> > >> > > >    reduce impact on latency: It is not impossible to
> > interleave
> > >> > with
> > >> > >> > the
> > >> > >> > > >    low-level approach, but it adds complexity and reduces
> the
> > >> > value
> > >> > >> of
> > >> > >> > > this
> > >> > >> > > >    approach. So it will be good to understand the
> > requirements
> > >> in
> > >> > >> terms
> > >> > >> > > of
> > >> > >> > > >    latency. In terms of comparing the two approaches, it is
> > >> fair
> > >> > to
> > >> > >> > > compare
> > >> > >> > > >    non-interleaved low-level with the naturally interleaved
> > >> > >> high-level
> > >> > >> > > >    approach.
> > >> > >> > > >    3. Retries: I don't think requirement for retries is
> > >> different
> > >> > >> > between
> > >> > >> > > >    authentication and re-authentication from a client's
> point
> > >> of
> > >> > >> view.
> > >> > >> > We
> > >> > >> > > >    should either support retries for both or not at all. We
> > >> > >> currently
> > >> > >> > > > process
> > >> > >> > > >    authentication failures as fatal errors. We expect that
> if
> > >> you
> > >> > >> are
> > >> > >> > > using
> > >> > >> > > >    external authentication servers, those servers will be
> > >> highly
> > >> > >> > > available.
> > >> > >> > > >    Our built-in authentication mechanisms avoid accessing
> > >> external
> > >> > >> > > servers
> > >> > >> > > > in
> > >> > >> > > >    the authentication path (during authentication,
> > >> verification is
> > >> > >> > > > performed
> > >> > >> > > >    using private/public keys or cache lookup). This is
> > >> necessary
> > >> > in
> > >> > >> our
> > >> > >> > > > design
> > >> > >> > > >    since authentications are performed synchronously on the
> > >> > network
> > >> > >> > > thread.
> > >> > >> > > >
> > >> > >> > > > One more point with the lower-level mechanism: If one day
> we
> > >> > decided
> > >> > >> > that
> > >> > >> > > > we want to extend this for SSL, it would be possible to
> > change
> > >> > just
> > >> > >> the
> > >> > >> > > > transport layer and make it work for SSL. We may never want
> > to
> > >> do
> > >> > >> this,
> > >> > >> > > but
> > >> > >> > > > thought I would point out anyway.
> > >> > >> > > >
> > >> > >> > > > I think the next step would be to add the low-level
> approach
> > to
> > >> > the
> > >> > >> KIP
> > >> > >> > > > under `Rejected Alternatives` with details on the
> > differences.
> > >> And
> > >> > >> more
> > >> > >> > > > detail on requirements covering latency and retries in the
> > >> body of
> > >> > >> the
> > >> > >> > > > KIP.  For "Rejected Alternatives", can we add sub-sections
> (I
> > >> > think
> > >> > >> > there
> > >> > >> > > > are two approaches there now, but you need to read through
> to
> > >> > figure
> > >> > >> > that
> > >> > >> > > > out, so sub-titles would help).
> > >> > >> > > >
> > >> > >> > > > Once the KIP is updated, it will be good to get more
> feedback
> > >> from
> > >> > >> the
> > >> > >> > > > community to decide on the approach to adopt.
> > >> > >> > > >
> > >> > >> > > >
> > >> > >> > > > On Sat, Sep 8, 2018 at 1:27 PM, Ron Dagostino <
> > >> rndg...@gmail.com>
> > >> > >> > wrote:
> > >> > >> > > >
> > >> > >> > > > > Hi yet again, Rajini.  If we decide that interleaving is
> > >> > >> impossible
> > >> > >> > > with
> > >> > >> > > > > the low-level approach but we still want to at least
> > support
> > >> the
> > >> > >> > > > > possibility given that it reduces the size of latency
> > spikes,
> > >> > >> then I
> > >> > >> > do
> > >> > >> > > > > have a suggestion.  I documented in the KIP how I
> rejected
> > a
> > >> > >> single,
> > >> > >> > > > > one-size-fits-all queue approach because we don't know
> when
> > >> > poll()
> > >> > >> > will
> > >> > >> > > > be
> > >> > >> > > > > called in the synchronous I/O use case.  An advantage of
> > >> such an
> > >> > >> > > approach
> > >> > >> > > > > -- if it would have worked, which it didn't -- is that it
> > >> would
> > >> > >> have
> > >> > >> > > been
> > >> > >> > > > > transparent to the owner of the NetworkClient instance
> > (this
> > >> is
> > >> > >> one
> > >> > >> > of
> > >> > >> > > > the
> > >> > >> > > > > big advantages of the low-level approach, after all).
> But
> > we
> > >> > >> could
> > >> > >> > > make
> > >> > >> > > > > the one-size-fits-all queue approach work if we are
> willing
> > >> to
> > >> > >> impose
> > >> > >> > > > some
> > >> > >> > > > > requirements on synchronous I/O users of NetworkClient.
> > >> > >> > Specifically,
> > >> > >> > > we
> > >> > >> > > > > could add a method to the
> > >> org.apache.kafka.clients.KafkaClient
> > >> > >> > > interface
> > >> > >> > > > > (which NetworkClient implements) as follows:
> > >> > >> > > > >
> > >> > >> > > > >     /**
> > >> > >> > > > >
> > >> > >> > > > >      * Return true if any node has re-authentication
> > requests
> > >> > >> waiting
> > >> > >> > > to
> > >> > >> > > > be
> > >> > >> > > > > sent. A
> > >> > >> > > > >
> > >> > >> > > > >      * call to {@link #poll(long, long)} is required to
> > send
> > >> > such
> > >> > >> > > > requests.
> > >> > >> > > > > An owner
> > >> > >> > > > >
> > >> > >> > > > >      * of this instance that does not implement a run
> loop
> > to
> > >> > >> > > repeatedly
> > >> > >> > > > > call
> > >> > >> > > > >
> > >> > >> > > > >      * {@link #poll(long, long)} but instead only sends
> > >> requests
> > >> > >> > > > > synchronously
> > >> > >> > > > >
> > >> > >> > > > >      * on-demand must call this method periodically --
> and
> > >> > invoke
> > >> > >> > > > >
> > >> > >> > > > >      * {@link #poll(long, long)} if the return value is
> > >> {@code
> > >> > >> true}
> > >> > >> > --
> > >> > >> > > > to
> > >> > >> > > > > ensure
> > >> > >> > > > >
> > >> > >> > > > >      * that any re-authentication requests that have ben
> > >> > injected
> > >> > >> are
> > >> > >> > > > sent
> > >> > >> > > > > in a
> > >> > >> > > > >
> > >> > >> > > > >      * timely fashion.
> > >> > >> > > > >
> > >> > >> > > > >      *
> > >> > >> > > > >
> > >> > >> > > > >      * @return true if any node has re-authentication
> > >> requests
> > >> > >> > waiting
> > >> > >> > > to
> > >> > >> > > > > be sent
> > >> > >> > > > >
> > >> > >> > > > >      */
> > >> > >> > > > >
> > >> > >> > > > >     default boolean hasReauthenticationRequests() {
> > >> > >> > > > >
> > >> > >> > > > >         return false;
> > >> > >> > > > >
> > >> > >> > > > >     }
> > >> > >> > > > >
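> > >> > > > > To illustrate the contract, a synchronous owner would then do
> > >> > > > > something like the following sketch (names invented for
> > >> > > > > illustration) before each of its own sends:
> > >> > > > >
> > >> > > > >     // Flush any injected re-authentication requests first
> > >> > > > >     if (networkClient.hasReauthenticationRequests())
> > >> > > > >         networkClient.poll(0, time.milliseconds());
> > >> > > > >     // ... then do the normal synchronous send/receive
> > >> > > > >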
> > >> > >> > > > > If we are willing to add an additional method like this
> and
> > >> make
> > >> > >> the
> > >> > >> > > > minor
> > >> > >> > > > > adjustments to the code associated with the few
> synchronous
> > >> use
> > >> > >> cases
> > >> > >> > > > then
> > >> > >> > > > > I think the high-level approach will work.  We would then
> > >> have
> > >> > the
> > >> > >> > > > > possibility of further parameterizing the various
> > synchronous
> > >> > I/O
> > >> > >> use
> > >> > >> > > > > cases.  For example, there are currently 3:
> > >> > >> > > > >
> > >> > >> > > > > ControllerChannelManager
> > >> > >> > > > > KafkaProducer, via Sender
> > >> > >> > > > > ReplicaFetcherBlockingSend, via ReplicaFetcherThread
> > >> > >> > > > >
> > >> > >> > > > > Is it conceivable that one of these use cases might be
> more
> > >> > >> > > > > latency-sensitive than others and might desire
> > >> interleaving?  A
> > >> > >> > > > high-level
> > >> > >> > > > > approach would give us the option of configuring each use
> > >> case
> > >> > >> > > > accordingly.
> > >> > >> > > > >
> > >> > >> > > > > Ron
> > >> > >> > > > >
> > >> > >> > > > > On Fri, Sep 7, 2018 at 10:50 PM Ron Dagostino <
> > >> > rndg...@gmail.com>
> > >> > >> > > wrote:
> > >> > >> > > > >
> > >> > >> > > > > > Hi again, Rajini.  It occurs to me that from a
> *behavior*
> > >> > >> > perspective
> > >> > >> > > > > > there are really 3 fundamental differences between the
> > >> > low-level
> > >> > >> > > > approach
> > >> > >> > > > > > you provided and the high-level approach as it
> currently
> > >> > exists
> > >> > >> in
> > >> > >> > > the
> > >> > >> > > > > PR:
> > >> > >> > > > > >
> > >> > >> > > > > > 1) *When re-authentication starts*.  The low-level
> > approach
> > >> > >> > initiates
> > >> > >> > > > > > re-authentication only if/when the connection is
> actually
> > >> > used,
> > >> > >> so
> > >> > >> > it
> > >> > >> > > > may
> > >> > >> > > > > > start after the existing credential expires; the
> current
> > PR
> > >> > >> > > > > implementation
> > >> > >> > > > > > injects re-authentication requests into the existing
> > flow,
> > >> and
> > >> > >> > > > > > re-authentication starts immediately regardless of
> > whether
> > >> or
> > >> > >> not
> > >> > >> > the
> > >> > >> > > > > > connection is being used for something else.
> > >> > >> > > > > >
> > >> > > > > > 2) *When requests not related to re-authentication can use
> > >> > > > > > the re-authenticating connection*.  The low-level approach
> > >> > > > > > finishes re-authentication completely before allowing anything
> > >> > > > > > else to traverse the connection; the current PR implementation
> > >> > > > > > interleaves re-authentication requests with existing flow
> > >> > > > > > requests.
> > >> > >> > > > > >
> > >> > >> > > > > > 3) *What happens if re-authentication fails*.  The
> > >> low-level
> > >> > >> > approach
> > >> > >> > > > > > results in a closed connection and does not support
> > >> retries --
> > >> > >> at
> > >> > >> > > least
> > >> > >> > > > > as
> > >> > >> > > > > > currently implemented; the current PR implementation
> > >> supports
> > >> > >> > > retries.
> > >> > >> > > > > >
> > >> > >> > > > > > Do you agree that these are the (only?) behavioral
> > >> > differences?
> > >> > >> > > > > >
> > >> > >> > > > > > For these facets of behavior, I wonder what the
> > >> requirements
> > >> > are
> > >> > >> > for
> > >> > >> > > > this
> > >> > >> > > > > > feature.  I think they are as follows:
> > >> > >> > > > > >
> > >> > >> > > > > > 1) *When re-authentication starts*: I don't think we
> > have a
> > >> > hard
> > >> > >> > > > > > requirement here when considered in isolation --
> whether
> > >> > >> > > > > re-authentication
> > >> > >> > > > > > starts immediately or is delayed until the connection
> is
> > >> used
> > >> > >> > > probably
> > >> > >> > > > > > doesn't matter.
> > >> > >> > > > > >
> > >> > >> > > > > > 2) *When requests not related to re-authentication can
> > use
> > >> the
> > >> > >> > > > > > re-authenticating connection*: there is a tradeoff here
> > >> > between
> > >> > >> > > latency
> > >> > >> > > > > > and ability to debug re-authentication problems.
> > Delaying
> > >> use
> > >> > >> of
> > >> > >> > the
> > >> > >> > > > > > connection until re-authentication finishes results in
> > >> bigger
> > >> > >> > latency
> > >> > >> > > > > > spikes but keeps re-authentication requests somewhat
> > >> together;
> > >> > >> > > > > interleaving
> > >> > >> > > > > > minimizes the size of individual latency spikes but
> adds
> > >> some
> > >> > >> > > > separation
> > >> > >> > > > > > between the requests.
> > >> > >> > > > > >
> > >> > >> > > > > > 3) *What happens if re-authentication fails*: I believe
> > we
> > >> > have
> > >> > >> a
> > >> > >> > > clear
> > >> > >> > > > > > requirement for retry capability given what I have
> > written
> > >> > >> > > previously.
> > >> > >> > > > > Do
> > >> > >> > > > > > you agree?  Note that while the current low-level
> > >> > implementation
> > >> > >> > does
> > >> > >> > > > not
> > >> > >> > > > > > support retry, I have been thinking about how that can
> be
> > >> > done,
> > >> > >> > and I
> > >> > >> > > > am
> > >> > >> > > > > > pretty sure it can be.  We can keep the old
> > authenticators
> > >> on
> > >> > >> the
> > >> > >> > > > client
> > >> > >> > > > > > and server sides and put them back into place if the
> > >> > >> > > re-authentication
> > >> > >> > > > > > fails; we would also have to make sure the server side
> > >> doesn't
> > >> > >> > delay
> > >> > >> > > > any
> > >> > >> > > > > > failed re-authentication result and also doesn't close
> > the
> > >> > >> > connection
> > >> > >> > > > > upon
> > >> > >> > > > > > re-authentication failure.  I think I see how to do all
> > of
> > >> > that.
> > >> > >> > > > > >
> > >> > >> > > > > > There are some interactions between the above
> > requirements.
> > >> > If
> > >> > >> > > > > > re-authentication can't start immediately and has to
> wait
> > >> for
> > >> > >> the
> > >> > >> > > > > > connection to be used then that precludes interleaving
> > >> because
> > >> > >> we
> > >> > >> > > can't
> > >> > >> > > > > be
> > >> > >> > > > > > sure that the credential will be active by the time it
> is
> > >> used
> > >> > >> --
> > >> > >> > if
> > >> > >> > > it
> > >> > >> > > > > > isn't active, and we allow interleaving, then requests
> > not
> > >> > >> related
> > >> > >> > to
> > >> > >> > > > > > re-authentication will fail if server-side
> > >> > >> > > > > > connection-close-due-to-expired-credential
> > functionality is
> > >> > in
> > >> > >> > place.
> > >> > >> > > > > Also
> > >> > >> > > > > > note that any such server-side connection-close-due-to-
> > >> > >> > > > > expired-credential
> > >> > >> > > > > > functionality would likely have to avoid closing a
> > >> connection
> > >> > >> until
> > >> > >> > > it
> > >> > >> > > > is
> > >> > >> > > > > > used for anything other than re-authentication -- it
> must
> > >> > allow
> > >> > >> > > > > > re-authentication requests through when the credential
> is
> > >> > >> expired.
> > >> > >> > > > > >
> > >> > >> > > > > > Given all of the above, it feels to me like the
> low-level
> > >> > >> solution
> > >> > >> > is
> > >> > >> > > > > > viable only under the following conditions:
> > >> > >> > > > > >
> > >> > >> > > > > > 1) We must accept a delay in when re-authentication
> > occurs
> > >> to
> > >> > >> when
> > >> > >> > > the
> > >> > >> > > > > > connection is used
> > >> > >> > > > > > 2) We must accept the bigger latency spikes associated
> > with
> > >> > not
> > >> > >> > > > > > interleaving requests
> > >> > >> > > > > >
> > >> > >> > > > > > Does this sound right to you, and if so, do you find
> > these
> > >> > >> > conditions
> > >> > >> > > > > > acceptable?  Or have I missed something and/or made
> > >> incorrect
> > >> > >> > > > assumptions
> > >> > >> > > > > > somewhere?
> > >> > >> > > > > >
> > >> > >> > > > > > Ron
> > >> > >> > > > > >
> > >> > >> > > > > >
> > >> > >> > > > > > On Fri, Sep 7, 2018 at 5:19 PM Ron Dagostino <
> > >> > rndg...@gmail.com
> > >> > >> >
> > >> > >> > > > wrote:
> > >> > >> > > > > >
> > >> > >> > > > > >> Hi Rajini.  The code really helped me to understand
> what
> > >> you
> > >> > >> were
> > >> > >> > > > > >> thinking -- thanks.  I'm still digesting, but here are
> > >> some
> > >> > >> quick
> > >> > >> > > > > >> observations:
> > >> > >> > > > > >>
> > >> > >> > > > > >> Your approach (I'll call it the "low-level" approach,
> as
> > >> > >> compared
> > >> > >> > to
> > >> > >> > > > the
> > >> > >> > > > > >> existing PR, which works at a higher level of
> > >> abstraction) --
> > >> > >> the
> > >> > >> > > > > low-level
> > >> > >> > > > > >> approach is certainly intriguing.  The smaller code
> size
> > >> is
> > >> > >> > welcome,
> > >> > >> > > > of
> > >> > >> > > > > >> course, as is the fact that re-authentication simply
> > works
> > >> > for
> > >> > >> > > > everyone
> > >> > >> > > > > >> regardless of the style of use (async vs. sync I/O).
> > >> > >> > > > > >>
> > >> > >> > > > > >> I did notice that re-authentication of the connection
> > >> starts
> > >> > >> only
> > >> > >> > > > > if/when
> > >> > >> > > > > >> the client uses the connection.  For example, if I
> run a
> > >> > >> console
> > >> > >> > > > > producer,
> > >> > >> > > > > >> re-authentication does not happen unless I try to
> > produce
> > >> > >> > something.
> > >> > >> > > > On
> > >> > >> > > > > >> the one hand this is potentially good -- if the client
> > >> isn't
> > >> > >> using
> > >> > >> > > the
> > >> > >> > > > > >> connection then the connection stays "silent" and
> could
> > be
> > >> > >> closed
> > >> > >> > on
> > >> > >> > > > the
> > >> > >> > > > > >> server side if it stays idle long enough -- whereas if
> > we
> > >> > start
> > >> > >> > > > > injecting
> > >> > >> > > > > >> re-authentication requests (as is done in the
> high-level
> > >> > >> approach)
> > >> > >> > > > then
> > >> > >> > > > > the
> > >> > >> > > > > >> connection never goes completely silent and could (?)
> > >> > >> potentially
> > >> > >> > > > avoid
> > >> > >> > > > > >> being closed due to idleness.
> > >> > >> > > > > >>
> > >> > > > > >> However, if we implement server-side kill of
> connections
> > >> using
> > >> > >> > > expired
> > >> > >> > > > > >> credentials (which we agree is needed), then we might
> > end
> > >> up
> > >> > >> with
> > >> > >> > > the
> > >> > >> > > > > >> broker killing connections that are sitting idle for
> > only
> > >> a
> > >> > >> short
> > >> > >> > > > > period of
> > >> > >> > > > > >> time.  For example, if we refresh the token on the
> > client
> > >> > side
> > >> > >> and
> > >> > >> > > > tell
> > >> > >> > > > > the
> > >> > >> > > > > >> connection that it is eligible to be re-authenticated,
> > >> then
> > >> > it
> > >> > >> is
> > >> > >> > > > > >> conceivable that the connection might be sitting idle
> at
> > >> that
> > >> > >> > point
> > >> > >> > > > and
> > >> > >> > > > > >> might not be used until after the token it is
> currently
> > >> using
> > >> > >> > > expires.
> > >> > >> > > > > The
> > >> > >> > > > > >> server might kill the connection, and that would force
> > the
> > >> > >> client
> > >> > >> > to
> > >> > >> > > > > >> re-connect with a new connection (requiring TLS
> > >> negotiation).
> > >> > >> The
> > >> > >> > > > > >> probability of this happening increases as the token
> > >> lifetime
> > >> > >> > > > > decreases, of
> > >> > >> > > > > >> course, and it can be offset by decreasing the window
> > >> factor
> > >> > >> (i.e.
> > >> > >> > > > make
> > >> > >> > > > > it
> > >> > > > > >> eligible to re-authenticate at 50% of the lifetime
> > rather
> > >> > than
> > >> > >> > 80%,
> > >> > >> > > > for
> > >> > >> > > > > >> example -- it would have to sit idle for longer before
> > the
> > >> > >> server
> > >> > >> > > > tried
> > >> > >> > > > > to
> > >> > >> > > > > >> kill it).  We haven't implemented server-side kill
> yet,
> > so
> > >> > >> maybe
> > >> > >> > we
> > >> > >> > > > > could
> > >> > >> > > > > >> make it intelligent and only kill the connection if it
> > is
> > >> > used
> > >> > >> > (for
> > >> > >> > > > > >> anything except re-authentication) after the
> expiration
> > >> > time...
> > >> > >> > > > > >>
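> > >> > > > > >> For what it's worth, the window-factor arithmetic I have in
> > >> > > > > >> mind is just the following sketch, with invented names:
> > >> > > > > >>
> > >> > > > > >>     // Credential valid on [startMs, expireMs); eligible for
> > >> > > > > >>     // re-authentication once windowFactor (e.g. 0.8) of the
> > >> > > > > >>     // lifetime has elapsed.
> > >> > > > > >>     long reauthEligibleAtMs(long startMs, long expireMs,
> > >> > > > > >>                             double windowFactor) {
> > >> > > > > >>         return startMs
> > >> > > > > >>                 + (long) ((expireMs - startMs) * windowFactor);
> > >> > > > > >>     }
> > >> > > > > >>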
> > >> > >> > > > > >> I also wonder about the ability to add retry into the
> > >> > low-level
> > >> > >> > > > > >> approach.  Do you think it would be possible?  It
> > doesn't
> > >> > seem
> > >> > >> > like
> > >> > >> > > it
> > >> > >> > > > > to
> > >> > >> > > > > >> me -- at least not without some big changes that would
> > >> > >> eliminate
> > >> > >> > > some
> > >> > >> > > > of
> > >> > >> > > > > >> the advantage of being able to reuse the existing
> > >> > >> authentication
> > >> > >> > > code.
> > >> > >> > > > > The
> > >> > >> > > > > >> reason I ask is because I think retry is necessary.
> It
> > is
> > >> > >> part of
> > >> > >> > > how
> > >> > >> > > > > >> refreshes work for both GSSAPI and OAUTHBEARER -- they
> > >> > refresh
> > >> > >> > based
> > >> > >> > > > on
> > >> > >> > > > > >> some window factor (i.e. 80%) and implement periodic
> > retry
> > >> > upon
> > >> > >> > > > failure
> > >> > >> > > > > so
> > >> > >> > > > > >> that they can maximize the chances of having a new
> > >> credential
> > >> > >> > > > available
> > >> > >> > > > > for
> > >> > >> > > > > >> any new connection attempt.  Without refresh we could
> > end
> > >> up
> > >> > in
> > >> > >> > the
> > >> > >> > > > > >> situation where the connection still has some lifetime
> > >> left
> > >> > >> (10%,
> > >> > >> > > 20%,
> > >> > >> > > > > or
> > >> > >> > > > > >> whatever) but it tries to re-authenticate and cannot
> > >> through
> > >> > no
> > >> > >> > > fault
> > >> > >> > > > of
> > >> > >> > > > > >> its own (i.e. token endpoint down, some Kerberos
> > failure,
> > >> > etc.)
> > >> > >> > -=-
> > >> > >> > > > the
> > >> > >> > > > > >> connection is closed at that point, and it is then
> > unable
> > >> to
> > >> > >> > > reconnect
> > >> > >> > > > > >> because of the same temporary problem.  We could end
> up
> > >> with
> > >> > an
> > >> > >> > > > > especially
> > >> > >> > > > > >> ill-timed, temporary outage in some non-Kafka system
> > >> (related
> > >> > >> to
> > >> > >> > > OAuth
> > >> > >> > > > > or
> > >> > >> > > > > >> Kerberos, or some LDAP directory) causing all clients
> to
> > >> be
> > >> > >> kicked
> > >> > >> > > off
> > >> > >> > > > > the
> > >> > >> > > > > >> cluster.  Retry capability seems to me to be the way
> to
> > >> > >> mitigate
> > >> > >> > > this
> > >> > >> > > > > risk.
> > >> > >> > > > > >>
> > >> > >> > > > > >> Anyway, that's it for now.  I really like the approach
> > you
> > >> > >> > outlined
> > >> > >> > > --
> > >> > >> > > > > at
> > >> > >> > > > > >> least at this point based on my current understanding.
> > I
> > >> > will
> > >> > >> > > > continue
> > >> > >> > > > > to
> > >> > >> > > > > >> dig in, and I may send more comments/questions.  But
> for
> > >> > now, I
> > >> > >> > > think
> > >> > >> > > > > the
> > >> > >> > > > > >> lack of retry -- and my definitely-could-be-wrong
> sense
> > >> that
> > >> > it
> > >> > >> > > can't
> > >> > >> > > > > >> easily be added -- is my biggest concern with this
> > >> low-level
> > >> > >> > > approach.
> > >> > >> > > > > >>
> > >> > >> > > > > >> Ron
> > >> > >> > > > > >>
> > >> > >> > > > > >> On Thu, Sep 6, 2018 at 4:57 PM Rajini Sivaram <
> > >> > >> > > > rajinisiva...@gmail.com>
> > >> > >> > > > > >> wrote:
> > >> > >> > > > > >>
> > >> > >> > > > > >>> Hi Ron,
> > >> > >> > > > > >>>
> > >> > >> > > > > >>> The commit
> > >> > >> > > > > >>>
> > >> > >> > > > > >>> https://github.com/rajinisivaram/kafka/commit/
> > >> > >> > > > > b9d711907ad843c11d17e80d6743bfb1d4e3f3fd
> > >> > >> > > > > >>> shows
> > >> > >> > > > > >>> the kind of flow I was thinking of. It is just a
> > >> prototype
> > >> > >> with a
> > >> > >> > > > fixed
> > >> > >> > > > > >>> re-authentication period to explore the possibility
> of
> > >> > >> > implementing
> > >> > >> > > > > >>> re-authentication similar to authentication. There
> will
> > >> be
> > >> > >> edge
> > >> > >> > > cases
> > >> > >> > > > > to
> > >> > >> > > > > >>> cover and errors to handle, but hopefully the code
> > makes
> > >> the
> > >> > >> > > approach
> > >> > >> > > > > >>> clearer than my earlier explanation!
> > >> > >> > > > > >>>
> > >> > >> > > > > >>> So the differences in the two implementations as you
> > have
> > >> > >> already
> > >> > >> > > > > >>> mentioned
> > >> > >> > > > > >>> earlier.
> > >> > >> > > > > >>>
> > >> > >> > > > > >>>    1. Re-authentication sequences are not interleaved
> > >> with
> > >> > >> Kafka
> > >> > >> > > > > >>> requests.
> > >> > >> > > > > >>>    As you said, this has a higher impact on latency.
> > IMO,
> > >> > this
> > >> > >> > also
> > >> > >> > > > > >>> makes it
> > >> > >> > > > > >>>    easier to debug, especially with complex protocols
> > >> like
> > >> > >> > Kerberos
> > >> > >> > > > > >>> which are
> > >> > >> > > > > >>>    notoriously difficult to diagnose.
> > >> > >> > > > > >>>    2. Re-authentication failures will not be retried,
> > >> they
> > >> > >> will
> > >> > >> > be
> > >> > >> > > > > >>> treated
> > >> > >> > > > > >>>    as fatal errors similar to authentication
> failures.
> > >> IMO,
> > >> > >> since
> > >> > >> > > we
> > >> > >> > > > > >>> rely on
> > >> > >> > > > > >>>    brokers never rejecting valid authentication
> > requests
> > >> > >> (clients
> > >> > >> > > > treat
> > >> > >> > > > > >>>    authentication failures as fatal errors), it is ok
> > to
> > >> > fail
> > >> > >> on
> > >> > >> > > > > >>>    re-authentication failure as well.
> > >> > >> > > > > >>>    3. Re-authentication is performed on the network
> > >> thread
> > >> > on
> > >> > >> > > brokers
> > >> > >> > > > > >>>    similar to authentication (in your
> implementation, I
> > >> > think
> > >> > >> it
> > >> > >> > > > would
> > >> > >> > > > > >>> be on
> > >> > > > > >>>    the request thread). IMO, it is better to do all
> > >> > >> authentications
> > >> > >> > > > using
> > >> > >> > > > > >>> the
> > >> > >> > > > > >>>    same thread pool.
> > >> > >> > > > > >>>    4. Code complexity: I don't think actual code
> > >> complexity
> > >> > >> would
> > >> > >> > > be
> > >> > >> > > > > very
> > >> > >> > > > > >>>    different between the two implementations. But I
> > think
> > >> > >> there
> > >> > >> > is
> > >> > >> > > > > value
> > >> > >> > > > > >>> in
> > >> > >> > > > > >>>    keeping re-authentication code within existing
> > >> > >> > network/security
> > >> > >> > > > > >>> layers. The
> > >> > >> > > > > >>>    number of classes modified will be smaller and the
> > >> number
> > >> > >> of
> > >> > >> > > > > packages
> > >> > >> > > > > >>>    modified even smaller.
> > >> > >> > > > > >>>
> > >> > >> > > > > >>> Anyway, let me know what you think:
> > >> > >> > > > > >>>
> > >> > >> > > > > >>>    - Will this approach work for your scenarios?
> > >> > >> > > > > >>>    - Do we really need to handle re-authentication
> > >> > differently
> > >> > >> > from
> > >> > >> > > > > >>>    authentication?
> > >> > >> > > > > >>>
> > >> > >> > > > > >>>
> > >> > >> > > > > >>> On Thu, Sep 6, 2018 at 1:40 PM, Ron Dagostino <
> > >> > >> rndg...@gmail.com
> > >> > >> > >
> > >> > >> > > > > wrote:
> > >> > >> > > > > >>>
> > >> > >> > > > > >>> > Yeah, regarding ControllerChannelManager, it is one
> > of
> > >> the
> > >> > >> > > > > synchronous
> > >> > >> > > > > >>> I/O
> > >> > >> > > > > >>> > use cases (along with 2 others: KafkaProducer, via
> > >> Sender;
> > >> > >> and
> > >> > >> > > > > >>> > ReplicaFetcherBlockingSend, via
> ReplicaFetcherThread)
> > >> > where
> > >> > >> the
> > >> > >> > > > > >>> assumption
> > >> > >> > > > > >>> > is complete ownership of the connection. The PR's
> > >> approach
> > >> > >> to
> > >> > >> > > > dealing
> > >> > >> > > > > >>> with
> > >> > >> > > > > >>> > that assumption is to inject the re-authentication
> > >> > requests
> > >> > >> > into
> > >> > >> > > > the
> > >> > >> > > > > >>> > owner's existing flow so that they are simply sent
> > >> with a
> > >> > >> > > callback
> > >> > >> > > > > that
> > >> > >> > > > > >>> > NetworkClient executes.  The alternative approach,
> > >> which
> > >> > is
> > >> > >> > what
> > >> > >> > > I
> > >> > >> > > > > >>> think
> > >> > >> > > > > >>> > you are investigating, is to allow the connection
> to
> > be
> > >> > >> > > temporarily
> > >> > >> > > > > >>> taken
> > >> > >> > > > > >>> > offline and having any attempt by
> > >> ControllerChannelManager
> > >> > >> (or
> > >> > >> > > > other
> > >> > >> > > > > >>> > synchronous use case owner) to use the connection
> > >> while it
> > >> > >> is
> > >> > >> > in
> > >> > >> > > > this
> > >> > >> > > > > >>> > "offline" state result in some kind of pause until
> > the
> > >> > >> > connection
> > >> > >> > > > > comes
> > >> > >> > > > > >>> > back online.  One issue with this approach might be
> > the
> > >> > >> length
> > >> > >> > of
> > >> > >> > > > > time
> > >> > >> > > > > >>> that
> > >> > >> > > > > >>> > the connection is unavailable; will it be offline
> for
> > >> all
> > >> > >> > > > > >>> authentication
> > >> > >> > > > > >>> > requests and responses
> (ApiVersionsRequest/Response,
> > >> > >> > > > > >>> > SaslHandshakeRequest/Response, and
> > >> > SaslAuthenticateRequest/
> > >> > >> > > > > Response)?
> > >> > >> > > > > >>> > Note
> > >> > >> > > > > >>> > the last one could actually be invoked multiple
> > times,
> > >> so
> > >> > >> there
> > >> > >> > > > could
> > >> > >> > > > > >>> be 4
> > >> > >> > > > > >>> > or more round-trips before the authentication
> "dance"
> > >> is
> > >> > >> > > finished.
> > >> > >> > > > > >>> Will
> > >> > >> > > > > >>> > the connection be "offline" the entire time, or
> will
> > >> it be
> > >> > >> > placed
> > >> > >> > > > > back
> > >> > >> > > > > >>> > "online" in between each request/response pair to
> > allow
> > >> > the
> > >> > >> > owner
> > >> > >> > > > of
> > >> > >> > > > > >>> the
> > >> > >> > > > > >>> > connection to use it -- in which case the
> > >> authentication
> > >> > >> > process
> > >> > >> > > > > would
> > >> > >> > > > > >>> have
> > >> > >> > > > > >>> > to wait to get ownership again?  The approach I
> took
> > >> > >> > interleaves
> > >> > >> > > > the
> > >> > >> > > > > >>> > authentication requests/responses with whatever the
> > >> owner
> > >> > is
> > >> > >> > > doing,
> > >> > >> > > > > so
> > >> > >> > > > > >>> it
> > >> > >> > > > > >>> > is conceivable that use of the connection jumps
> back
> > >> and
> > >> > >> forth
> > >> > >> > > > > between
> > >> > >> > > > > >>> the
> > >> > >> > > > > >>> > two purposes.  Such jumping back and forth
> minimizes
> > >> any
> > >> > >> added
> > >> > >> > > > > latency
> > >> > >> > > > > >>> due
> > >> > >> > > > > >>> > to the re-authentication, of course.
> > >> > >> > > > > >>> >
> > >> > >> > > > > >>> > Anyway, I'll look forward to finding out what you
> are
> > >> able
> > >> > >> to
> > >> > >> > > > > conclude.
> > >> > >> > > > > >>> > Good luck :-)
> > >> > >> > > > > >>> >
> > >> > >> > > > > >>> > Ron
> > >> > >> > > > > >>> >
> > >> > >> > > > > >>> > On Thu, Sep 6, 2018 at 5:17 AM Rajini Sivaram <
> > >> > >> > > > > rajinisiva...@gmail.com
> > >> > >> > > > > >>> >
> > >> > >> > > > > >>> > wrote:
> > >> > >> > > > > >>> >
> > >> > >> > > > > >>> > > Hi Ron,
> > >> > >> > > > > >>> > >
> > >> > >> > > > > >>> > > Disconnections on the broker-side: I think we
> > should
> > >> do
> > >> > >> > > > > >>> disconnections
> > >> > >> > > > > >>> > as a
> > >> > >> > > > > >>> > > separate KIP and PR as you originally intended.
> But
> > >> that
> > >> > >> one
> > >> > >> > > > could
> > >> > >> > > > > be
> > >> > >> > > > > >>> > done
> > >> > >> > > > > >>> > > separately without requiring KIP-368 as a
> pre-req.
> > >> As a
> > >> > >> > simpler
> > >> > >> > > > > >>> > > implementation and one that can be used without
> > >> KIP-368
> > >> > in
> > >> > >> > some
> > >> > >> > > > > >>> cases, we
> > >> > >> > > > > >>> > > could commit that first since this one may take
> > >> longer.
> > >> > >> > > > > >>> > >
> > >> > >> > > > > >>> > > I wasn't suggesting that we do move
> > >> re-authentication to
> > >> > >> > > > > >>> NetworkClient, I
> > >> > >> > > > > >>> > > was thinking of the lower layers, handling
> > >> > authentication
> > >> > >> and
> > >> > >> > > > > >>> > > reauthentication at the same layer in a similar
> > way.
> > >> Let
> > >> > >> me
> > >> > >> > > look
> > >> > >> > > > > >>> into the
> > >> > >> > > > > >>> > > code and come up with a more detailed explanation
> > to
> > >> > avoid
> > >> > >> > > > > confusion.
> > >> > >> > > > > >>> > >
> > >> > >> > > > > >>> > > I am not too concerned about the imports in
> > >> > KafkaChannel.
> > >> > >> As
> > >> > >> > > you
> > >> > >> > > > > >>> said, we
> > >> > >> > > > > >>> > > can improve that if we need to. KafkaChannels are
> > >> aware
> > >> > of
> > >> > >> > > > > >>> > > network/authentication states and if that
> becomes a
> > >> bit
> > >> > >> more
> > >> > >> > > > > >>> complex, I
> > >> > >> > > > > >>> > > don't think it would matter so much. My concern
> is
> > >> about
> > >> > >> > > changes
> > >> > >> > > > > like
> > >> > >> > > > > >>> > >
> > >> > >> > > > > >>> > >
> > >> https://github.com/apache/kafka/pull/5582/files#diff-
> > >> > >> > > > > >>> > 987fef43991384a3ebec5fb55e53b577
> > >> > >> > > > > >>> > > in ControllerChannelManager. Those classes
> > shouldn't
> > >> > have
> > >> > >> > deal
> > >> > >> > > > with
> > >> > >> > > > > >>> SASL
> > >> > >> > > > > >>> > or
> > >> > >> > > > > >>> > > reauthentication. Anyway, I will get back with
> more
> > >> > >> detail on
> > >> > >> > > > what
> > >> > >> > > > > I
> > >> > >> > > > > >>> had
> > >> > >> > > > > >>> > in
> > >> > >> > > > > >>> > > mind since that may not be viable at all.
> > >> > >> > > > > >>> > >
> > >> > >> > > > > >>> > >
> > >> > >> > > > > >>> > >
> > >> > >> > > > > >>> > > On Thu, Sep 6, 2018 at 1:44 AM, Ron Dagostino <
> > >> > >> > > rndg...@gmail.com
> > >> > >> > > > >
> > >> > >> > > > > >>> wrote:
> > >> > >> > > > > >>> > >
> > >> > >> > > > > >>> > > > I just thought of another alternative if the
> > >> imports
> > >> > are
> > >> > >> > the
> > >> > >> > > > > >>> concern.
> > >> > >> > > > > >>> > > > KafkaChannel could expose the fact that it can
> > >> create
> > >> > an
> > >> > >> > > > > additional
> > >> > >> > > > > >>> > > > Authenticator instance on the side (what I
> > >> referred to
> > >> > >> as
> > >> > >> > > > > >>> > > > notYetAuthenticatedAuthenticator in the PR) and
> > it
> > >> > could
> > >> > >> > let
> > >> > >> > > > > >>> > > > kafka.server.KafkaApis drive the whole thing --
> > >> create
> > >> > >> the
> > >> > >> > > > > >>> instance on
> > >> > >> > > > > >>> > > the
> > >> > >> > > > > >>> > > > side, clean it up if it fails, move it into
> place
> > >> and
> > >> > >> close
> > >> > >> > > the
> > >> > >> > > > > >>> old one
> > >> > >> > > > > >>> > > if
> > >> > >> > > > > >>> > > > it succeeds, etc.  Then Kaf
