Minor correction: I'm proposing "connections.max.reauth.ms" as the broker-side configuration property, not "connections.max.expired.credentials.ms".
Ron

On Fri, Sep 14, 2018 at 8:40 PM Ron Dagostino <rndg...@gmail.com> wrote: > Hi Rajini. I'm going to choose *connections.max.expired.credentials.ms* as the option name > because it is consistent with the comment I made in the section about how > to get the client and server to agree on credential lifetime: > > "We could add a new API call so that clients could ask servers for the > lifetime they use, or we could extend the SaslHandshakeRequest/Response API > call to include that information in the server's response – the client > would then adopt that value" > > > We set the config option value on the broker (with the " > listener.name.mechanism." prefix), and we will return the configured > value in the SaslHandshakeResponse (requiring a wire format change in > addition to a version bump). The value (referred to as "X" below) can be > negative, zero, or positive and is to be interpreted as follows: > > X = 0: Fully disable. The server will respond to re-authentications, but > it won't kill any connections due to expiration, and it won't track either > of the 2 metrics mentioned below. > > Now, a couple of definitions for when X != 0: > > 1) We define the *maximum allowed expiration time* to be the time > determined by the point when (re-)authentication occurs plus |X| > milliseconds. > 2) We define the *requested expiration time* to be the maximum allowed > expiration time except for the case of an OAuth bearer token, in which case > it is the point at which the token expires. (Kerberos presumably also > specifies a ticket lifetime, but frankly I am not in a position to do any > Kerberos-related coding in time for a 2.1.0 release and would prefer to > ignore this piece of information for this KIP -- would it be acceptable to > have someone else add it later?). 
> > Based on these definitions, we define the behavior as follows: > > *X > 0: Fully enable* > A) The server will reject any authentication/re-authentication attempt > when the requested expiration time is after the maximum allowed expiration > time (which could only happen with an OAuth bearer token, assuming we skip > dealing with Kerberos for now). > B) The server will kill connections that are used beyond the requested > expiration time. > C) A broker metric will be maintained that documents the number of > connections killed by the broker. This metric will be non-zero if a client > is connecting to the broker with re-authentication either unavailable (i.e. > an older client) or disabled. > > *X < 0: Partially enable* > A) Same as above: the server will reject any > authentication/re-authentication attempt when the requested expiration time > is after the maximum allowed expiration > time (which could only happen with > an OAuth bearer token, assuming we skip dealing with Kerberos for now). > B) The server will **not** kill connections that are used beyond the > requested expiration time. > C) A broker metric will be maintained that documents the number of API > requests unrelated to re-authentication that are made over a connection > beyond the requested expiration time. This metric will be helpful for > operators to ensure that all clients are properly upgraded and > re-authenticating before fully enabling the server-side > expired-connection-kill functionality (i.e. by changing the value from a > negative number to a positive number): this metric will be zero across all > brokers when it is safe to fully enable the server-side feature. > > Thoughts? > > Ron > > On Fri, Sep 14, 2018 at 11:59 AM Rajini Sivaram <rajinisiva...@gmail.com> > wrote: > >> It will be good to shorten the config name. We have a config named ` >> connections.max.idle.ms`. We could add something like ` >> connections.max.expired.credentials.ms`. 
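The X semantics proposed above can be summarized in a small model. This is illustrative Python, not Kafka code, and every function name here is hypothetical; it only restates the rules from the thread (reject when the requested expiration exceeds the maximum allowed, kill only when X > 0).

```python
# Toy model of the proposed broker-side semantics for the configured value X.
# Illustrative only -- these names are not part of any Kafka API.

def max_allowed_expiration(auth_time_ms, x_ms):
    """(Re-)authentication time plus |X| milliseconds (defined when X != 0)."""
    return auth_time_ms + abs(x_ms)

def requested_expiration(auth_time_ms, x_ms, token_expiration_ms=None):
    """The token's own expiration for an OAuth bearer token, otherwise the
    maximum allowed expiration."""
    if token_expiration_ms is not None:
        return token_expiration_ms
    return max_allowed_expiration(auth_time_ms, x_ms)

def authentication_accepted(auth_time_ms, x_ms, token_expiration_ms=None):
    """X != 0: reject any (re-)authentication whose requested expiration is
    after the maximum allowed expiration. X == 0: feature fully disabled."""
    if x_ms == 0:
        return True
    return (requested_expiration(auth_time_ms, x_ms, token_expiration_ms)
            <= max_allowed_expiration(auth_time_ms, x_ms))

def connection_killed(now_ms, auth_time_ms, x_ms, token_expiration_ms=None):
    """Only X > 0 ("fully enable") kills connections used beyond the
    requested expiration; X < 0 merely counts them in a metric."""
    if x_ms <= 0:
        return False
    return now_ms > requested_expiration(auth_time_ms, x_ms, token_expiration_ms)

# Example: X = 60s, token expires 30s after authentication at t = 0.
assert authentication_accepted(0, 60_000, token_expiration_ms=30_000)
assert not connection_killed(29_000, 0, 60_000, token_expiration_ms=30_000)
assert connection_killed(31_000, 0, 60_000, token_expiration_ms=30_000)
# X < 0 ("partially enable"): same rejection rule, but no kills.
assert not connection_killed(31_000, 0, -60_000, token_expiration_ms=30_000)
```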
As you said, it would be >> prefixed >> with listener and SASL mechanism name. We should be able to support >> connection termination in future even with SSL, so perhaps we don't want >> the `sasl.server.` prefix (we just need to validate whether the config is >> supported for the protocol). You can choose whether a boolean flag or a >> time interval makes more sense. >> >> For the client-side, the KIP explains how to make it work for other >> mechanisms and we can leave it at that. Not convinced about server-side >> though. It seems to me that we probably would require API changes to make >> it work. Basically the proposed approach works only for OAUTHBEARER. Since >> we have to bump up SaslHandshakeRequest version in this KIP, it will be >> good to work out if we need to change the request or flow to make this >> work >> for other mechanisms. I haven't figured out exactly what is needed, but >> will think about it and get back next week. In the meantime, you can get >> the KIP up-to-date with the config, migration path etc. and get it ready >> to >> start vote next week. >> >> >> >> >> >> On Fri, Sep 14, 2018 at 1:24 PM, Ron Dagostino <rndg...@gmail.com> wrote: >> >> > Hi Rajini. >> > >> > Agreed, we will bump the SaslHandshakeRequest API number (no wire format >> > change, of course). >> > >> > Agreed, responding to re-authentication will always be enabled on the >> > broker since it is initiated by the client. >> > >> > Agreed, connection termination by the broker will be opt-in. I'm >> thinking >> > *sasl.server.disconnect.expired.credential.connections.enable*, prefixed >> > with listener and SASL mechanism name in lower-case so it can be opt-in >> at >> > the granularity of a SASL mechanism; for example, " >> > listener.name.sasl_ssl.oauthbearer.sasl.server. >> > disconnect.expired.credential.connections.enable=true >> > ". It is long, so if you have a preference for a shorter name let me >> know >> > and we'll go with it instead. 
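As a concrete illustration of the prefixing scheme under discussion (listener name and SASL mechanism in lower case), a broker configuration entry might look like the following. The property name and value are examples drawn from this thread, not a finalized API, and the final name was still being debated at this point.

```properties
# Hypothetical broker config: opt-in at the granularity of the
# SASL_SSL listener and the OAUTHBEARER mechanism; value in milliseconds.
listener.name.sasl_ssl.oauthbearer.connections.max.expired.credentials.ms=3600000
```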
>> > >> > Agreed, additional metrics will be helpful. I'll add them to the KIP. >> > >> > <<<need to work out exactly how this works with all SASL mechanisms >> > <<<not sure we have the data required on the broker or a way of >> > <<< extending the [GSSAPI] mechanism >> > Not sure I agree here. The paragraph I added to the KIP describes how I >> > think this can be done. Given that description, do you still feel >> Kerberos >> > will not be possible? If Kerberos presents a significant hurdle then I >> > don't think that should prevent us from doing it for other mechanisms >> -- I >> > would rather state that it isn't supported with GSSAPI due to >> limitations >> > in Kerberos itself than not have it for OAUTHBEARER, for example. And >> > right now I don't think we need more than a high-level description of >> how >> > this could be done. In other words, I think we have this covered, >> unless >> > there is a fundamental problem due to Kerberos not making data (i.e. >> ticket >> > expiration) available on the server. I want to submit this KIP for a >> vote >> > early next week in the hope of having it accepted by the Monday, 9/24 >> > deadline for the 2.1.0 release, and if that happens I believe I can get >> a >> > PR into really good shape soon thereafter (I'm working on it now). >> > >> > Ron >> > >> > >> > >> > >> > >> > >> > On Fri, Sep 14, 2018 at 5:57 AM Rajini Sivaram <rajinisiva...@gmail.com >> > >> > wrote: >> > >> > > Hi Ron, >> > > >> > > 1) Yes, we should update the version of `SaslHandshakeRequest`. >> Clients >> > > would attempt to re-authenticate only if broker's SaslHandshakeRequest >> > > version >=2. >> > > 2) I think we should enable broker's re-authentication and connection >> > > termination code regardless of client version. Re-authentication >> could be >> > > always enabled since it is triggered only by clients and broker could >> > just >> > > handle it. 
Connection termination should be opt-in, but should work >> with >> > > clients of any version. If turned on and clients are of an older >> version, >> > > this would simply force reconnection, which should be ok. Additional >> > > metrics to monitor expired connections may be useful anyway. >> > > >> > > We also need to work out exactly how this works with all SASL >> mechanisms. >> > > In particular, we have to ensure that we can make it work with Kerberos >> > since >> > > we use the implementation from the JRE and I am not sure we have the >> data >> > > required on the broker or a way of extending the mechanism. >> > > >> > > Thoughts? >> > > >> > > On Thu, Sep 13, 2018 at 6:56 PM, Ron Dagostino <rndg...@gmail.com> >> > wrote: >> > > >> > > > Hi Rajini. I'm thinking about how we deal with migration. For >> > example, >> > > > let's say we have an existing 2.0.0 cluster using SASL/OAUTHBEARER >> and >> > we >> > > > want to use this feature. The desired end state is to have all >> clients >> > > and >> > > > brokers migrated to a version that supports the feature (call it >> 2.x) >> > > with >> > > > the feature turned on. We need to document the supported path(s) to >> > this >> > > > end state. >> > > > >> > > > Here are a couple of scenarios with implications: >> > > > >> > > > 1) *Migrate client to 2.x and turn the feature on but broker is >> still >> > at >> > > > 2.0.* In this case the client is going to get an error when it >> sends >> > the >> > > > SaslHandshakeRequest. We want to avoid this. It seems to me the >> > client >> > > > needs to know if the broker has been upgraded to the necessary >> version >> > > > before trying to re-authenticate, which makes me believe we need to >> > bump >> > > > the API version number in 2.x and the client has to check to see if >> the >> > > max >> > > > version the broker supports meets a minimum standard before trying >> to >> > > > re-authenticate. Do you agree? 
>> > > > >> > > > 2) *Migrate broker to 2.x and turn the feature on but client is >> still >> > at >> > > > 2.0*. In this case the broker is going to end up killing >> connections >> > > > periodically. Again, we want to avoid this. It's tempting to say >> > > "don't >> > > > do this" but I wonder if we can require installations to upgrade all >> > > > clients before turning the feature on at the brokers. Certainly we >> can >> > > say >> > > > "don't do this" for inter-broker clients -- in other words, we can >> say >> > > that >> > > > all brokers in a cluster should be upgraded before the feature is >> > turned >> > > on >> > > > for any one of them -- but I don't know about our ability to say it >> for >> > > > non-broker clients. >> > > > >> > > > Now we consider the cases where both the brokers and the clients >> have >> > > been >> > > > upgraded to 2.x. When and where should the feature be turned on? >> The >> > > > inter-broker case is interesting because I don't think we can >> > > require >> > > > installations to restart every broker with a new config where the >> > feature >> > > > is turned on at the same time. Do you agree that we cannot require >> > this >> > > > and therefore must support installations restarting brokers with a >> new >> > > > config at whatever pace they need -- which may be quite slow >> relative >> > to >> > > > token lifetimes? Assuming you do agree, then there is going to be >> the >> > > case >> > > > where some brokers are going to have the feature turned on and some >> > > clients >> > > > (definitely inter-broker clients at a minimum) are going to have the >> > > > feature turned off. 
>> > > > >> > > > The above discussion assumes a single config on the broker side that >> > > turns >> > > > on both the inter-broker client re-authentication feature as well as >> > the >> > > > server-side expired-connection-kill feature, but now I'm thinking we >> > need >> > > > the ability to turn these features on independently, plus maybe we >> > need a >> > > > way to monitor the number of "active, expired" connections to the >> > server >> > > so >> > > > that operators can be sure that all clients have been >> upgraded/enabled >> > > > before turning on the server-side expired-connection-kill feature. >> > > > >> > > > So the migration plan would be as follows: >> > > > >> > > > 1) Upgrade all brokers to 2.x. >> > > > 2) After all brokers are upgraded, turn on re-authentication for all >> > > > brokers at whatever rate is desired -- just eventually, at some >> point, >> > > get >> > > > the client-side feature turned on for all brokers so that >> inter-broker >> > > > connections are re-authenticating. >> > > > 3) In parallel with (1) and (2) above, upgrade clients to 2.x and >> turn >> > > > their re-authentication feature on. Clients will check the API >> version >> > > and >> > > > only re-authenticate to a broker that has also been upgraded (note >> that >> > > the >> > > > ability of a broker to respond to a re-authentication cannot be >> turned >> > > off >> > > > -- it is always on beginning with version 2.x, and it just sits >> there >> > > doing >> > > > nothing if it isn't exercised by an enabled client) >> > > > 4) After (1), (2), and (3) are complete, check the 2.x broker >> metrics >> > to >> > > > confirm that there are no "active, expired" connections. Once you >> are >> > > > satisfied that (1), (2), and (3) are indeed complete you can enable >> the >> > > > server-side expired-connection-kill feature on each broker via a >> > restart >> > > at >> > > > whatever pace is desired. >> > > > >> > > > Comments? 
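The client-side gating described in step (3) of the migration plan can be sketched as a tiny decision rule. This is an illustrative Python model, not Kafka client code; the function name is hypothetical, and version 2 is the assumed SaslHandshakeRequest version introduced by this KIP.

```python
# Toy model of step (3) above: a 2.x client re-authenticates only when the
# feature is enabled locally AND the broker advertises a high enough
# SaslHandshakeRequest version (as discovered via API version negotiation).

REAUTH_MIN_HANDSHAKE_VERSION = 2  # assumed version bump from this KIP

def should_reauthenticate(feature_enabled, broker_max_handshake_version):
    return (feature_enabled
            and broker_max_handshake_version >= REAUTH_MIN_HANDSHAKE_VERSION)

# A 2.x client never re-authenticates to a 2.0 broker (max version 1)...
assert not should_reauthenticate(True, 1)
# ...but does once the broker is upgraded and the client feature is on.
assert should_reauthenticate(True, 2)
# An upgraded broker alone does not trigger it; the client must opt in.
assert not should_reauthenticate(False, 2)
```

This mirrors why scenario (1) above is avoided: the version check makes re-authentication a no-op against brokers that have not yet been upgraded.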
>> > > > >> > > > Ron >> > > > >> > > > >> > > > On Wed, Sep 12, 2018 at 4:48 PM Ron Dagostino <rndg...@gmail.com> >> > wrote: >> > > > >> > > > > Ok, I am tempted to just say we go with the low-level approach >> since >> > it >> > > > is >> > > > > the quickest and seems to meet the clear requirements. We can >> always >> > > > adjust >> > > > > later if we get to clarity on other requirements or we decide we >> need >> > > to >> > > > > approach it differently for whatever reason. But in the meantime, >> > > before >> > > > > fully committing to this decision, I would appreciate another >> > > perspective >> > > > > if someone has one. >> > > > > >> > > > > Ron >> > > > > >> > > > > On Wed, Sep 12, 2018 at 3:15 PM Rajini Sivaram < >> > > rajinisiva...@gmail.com> >> > > > > wrote: >> > > > > >> > > > >> Hi Ron, >> > > > >> >> > > > >> Yes, I would leave out retries from this KIP for now. In the >> future, >> > > if >> > > > >> there is a requirement for supporting retries, we can consider >> it. I >> > > > think >> > > > >> we can support retries with either approach if we needed to, but >> it >> > > > would >> > > > >> be better to do it along with other changes required to support >> > > > >> authentication servers that are not highly available. >> > > > >> >> > > > >> For maintainability, I am biased, so it will be good to get the >> > > > >> perspective >> > > > >> of others in the community :-) >> > > > >> >> > > > >> On Wed, Sep 12, 2018 at 7:47 PM, Ron Dagostino < >> rndg...@gmail.com> >> > > > wrote: >> > > > >> >> > > > >> > Hi Rajini. Here is some feedback/some things I thought of. >> > > > >> > >> > > > >> > First, I just realized that from a timing perspective that I am >> > not >> > > > sure >> > > > >> > retry is going to be necessary. The background login refresh >> > thread >> > > > >> > triggers re-authentication when it refreshes the client's >> > > credential. 
>> > > > >> The >> > OAuth infrastructure has to be available in order for this >> refresh >> > > to >> > > > >> > succeed (the background thread repeatedly retries if it can't >> > > refresh >> > > > >> the >> > > > >> > credential, and that retry loop handles any temporary >> outage). If >> > > > >> clients >> > > > >> > are told to re-authenticate when the credential is refreshed >> **and >> > > > they >> > > > >> > actually re-authenticate at that point** then it is highly >> > unlikely >> > > > that >> > > > >> > the OAuth infrastructure would fail within those intervening >> > > > >> milliseconds. >> > > > >> > So we don't need re-authentication retry in this KIP as long as >> > > retry >> > > > >> > starts immediately. The low-level prototype as currently coded >> > > > doesn't >> > > > >> > actually start re-authentication until the connection is >> > > subsequently >> > > > >> used, >> > > > >> > and that could take a while. But then again, if the connection >> > > isn't >> > > > >> used >> > > > >> > for some period of time, then the lost value of having to >> abandon >> > > the >> > > > >> > connection is lessened anyway. Plus, as you pointed out, OAuth >> > > > >> > infrastructure is assumed to be highly available. >> > > > >> > >> > > > >> > Does this make sense, and would you be willing to say that >> retry >> > > > isn't >> > > > >> a >> > > > >> > necessary requirement? I'm tempted but would like to hear your >> > > > >> perspective >> > > > >> > first. >> > > > >> > >> > > > >> > Interleaving/latency and maintainability (more than lines of code) >> > > are >> > > > >> the >> > > > >> > two remaining areas of comparison. 
I did not add the >> > > maintainability >> > > > >> issue >> > > > >> > to the KIP yet, but before I do I thought I would address it >> here >> > > > first >> > > > >> to >> > > > >> > see if we can come to consensus on it because I'm not sure I >> see >> > the >> > > > >> > high-level approach as being hard to maintain (yet -- I could >> be >> > > > >> > convinced/convince myself; we'll see). I do want to make sure >> we >> > > are >> > > > on >> > > > >> > the same page about what is required to add re-authentication >> > > support >> > > > in >> > > > >> > the high-level case. Granted, the amount is zero in the >> low-level >> > > > case, >> > > > >> > and it doesn't get any better than that, but the amount in the >> > > > >> high-level >> > > > >> > case is very low -- just a few lines of code. For example: >> > > > >> > >> > > > >> > KafkaAdminClient: >> > > > >> > https://github.com/apache/kafka/pull/5582/commits/4fa70f38b9 >> > > > >> > d33428ff98b64a3a2bd668f5f28c38#diff- >> 6869b8fccf6b098cbcb0676e8ceb26a7 >> > > > >> > It is the same few lines of code for KafkaConsumer, >> KafkaProducer, >> > > > >> > WorkerGroupMember, and TransactionMarkerChannelManager >> > > > >> > >> > > > >> > The two synchronous I/O use cases are ControllerChannelManager >> and >> > > > >> > ReplicaFetcherBlockingSend (via ReplicaFetcherThread), and they >> > > > require >> > > > >> a >> > > > >> > little bit more -- but not much. >> > > > >> > >> > > > >> > Thoughts? >> > > > >> > >> > > > >> > Ron >> > > > >> > >> > > > >> > On Wed, Sep 12, 2018 at 1:57 PM Ron Dagostino < >> rndg...@gmail.com> >> > > > >> wrote: >> > > > >> > >> > > > >> > > Thanks, Rajini. Before I digest/respond to that, here's an >> > update >> > > > >> that I >> > > > >> > > just completed. >> > > > >> > > >> > > > >> > > I added a commit to the PR ( >> > > > >> https://github.com/apache/kafka/pull/5582/) >> > > > >> > > that implements server-side kill of expired OAUTHBEARER >> > > connections. 
>> > > > >> No >> > > > >> > > tests yet since we still haven't settled on a final approach >> > > > >> (low-level >> > > > >> > vs. >> > > > >> > > high-level). >> > > > >> > > >> > > > >> > > I also updated the KIP to reflect the latest discussion and >> PR >> > as >> > > > >> > follows: >> > > > >> > > >> > > > >> > > - Include support for brokers killing connections as part >> of >> > > this >> > > > >> KIP >> > > > >> > > (rather than deferring it to a future KIP as was >> originally >> > > > >> > mentioned; the >> > > > >> > > PR now includes it as mentioned -- it was very easy to >> add) >> > > > >> > > - Added metrics (they will mirror existing ones related to >> > > > >> > > authentications; I have not added those to the PR) >> > > > >> > > - Updated the implementation description to reflect the >> > current >> > > > >> state >> > > > >> > > of the PR, which is a high-level, one-size-fits-all >> approach >> > > (as >> > > > >> > opposed to >> > > > >> > > my initial, even-higher-level approach) >> > > > >> > > - Added a "Rejected Alternative" for the first version of >> the >> > > PR, >> > > > >> > > which injected requests directly into synchronous I/O >> > clients' >> > > > >> queues >> > > > >> > > - Added a "Rejected Alternative" for the low-level >> approach >> > as >> > > > >> > > suggested, but of course we have not formally decided to >> > reject >> > > > >> this >> > > > >> > > approach or adopt the current PR implementation. >> > > > >> > > >> > > > >> > > I'll think about where we stand some more before responding >> > again. >> > > > >> > Thanks >> > > > >> > > for the above reply. 
>> > > > >> > > >> > > Ron >> > > > >> > > >> > > > >> > > On Wed, Sep 12, 2018 at 1:36 PM Rajini Sivaram < >> > > > >> rajinisiva...@gmail.com> >> > > > >> > > wrote: >> > > > >> > > >> > > > >> > >> Hi Ron, >> > > > >> > >> >> > > > >> > >> Thank you for summarising, I think it covers the differences >> > > > between >> > > > >> the >> > > > >> > >> two approaches well. >> > > > >> > >> >> > > > >> > >> A few minor points to answer the questions in there: >> > > > >> > >> >> > > > >> > >> 1) When re-authentication is initiated in the Selector during >> > > > poll(), >> > > > >> we >> > > > >> > >> can >> > > > >> > >> move an idle channel to re-authentication state. It is >> similar >> > to >> > > > >> > >> injecting >> > > > >> > >> requests, but achieved by changing the channel back to >> > authenticating >> > > > >> state. >> > > > >> > >> >> > > > >> > >> 3) To clarify why I think re-authentication should fit in >> with >> > > our >> > > > >> > >> authentication design: My point was not about a specific >> > > connection >> > > > >> > being >> > > > >> > >> usable or not usable. It was about what happens at the >> client >> > API >> > > > >> level. >> > > > >> > >> Our client API (producer/consumer/admin client etc.) >> currently >> > > > assumes >> > > > >> > that >> > > > >> > >> a single broker authentication failure is a fatal error >> that is >> > > > never >> > > > >> > >> retried because we assume that broker only ever fails an >> > > > >> authentication >> > > > >> > >> request if credentials are invalid. If we ever decide to >> > support >> > > > >> cases >> > > > >> > >> where broker occasionally fails an authentication request >> due >> > to >> > > a >> > > > >> > >> transient failure, we need to do more around how we handle >> > > > >> > authentication >> > > > >> > >> failures in clients. 
We may decide that it is ok to close >> the >> > > > >> connection >> > > > >> > >> for authentication and not for re-authentication as you >> > > mentioned, >> > > > >> but >> > > > >> > we >> > > > >> > >> need to change the way this disconnection is handled by >> > clients. >> > > So >> > > > >> IMO, >> > > > >> > >> we >> > > > >> > >> should either add support for transient retriable >> > authentication >> > > > >> > failures >> > > > >> > >> properly or not retry for any scenario. Personally, I don't >> > think >> > > > we >> > > > >> > would >> > > > >> > >> want to retry all authentication failures even if it is a >> > > > >> > >> re-authentication, I think we could (at some point in >> future), >> > > > allow >> > > > >> > >> brokers to return an error code that indicates that it is a >> > > > transient >> > > > >> > >> broker-side failure rather than invalid credentials and >> handle >> > > the >> > > > >> error >> > > > >> > >> differently. I see no reason at that point why we wouldn't >> > handle >> > > > >> > >> authentication and re-authentication in the same way. >> > > > >> > >> >> > > > >> > >> 4) As you said, the high-level approach would be bigger than >> > the >> > > > >> > low-level >> > > > >> > >> approach in terms of LOC. But I wouldn't be too concerned >> about >> > > > >> lines of >> > > > >> > >> code. My bigger concern was about modularity. Our security >> code >> > > is >> > > > >> > already >> > > > >> > >> complex, protocols like Kerberos and SSL that we use from >> the >> > JRE >> > > > >> make >> > > > >> > >> problem diagnosis hard. Async I/O makes the networking code >> > > > complex. >> You >> > > > >> > >> need to understand the networking layer to work with the >> security >> > > > layer, >> > > > >> but >> > > > >> > >> the rest of the code base doesn't rely on knowledge of >> > > > >> network/security >> > > > >> > >> layers. 
My main concern about the high-level approach is >> that >> > it >> > > > >> spans >> > > > >> > >> these boundaries, making it harder to maintain in the long >> run. >> > > > >> > >> >> > > > >> > >> >> > > > >> > >> On Wed, Sep 12, 2018 at 10:23 AM, Stanislav Kozlovski < >> > > > >> > >> stanis...@confluent.io> wrote: >> > > > >> > >> >> > > > >> > >> > Hi Ron, Rajini >> > > > >> > >> > >> > > > >> > >> > Thanks for summarizing the discussion so far, Ron! >> > > > >> > >> > >> > > > >> > >> > 1) How often do we have such long-lived connection >> idleness >> > > (e.g >> > > > >> 5-10 >> > > > >> > >> > minutes) in practice? >> > > > >> > >> > >> > > > >> > >> > 3) I agree that retries for re-authentication are useful. >> > > > >> > >> > >> > > > >> > >> > 4) The interleaving of requests sounds like a great >> feature >> > to >> > > > >> have, >> > > > >> > but >> > > > >> > >> > the tradeoff against code complexity is a tough one. I >> would >> > > > >> > personally >> > > > >> > >> go >> > > > >> > >> > with the simpler approach since you could always add >> > > interleaving >> > > > >> on >> > > > >> > >> top if >> > > > >> > >> > the community decides the latency should be better. >> > > > >> > >> > >> > > > >> > >> > Best, >> > > > >> > >> > Stanislav >> > > > >> > >> > >> > > > >> > >> > On Tue, Sep 11, 2018 at 5:00 AM Ron Dagostino < >> > > rndg...@gmail.com >> > > > > >> > > > >> > >> wrote: >> > > > >> > >> > >> > > > >> > >> > > Hi everyone. I've updated the PR to reflect the latest >> > > > >> conclusions >> > > > >> > >> from >> > > > >> > >> > > this ongoing discussion. The KIP still needs the >> suggested >> > > > >> > updates; I >> > > > >> > >> > will >> > > > >> > >> > > do that later this week. I agree with Rajini that some >> > > > >> additional >> > > > >> > >> > > feedback from the community at large would be very >> helpful >> > at >> > > > >> this >> > > > >> > >> point >> > > > >> > >> > in >> > > > >> > >> > > time. 
>> > > > >> > >> > > >> > > > >> > >> > > Here's where we stand. >> > > > >> > >> > > >> > > > >> > >> > > We have 2 separate implementations for re-authenticating >> > > > existing >> > > > >> > >> > > connections -- a so-called "low-level" approach and a >> > > > (relatively >> > > > >> > >> > speaking) >> > > > >> > >> > > "high-level" approach -- and we agree on how they >> should be >> > > > >> > compared. >> > > > >> > >> > > Specifically, Rajini has provided a mechanism that works >> > at a >> > > > >> > >> relatively >> > > > >> > >> > > low level in the stack by intercepting network sends and >> > > > queueing >> > > > >> > >> them up >> > > > >> > >> > > while re-authentication happens; then the queued sends >> are >> > > sent >> > > > >> > after >> > > > >> > >> > > re-authentication succeeds, or the connection is closed >> if >> > > > >> > >> > > re-authentication fails. See the prototype commit at >> > > > >> > >> > > >> > > > >> > >> > > https://github.com/rajinisivaram/kafka/commit/ >> > b9d711907ad843 >> > > > >> > >> > c11d17e80d6743bfb1d4e3f3fd >> > > > >> > >> > > . >> > > > >> > >> > > >> > > > >> > >> > > I have an implementation that works higher up in the >> stack; >> > > it >> > > > >> > injects >> > > > >> > >> > > authentication requests into the KafkaClient via a new >> > method >> > > > >> added >> > > > >> > to >> > > > >> > >> > that >> > > > >> > >> > > interface, and the implementation (i.e. NetworkClient) >> > sends >> > > > them >> > > > >> > when >> > > > >> > >> > > poll() is called. The updated PR is available at >> > > > >> > >> > > https://github.com/apache/kafka/pull/5582/. >> > > > >> > >> > > >> > > > >> > >> > > Here is how we think these two approaches should be >> > compared. >> > > > >> > >> > > >> > > > >> > >> > > 1) *When re-authentication starts*. 
The low-level >> approach >> > > > >> > initiates >> > > > >> > >> > > re-authentication only if/when the connection is >> actually >> > > used, >> > > > >> so >> > > > >> > it >> > > > >> > >> may >> > > > >> > >> > > start after the existing credential expires; the >> current PR >> > > > >> > >> > implementation >> > > > >> > >> > > injects re-authentication requests into the existing >> flow, >> > > and >> > > > >> > >> > > re-authentication starts immediately regardless of >> whether >> > or >> > > > not >> > > > >> > the >> > > > >> > >> > > connection is being used for something else. Rajini >> > believes >> > > > the >> > > > >> > >> > low-level >> > > > >> > >> > > approach can be adjusted to re-authenticate idle >> > connections >> > > > >> > (Rajini, >> > > > >> > >> I >> > > > >> > >> > > don't see how this can be done without injecting >> requests, >> > > > which >> > > > >> is >> > > > >> > >> what >> > > > >> > >> > > the high-level connection is doing, but I am probably >> > missing >> > > > >> > >> something >> > > > >> > >> > -- >> > > > >> > >> > > no need to go into it unless it is easy to explain.) >> > > > >> > >> > > >> > > > >> > >> > > 2) *When requests not related to re-authentication are >> able >> > > to >> > > > >> use >> > > > >> > the >> > > > >> > >> > > re-authenticating connection*. The low-level approach >> > > finishes >> > > > >> > >> > > re-authentication completely before allowing anything >> else >> > to >> > > > >> > traverse >> > > > >> > >> > the >> > > > >> > >> > > connection, and we agree that this is how the low-level >> > > > >> > implementation >> > > > >> > >> > must >> > > > >> > >> > > work without significant work/changes. 
The current PR >> > > > >> > implementation >> > > > >> > >> > > interleaves re-authentication requests with the existing >> > flow >> > > > in >> > > > >> the >> > > > >> > >> > > asynchronous I/O uses cases (Consumer, Producer, Admin >> > > Client, >> > > > >> > etc.), >> > > > >> > >> and >> > > > >> > >> > > it allows the two synchronous use cases >> > > > (ControllerChannelManager >> > > > >> > and >> > > > >> > >> > > ReplicaFetcherBlockingSend/ReplicaFetcherThread) to >> decide >> > > > which >> > > > >> > style >> > > > >> > >> > they >> > > > >> > >> > > want. I (somewhat arbitrarily, to prove out the >> > flexibility >> > > of >> > > > >> the >> > > > >> > >> > > high-level approach) coded ControllerChannelManager to >> do >> > > > >> complete, >> > > > >> > >> > > non-interleaved re-authentication and >> > > > ReplicaFetcherBlockingSend/ >> > > > >> > >> > > ReplicaFetcherThread to interleave requests. The >> approach >> > > has >> > > > >> > >> impacts on >> > > > >> > >> > > the maximum size of latency spikes: interleaving >> requests >> > can >> > > > >> > decrease >> > > > >> > >> > the >> > > > >> > >> > > latency. The benefit of interleaving is tough to >> evaluate >> > > > >> because >> > > > >> > it >> > > > >> > >> > isn't >> > > > >> > >> > > clear what the latency requirement really is. Note that >> > > > >> > >> > re-authentication >> > > > >> > >> > > can entail several network round-trips between the >> client >> > and >> > > > the >> > > > >> > >> broker. >> > > > >> > >> > > Comments in this area would be especially appreciated. >> > > > >> > >> > > >> > > > >> > >> > > 3) *What happens if re-authentication fails (i.e. retry >> > > > >> > >> capability)*. I >> > > > >> > >> > > think this is where we have not yet settled on what the >> > > > >> requirement >> > > > >> > is >> > > > >> > >> > > (even more so than the issue of potential latency >> > mitigation >> > > > >> > >> > > requirements). 
Rajini, you had mentioned that re-authentication should work the same way as authentication, but I see the two situations as being asymmetric. In the authentication case, if authentication fails, the connection was never fully established and could not be used, and it is closed -- TLS negotiation may have taken place, so that effort is lost, but beyond that nothing else is lost. In the re-authentication case the connection is fully established, it is in use, and in fact it can remain in use for some time going forward as long as the existing credential remains unexpired; abandoning it at this point due to a re-authentication failure (which can occur due to no fault of the client -- i.e. a remote directory or OAuth server being temporarily down) abandons a fully functioning connection with remaining lifetime. Am I thinking about this reasonably? I still tend to believe that retries of failed re-authentications are a requirement.

4) *Size*. The low-level approach is smaller in terms of code. It is a prototype, so it doesn't have everything the high-level approach has, but I have split the PR for the high-level approach into two separate commits to facilitate a more apples-to-apples comparison: the first commit contains infrastructure that would have to be added to the low-level prototype if we went with that, and the second commit has everything specific to the high-level approach. So comparing the low-level prototype commit (~350 changes) to the second commit in the high-level PR (~1400 changes) is reasonable. The high-level approach at this point is about 4 times as big, though it does have lots of error/sanity checking and functionality (such as retry) that would to at least some degree have to be added to the low-level implementation. Still, the high-level approach is (and will likely remain) somewhat bigger than the low-level approach.

I hope I presented this in a balanced and accurate way; Rajini, please correct me if I got anything wrong or left out anything important. Apologies in advance if that is the case -- it is unintended.
Again, comments/discussion from the wider community at this juncture would be especially appreciated.

Ron

On Mon, Sep 10, 2018 at 7:42 AM Rajini Sivaram <rajinisiva...@gmail.com> wrote:

Hi Ron,

Thank you for the analysis. Yes, I agree with the differences you have pointed out between the two approaches.

1. Re-authenticating idle connections (or connections nearing idle timeout): With the lower-level approach, there is nothing stopping us from re-authenticating idle connections. We could either consider idle connections for re-authentication, or we could implement the broker-side code that kills connections so that it deals with the delays introduced by idleness. Since we expect token lifetimes to be much higher than connection expiry time, either would work. And the second approach is probably better.

2. Interleaving authentication requests and application requests to reduce impact on latency: It is not impossible to interleave with the low-level approach, but it adds complexity and reduces the value of this approach. So it will be good to understand the requirements in terms of latency. In terms of comparing the two approaches, it is fair to compare the non-interleaved low-level approach with the naturally interleaved high-level approach.

3. Retries: I don't think the requirement for retries is different between authentication and re-authentication from a client's point of view. We should either support retries for both or not at all. We currently process authentication failures as fatal errors. We expect that if you are using external authentication servers, those servers will be highly available. Our built-in authentication mechanisms avoid accessing external servers in the authentication path (during authentication, verification is performed using private/public keys or cache lookup). This is necessary in our design since authentications are performed synchronously on the network thread.

One more point with the lower-level mechanism: If one day we decided that we want to extend this for SSL, it would be possible to change just the transport layer and make it work for SSL. We may never want to do this, but I thought I would point it out anyway.

I think the next step would be to add the low-level approach to the KIP under `Rejected Alternatives` with details on the differences, and more detail on requirements covering latency and retries in the body of the KIP. For "Rejected Alternatives", can we add sub-sections? (I think there are two approaches there now, but you need to read through to figure that out, so sub-titles would help.)

Once the KIP is updated, it will be good to get more feedback from the community to decide on the approach to adopt.

On Sat, Sep 8, 2018 at 1:27 PM, Ron Dagostino <rndg...@gmail.com> wrote:

Hi yet again, Rajini.
If we decide that interleaving is impossible with the low-level approach but we still want to at least support the possibility given that it reduces the size of latency spikes, then I do have a suggestion. I documented in the KIP how I rejected a single, one-size-fits-all queue approach because we don't know when poll() will be called in the synchronous I/O use case. An advantage of such an approach -- if it would have worked, which it didn't -- is that it would have been transparent to the owner of the NetworkClient instance (this is one of the big advantages of the low-level approach, after all). But we could make the one-size-fits-all queue approach work if we are willing to impose some requirements on synchronous I/O users of NetworkClient. Specifically, we could add a method to the org.apache.kafka.clients.KafkaClient interface (which NetworkClient implements) as follows:

    /**
     * Return true if any node has re-authentication requests waiting to be sent.
     * A call to {@link #poll(long, long)} is required to send such requests. An
     * owner of this instance that does not implement a run loop to repeatedly
     * call {@link #poll(long, long)} but instead only sends requests
     * synchronously on-demand must call this method periodically -- and invoke
     * {@link #poll(long, long)} if the return value is {@code true} -- to ensure
     * that any re-authentication requests that have been injected are sent in a
     * timely fashion.
     *
     * @return true if any node has re-authentication requests waiting to be sent
     */
    default boolean hasReauthenticationRequests() {
        return false;
    }

If we are willing to add an additional method like this and make the minor adjustments to the code associated with the few synchronous use cases then I think the high-level approach will work. We would then have the possibility of further parameterizing the various synchronous I/O use cases. For example, there are currently 3:

    ControllerChannelManager
    KafkaProducer, via Sender
    ReplicaFetcherBlockingSend, via ReplicaFetcherThread

Is it conceivable that one of these use cases might be more latency-sensitive than others and might desire interleaving? A high-level approach would give us the option of configuring each use case accordingly.
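To sketch how a synchronous owner might use the proposed method (hypothetical code: the interface below is a minimal stand-in for org.apache.kafka.clients.KafkaClient, reduced to just the two members this sketch needs):

```java
public class SyncOwnerSketch {
    // Minimal stand-in for the KafkaClient interface plus the proposed default
    // method; the real interface has many more members.
    interface MiniKafkaClient {
        default boolean hasReauthenticationRequests() { return false; }
        void poll(long timeoutMs, long nowMs);
    }

    // A synchronous owner with no run loop would call this periodically, invoking
    // poll() only when injected re-authentication requests are waiting to be sent.
    static boolean maybeSendReauthRequests(MiniKafkaClient client, long nowMs) {
        if (client.hasReauthenticationRequests()) {
            client.poll(0L, nowMs);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        MiniKafkaClient idleClient = new MiniKafkaClient() {
            @Override public void poll(long timeoutMs, long nowMs) { /* no-op */ }
        };
        // The default implementation reports no pending requests, so poll() is
        // skipped and the owner continues with its normal synchronous sends.
        System.out.println(maybeSendReauthRequests(idleClient, 0L)); // false
    }
}
```

The point of the default implementation returning false is backward compatibility: existing KafkaClient implementations compile unchanged, and only owners that care about timely re-authentication need to call the new method.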
Ron

On Fri, Sep 7, 2018 at 10:50 PM Ron Dagostino <rndg...@gmail.com> wrote:

Hi again, Rajini. It occurs to me that from a *behavior* perspective there are really 3 fundamental differences between the low-level approach you provided and the high-level approach as it currently exists in the PR:

1) *When re-authentication starts*. The low-level approach initiates re-authentication only if/when the connection is actually used, so it may start after the existing credential expires; the current PR implementation injects re-authentication requests into the existing flow, and re-authentication starts immediately regardless of whether or not the connection is being used for something else.

2) *When requests not related to re-authentication can use the re-authenticating connection*. The low-level approach finishes re-authentication completely before allowing anything else to traverse the connection; the current PR implementation interleaves re-authentication requests with existing flow requests.

3) *What happens if re-authentication fails*. The low-level approach results in a closed connection and does not support retries -- at least as currently implemented; the current PR implementation supports retries.

Do you agree that these are the (only?) behavioral differences?

For these facets of behavior, I wonder what the requirements are for this feature. I think they are as follows:

1) *When re-authentication starts*: I don't think we have a hard requirement here when considered in isolation -- whether re-authentication starts immediately or is delayed until the connection is used probably doesn't matter.

2) *When requests not related to re-authentication can use the re-authenticating connection*: there is a tradeoff here between latency and the ability to debug re-authentication problems. Delaying use of the connection until re-authentication finishes results in bigger latency spikes but keeps re-authentication requests somewhat together; interleaving minimizes the size of individual latency spikes but adds some separation between the requests.

3) *What happens if re-authentication fails*: I believe we have a clear requirement for retry capability given what I have written previously. Do you agree? Note that while the current low-level implementation does not support retry, I have been thinking about how that can be done, and I am pretty sure it can be. We can keep the old authenticators on the client and server sides and put them back into place if the re-authentication fails; we would also have to make sure the server side doesn't delay any failed re-authentication result and also doesn't close the connection upon re-authentication failure. I think I see how to do all of that.

There are some interactions between the above requirements. If re-authentication can't start immediately and has to wait for the connection to be used, then that precludes interleaving, because we can't be sure that the credential will be active by the time it is used -- if it isn't active, and we allow interleaving, then requests not related to re-authentication will fail if server-side connection-close-due-to-expired-credential functionality is in place.
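The retry bookkeeping described in point 3 above -- restore the old authenticators on failure, keep the connection open, and try again after a backoff -- could look roughly like this (hypothetical names; the real change would live inside the channel/authenticator code):

```java
public class ReauthRetrySketch {
    private final long backoffMs;
    private long nextAttemptMs = 0L;

    ReauthRetrySketch(long backoffMs) { this.backoffMs = backoffMs; }

    // A failed re-authentication leaves the connection usable (the previously
    // established authenticator is restored) and schedules the next attempt.
    void onReauthenticationFailure(long nowMs) {
        nextAttemptMs = nowMs + backoffMs;
    }

    // The channel consults this before injecting another re-authentication
    // attempt, so a temporarily-down OAuth/directory server causes backoff
    // rather than a closed connection.
    boolean readyToRetry(long nowMs) {
        return nowMs >= nextAttemptMs;
    }

    public static void main(String[] args) {
        ReauthRetrySketch retry = new ReauthRetrySketch(5_000L);
        retry.onReauthenticationFailure(100_000L);        // e.g. token endpoint down
        System.out.println(retry.readyToRetry(102_000L)); // false: still backing off
        System.out.println(retry.readyToRetry(105_000L)); // true: try again
    }
}
```

This only works if the broker, as noted above, neither delays the failure response nor closes the connection on a failed re-authentication while the old credential is still unexpired.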
Also note that any such server-side connection-close-due-to-expired-credential functionality would likely have to avoid closing a connection until it is used for anything other than re-authentication -- it must allow re-authentication requests through when the credential is expired.

Given all of the above, it feels to me like the low-level solution is viable only under the following conditions:

1) We must accept that re-authentication is delayed until the connection is used
2) We must accept the bigger latency spikes associated with not interleaving requests

Does this sound right to you, and if so, do you find these conditions acceptable? Or have I missed something and/or made incorrect assumptions somewhere?

Ron

On Fri, Sep 7, 2018 at 5:19 PM Ron Dagostino <rndg...@gmail.com> wrote:

Hi Rajini.
The code really helped me to understand what you were thinking -- thanks. I'm still digesting, but here are some quick observations:

Your approach (I'll call it the "low-level" approach, as compared to the existing PR, which works at a higher level of abstraction) -- the low-level approach is certainly intriguing. The smaller code size is welcome, of course, as is the fact that re-authentication simply works for everyone regardless of the style of use (async vs. sync I/O).

I did notice that re-authentication of the connection starts only if/when the client uses the connection. For example, if I run a console producer, re-authentication does not happen unless I try to produce something. On the one hand this is potentially good -- if the client isn't using the connection then the connection stays "silent" and could be closed on the server side if it stays idle long enough -- whereas if we start injecting re-authentication requests (as is done in the high-level approach) then the connection never goes completely silent and could (?) potentially avoid being closed due to idleness.

However, if we implement server-side kill of connections using expired credentials (which we agree is needed), then we might end up with the broker killing connections that are sitting idle for only a short period of time. For example, if we refresh the token on the client side and tell the connection that it is eligible to be re-authenticated, then it is conceivable that the connection might be sitting idle at that point and might not be used until after the token it is currently using expires. The server might kill the connection, and that would force the client to re-connect with a new connection (requiring TLS negotiation). The probability of this happening increases as the token lifetime decreases, of course, and it can be offset by decreasing the window factor (i.e. make the connection eligible for re-authentication at 50% of the lifetime rather than 80%, for example -- it would have to sit idle for longer before the server tried to kill it).
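The window-factor arithmetic above reduces to a one-line calculation; a sketch (the 80%/50% factors are the examples from this thread, and the function name is invented):

```java
public class ReauthWindowSketch {
    // Point at which a connection becomes eligible for re-authentication:
    // the (re-)authentication time plus windowFactor * credential lifetime.
    static long reauthEligibleAtMs(long authTimeMs, long lifetimeMs, double windowFactor) {
        return authTimeMs + (long) (lifetimeMs * windowFactor);
    }

    public static void main(String[] args) {
        long lifetimeMs = 60_000L; // e.g. a 60-second token
        // At an 80% window factor the connection has only the last 20% of the
        // lifetime to sit idle before the token expires; at 50% it has half.
        System.out.println(reauthEligibleAtMs(0L, lifetimeMs, 0.8)); // 48000
        System.out.println(reauthEligibleAtMs(0L, lifetimeMs, 0.5)); // 30000
    }
}
```

Lowering the factor widens the idle window before expiry, at the cost of re-authenticating more often.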
We haven't implemented server-side kill yet, so maybe we could make it intelligent and only kill the connection if it is used (for anything except re-authentication) after the expiration time...

I also wonder about the ability to add retry into the low-level approach. Do you think it would be possible? It doesn't seem like it to me -- at least not without some big changes that would eliminate some of the advantage of being able to reuse the existing authentication code. The reason I ask is because I think retry is necessary. It is part of how refreshes work for both GSSAPI and OAUTHBEARER -- they refresh based on some window factor (i.e. 80%) and implement periodic retry upon failure so that they can maximize the chances of having a new credential available for any new connection attempt. Without refresh we could end up in the situation where the connection still has some lifetime left (10%, 20%, or whatever) but it tries to re-authenticate and cannot through no fault of its own (i.e. token endpoint down, some Kerberos failure, etc.) -- the connection is closed at that point, and it is then unable to reconnect because of the same temporary problem. We could end up with an especially ill-timed, temporary outage in some non-Kafka system (related to OAuth or Kerberos, or some LDAP directory) causing all clients to be kicked off the cluster. Retry capability seems to me to be the way to mitigate this risk.

Anyway, that's it for now. I really like the approach you outlined -- at least at this point based on my current understanding. I will continue to dig in, and I may send more comments/questions. But for now, I think the lack of retry -- and my definitely-could-be-wrong sense that it can't easily be added -- is my biggest concern with this low-level approach.

Ron