Re: [DISCUSS] KIP-152 - Improve diagnostics for SASL authentication failures

Rajini Sivaram Mon, 21 Aug 2017 06:25:16 -0700

Hi Jun,

Thank you for the review.


1. Each token response will indicate success/failure. At the moment, the
broker simply closes the connection in the case of failure. With this KIP,
an empty token with an error_code indicating authentication failure will be
sent by the broker before closing the connection. I have made this clearer
in the KIP.

2. I have added a section on versioning of SaslAuthenticate and
SaslHandshake requests.


On Fri, Aug 18, 2017 at 2:47 PM, Jun Rao <j...@confluent.io> wrote:

> Hi, Rajini,
>
> Thanks for the KIP. Looks good overall. Just a couple of minor comments.
>
> 1. "The final message from the broker will indicate if authentication
> succeeded or failed." Are we doing something special for the final message?
> I thought each token response will indicate success or failure in the error
> code.
>
> 2. It would be useful to document how we plan to evolve
> SaslAuthenticateRequest
> in the future. For example, if we add a new error code in
> SaslAuthenticateResponse,
> should we bump up the version of SaslAuthenticateRequest? When do we need
> to bump up SaslHandshakeRequest?
>
> Jun
>
>
> On Fri, Aug 18, 2017 at 10:59 AM, Ismael Juma <ism...@juma.me.uk> wrote:
>
> > Thanks Rajini. The changes look good. We have to update the motivation to
> > take into account the improvements in 0.11:
> >
> > "With the current Kafka SASL implementation, broker closes the client
> > connection if SASL authentication fails. Clients see this as a connection
> > failure and do not get any feedback for the reason why the connection was
> > closed. Producers and consumers retry, attempting to create successful
> > connections, treating authentication failures as transient failures.
> There
> > are no log entries on the client-side to indicate that any of these
> > connection failures were due to authentication failure."
> >
> > 0.11.0.x logs a warning with a hint that the disconnect is likely related
> > to authentication.
> >
> > Ismael
> >
> > On Tue, Aug 15, 2017 at 9:23 PM, Rajini Sivaram <rajinisiva...@gmail.com
> >
> > wrote:
> >
> > > I have updated the KIP based on the discussions so far. It will be good
> > if
> > > we can get some more feedback so that this can be implemented for
> 1.0.0.
> > >
> > >
> > > Thanks,
> > >
> > > Rajini
> > >
> > >
> > > On Thu, May 4, 2017 at 10:22 PM, Ismael Juma <ism...@juma.me.uk>
> wrote:
> > >
> > > > Hi Rajini,
> > > >
> > > > I think we were talking about slightly different things. I was just
> > > > referring to the fact that there are cases where we throw an
> > > > AuthorizationException back to the user without retrying from various
> > > > methods (poll, commitSync, etc).
> > > >
> > > > As you said, my initial preference was for not retrying at all
> because
> > it
> > > > is what you want in the common case of a misconfigured application. I
> > > > hadn't considered credential updates for authenticators that rely on
> > > > eventual consistency. Thinking about it some more, it seems like this
> > > > should be solved by the authenticator implementation as well. For
> > > example,
> > > > it could refresh the cached data for a user if authentication failed
> (a
> > > > good implementation would be a bit more involved to avoid going to
> the
> > > > underlying data source too often).
> > > >
> > > > Given that, not retrying sounds good to me.
> > > >
> > > > Ismael
> > > >
> > > > On Thu, May 4, 2017 at 4:04 PM, Rajini Sivaram <
> > rajinisiva...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi Ismael,
> > > > >
> > > > > I thought the blocking waits in the producer and consumer are
> always
> > > > > related to retrying for metadata. So an authorization exception
> that
> > > > > impacts this wait can only be due to Describe authorization
> failure -
> > > > that
> > > > > always retries?
> > > > >
> > > > > I agree that connecting to different brokers when authentication
> > fails
> > > > with
> > > > > one is not desirable. But I am not keen on retrying with a suitable
> > > > backoff
> > > > > until timeout either. Because that has the same problem as the
> > scenario
> > > > > that you described. The next metadata request could be to broker-1
> to
> > > > which
> > > > > authentication succeeds and subsequent produce/consume  to broker-0
> > > could
> > > > > still fail.
> > > > >
> > > > > How about we just fail fast if one authentication fails - I think
> > that
> > > is
> > > > > what you were suggesting in the first place? We don't need to
> > blackout
> > > > any
> > > > > nodes beyond the reconnect backoff interval. Applications can still
> > > retry
> > > > > if they want to. In the case of credential updates, it will be up
> to
> > > the
> > > > > application to retry. During regular operation, a misconfigured
> > > > application
> > > > > fails fast with a meaningful exception. What do you think?
> > > > >
> > > > > Regards,
> > > > >
> > > > > Rajini
> > > > >
> > > > >
> > > > > On Thu, May 4, 2017 at 3:01 PM, Ismael Juma <ism...@juma.me.uk>
> > wrote:
> > > > >
> > > > > > H Rajini,
> > > > > >
> > > > > > Comments inline.
> > > > > >
> > > > > > On Thu, May 4, 2017 at 2:29 PM, Rajini Sivaram <
> > > > rajinisiva...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Ismael,
> > > > > > >
> > > > > > > Thank you for reviewing the KIP.
> > > > > > >
> > > > > > > An authenticated client that is not authorized to access a
> topic
> > is
> > > > > never
> > > > > > > told that the operation was not authorized. This is to prevent
> > the
> > > > > client
> > > > > > > from finding out if the topic exists by sending an unauthorized
> > > > > request.
> > > > > > So
> > > > > > > in this case, the client will retry metadata requests with the
> > > > > configured
> > > > > > > backoff until it times out.
> > > > > >
> > > > > >
> > > > > > This is true if the user does not have Describe permission. If
> the
> > > user
> > > > > has
> > > > > > Describe access and no Read or Write access, then the user is
> > > informed
> > > > > that
> > > > > > the operation was not authorized.
> > > > > >
> > > > > >
> > > > > > > Another important distinction for authorization failures is
> that
> > > the
> > > > > > > connection is not terminated.
> > > > > > >
> > > > > > > For unauthenticated clients, we do want to inform the client
> that
> > > > > > > authentication failed. The connection is terminated by the
> > broker.
> > > > > > > Especially if the client is using SASL_SSL, we really do want
> to
> > > > avoid
> > > > > > > reconnections that result in unnecessary expensive handshakes.
> So
> > > we
> > > > > want
> > > > > > > to return an exception to the user with minimal retries.
> > > > > > >
> > > > > >
> > > > > > Agreed.
> > > > > >
> > > > > > I was thinking that it may be useful to try more than one broker
> > for
> > > > the
> > > > > > > case where brokers are being upgraded and some brokers haven't
> > yet
> > > > seen
> > > > > > the
> > > > > > > latest credentials. I suppose I was thinking that at the moment
> > we
> > > > keep
> > > > > > on
> > > > > > > retrying every broker forever in the consumer and suddenly if
> we
> > > stop
> > > > > > > retrying altogether, it could potentially lead to some
> unforeseen
> > > > > timing
> > > > > > > issues. Hence the suggestion to try every broker once.
> > > > > > >
> > > > > >
> > > > > > I see. Retrying forever is a side-effect of auto-topic creation,
> > but
> > > > it's
> > > > > > something we want to move away from. As mentioned, we actually
> > don't
> > > > > retry
> > > > > > at all if the user has Describe permission.
> > > > > >
> > > > > > Broker upgrades could be fixed by ensuring that the latest
> > > credentials
> > > > > are
> > > > > > loaded before the broker starts serving requests. More
> problematic
> > is
> > > > > > dealing with credential updates. This is another distinction when
> > > > > compared
> > > > > > to authorization.
> > > > > >
> > > > > > I am not sure if trying different brokers really helps us though.
> > > Say,
> > > > we
> > > > > > fail to authenticate with broker 0 and then we succeed with
> broker
> > 1.
> > > > > This
> > > > > > helps with metadata requests, but we will be in trouble when we
> try
> > > to
> > > > > > produce or consume to broker 0 (because it's the leader of some
> > > > > > partitions). So maybe we just want to retry with a suitable
> backoff
> > > > > until a
> > > > > > timeout?
> > > > > >
> > > > > > Yes, I agree that blacking out nodes forever isn't a good idea.
> > When
> > > we
> > > > > > > throw AuthenticationFailedException for the current operation
> or
> > if
> > > > > > > authentication to another broker succeeds, we can clear the
> > > blackout
> > > > so
> > > > > > > that any new request from the client can attempt reconnection
> > after
> > > > the
> > > > > > > reconnect backoff period as they do now.
> > > > > > >
> > > > > >
> > > > > > Yes, that would be better if we decide that connecting to
> different
> > > > > brokers
> > > > > > is worthwhile for the requests that can be sent to any broker.
> > > > > >
> > > > > > Ismael
> > > > > >
> > > > > > On Thu, May 4, 2017 at 2:29 PM, Rajini Sivaram <
> > > > rajinisiva...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Ismael,
> > > > > > >
> > > > > > > Thank you for reviewing the KIP.
> > > > > > >
> > > > > > > An authenticated client that is not authorized to access a
> topic
> > is
> > > > > never
> > > > > > > told that the operation was not authorized. This is to prevent
> > the
> > > > > client
> > > > > > > from finding out if the topic exists by sending an unauthorized
> > > > > request.
> > > > > > So
> > > > > > > in this case, the client will retry metadata requests with the
> > > > > configured
> > > > > > > backoff until it times out. Another important distinction for
> > > > > > authorization
> > > > > > > failures is that the connection is not terminated.
> > > > > > >
> > > > > > > For unauthenticated clients, we do want to inform the client
> that
> > > > > > > authentication failed. The connection is terminated by the
> > broker.
> > > > > > > Especially if the client is using SASL_SSL, we really do want
> to
> > > > avoid
> > > > > > > reconnections that result in unnecessary expensive handshakes.
> So
> > > we
> > > > > want
> > > > > > > to return an exception to the user with minimal retries.
> > > > > > >
> > > > > > > I was thinking that it may be useful to try more than one
> broker
> > > for
> > > > > the
> > > > > > > case where brokers are being upgraded and some brokers haven't
> > yet
> > > > seen
> > > > > > the
> > > > > > > latest credentials. I suppose I was thinking that at the moment
> > we
> > > > keep
> > > > > > on
> > > > > > > retrying every broker forever in the consumer and suddenly if
> we
> > > stop
> > > > > > > retrying altogether, it could potentially lead to some
> unforeseen
> > > > > timing
> > > > > > > issues. Hence the suggestion to try every broker once.
> > > > > > >
> > > > > > > Yes, I agree that blacking out nodes forever isn't a good idea.
> > > When
> > > > we
> > > > > > > throw AuthenticationFailedException for the current operation
> or
> > if
> > > > > > > authentication to another broker succeeds, we can clear the
> > > blackout
> > > > so
> > > > > > > that any new request from the client can attempt reconnection
> > after
> > > > the
> > > > > > > reconnect backoff period as they do now.
> > > > > > >
> > > > > > > Regards,
> > > > > > >
> > > > > > > Rajini
> > > > > > >
> > > > > > > On Thu, May 4, 2017 at 12:51 PM, Ismael Juma <
> ism...@juma.me.uk>
> > > > > wrote:
> > > > > > >
> > > > > > > > Thanks Rajini. This is a good improvement. One question, the
> > > > proposal
> > > > > > > > states:
> > > > > > > >
> > > > > > > > Producer waitForMetadata and consumer ensureCoordinatorReady
> > will
> > > > be
> > > > > > > > > updated to throw AuthenticationFailedException if
> connections
> > > to
> > > > > all
> > > > > > > > > available brokers fail authentication.
> > > > > > > >
> > > > > > > >
> > > > > > > > Can you elaborate on the reason why we would treat
> > authentication
> > > > > > > failures
> > > > > > > > differently from authorization failures? It would be good to
> > > > > understand
> > > > > > > > under which scenario it would be beneficial to try all the
> > > brokers
> > > > > (it
> > > > > > > > seems that the proposal also suggests blacking out brokers
> > > > > permanently
> > > > > > if
> > > > > > > > we fail authentication, so that could also eventually cause
> > > > issues).
> > > > > > > >
> > > > > > > > Ismael
> > > > > > > >
> > > > > > > >
> > > > > > > > On Thu, May 4, 2017 at 12:37 PM, Rajini Sivaram <
> > > > > > rajinisiva...@gmail.com
> > > > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi all,
> > > > > > > > >
> > > > > > > > > I have created a KIP to improve diagnostics for SASL
> > > > authentication
> > > > > > > > > failures and reduce retries and blocking when
> authentication
> > > > fails:
> > > > > > > > >
> > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> 152+-+
> > > > > > > > > Improve+diagnostics+for+SASL+authentication+failures
> > > > > > > > >
> > > > > > > > > Comments and suggestions are welcome.
> > > > > > > > >
> > > > > > > > > Thank you...
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > >
> > > > > > > > > Rajini
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-152 - Improve diagnostics for SASL authentication failures

Reply via email to