Yes I think there are two policy issues but I don't think they are
separate/mutually exclusive for the purposes of this discussion - the
first being what should the broker do if quota is violated and second,
what should it return to the client (error code or status code). (The
separate discussion unrelated to quotas is in general do we want to
include status code with errors - e.g., ReplicaNotAvailable in topic
metadata response - and I think everyone agrees that we should not.)

When we first started this thread I was leaning more toward error code
and immediate response (so the client can react accordingly), but
after this discussion I'm no longer convinced about that. (The error
code is appropriate I think in this case since the request is actually
dropped due to a quota violation.) An "issue" with this is that the
broker cannot effectively protect itself against "simple" clients that
don't back off properly. I actually think this may not be a huge issue
though because regardless of quotas there needs to be lower-level
protection against DoS - i.e., this applies even for the hold and
respond approach. Something other than a Kafka client can flood a
network. If the policy is to reject the request and respond
immediately (or wait up to current request timeout) on quota violation
then an error code is appropriate (since the append was rejected).

With the second approach, (for producer request do an append, hold
request and then respond), an error code does not really make sense.
The main concern here is request timeout. I agree with Jay that if we
improve the semantics of timeout (possibly adding a separate request
timeout) then this approach would be less controversial. i.e., for
producer requests there should be two timeouts - replication timeout
and request timeout, the latter being very large. One nuance to this
is I think it should be a broker-side setting (not client-side) that
needs to be communicated to the client somehow since the client needs
to know in advance a ceiling on how long it can expect to wait for a
response. So if the request succeeds immediately or fails due to a
usual error (e.g., slow replica and therefore replication timeout) the
client will get a response within the replication timeout. Otherwise,
it may block until the full request timeout if quota is violated.

Both approaches ideally need some negotiation - in the first approach,
the client should ideally be told its current quota from which it can
estimate how long it should ideally back off. In the second approach,
the client needs to know how long a request may be held and the broker
enforce backoff up to this limit on quota violations. The latter seems
simpler for client implementation.

Thanks,

Joel

On Mon, Mar 16, 2015 at 10:58:02PM -0700, Guozhang Wang wrote:
> I think we are really discussing two separate issues here:
> 
> 1. Whether we should a) append-then-block-then-returnOKButThrottled or b)
> block-then-returnFailDuetoThrottled for quota actions on produce requests.
> 
> Both these approaches assume some kind of well-behaveness of the clients:
> option a) assumes the client sets an proper timeout value while can just
> ignore "OKButThrottled" response, while option b) assumes the client
> handles the "FailDuetoThrottled" appropriately. For any malicious clients
> that, for example, just keep retrying either intentionally or not, neither
> of these approaches are actually effective.
> 
> 2. For "OKButThrottled" and "FailDuetoThrottled" responses, shall we encode
> them as error codes or augment the protocol to use a separate field
> indicating "status codes".
> 
> Today we have already incorporated some status code as error codes in the
> responses, e.g. ReplicaNotAvailable in MetadataResponse, the pros of this
> is of course using a single field for response status like the HTTP status
> codes, while the cons is that it requires clients to handle the error codes
> carefully.
> 
> I think maybe we can actually extend the single-code approach to overcome
> its drawbacks, that is, wrap the error codes semantics to the users so that
> users do not need to handle the codes one-by-one. More concretely,
> following Jay's example the client could write sth. like this:
> 
> 
> -----------------
> 
>   if(error.isOK())
>      // status code is good or the code can be simply ignored for this
> request type, process the request
>   else if(error.needsRetry())
>      // throttled, transient error, etc: retry
>   else if(error.isFatal())
>      // non-retriable errors, etc: notify / terminate / other handling
> 
> -----------------
> 
> Only when the clients really want to handle, for example FailDuetoThrottled
> status code specifically, it needs to:
> 
>   if(error.isOK())
>      // status code is good or the code can be simply ignored for this
> request type, process the request
>   else if(error == FailDuetoThrottled )
>      // throttled: log it
>   else if(error.needsRetry())
>      // transient error, etc: retry
>   else if(error.isFatal())
>      // non-retriable errors, etc: notify / terminate / other handling
> 
> -----------------
> 
> And for implementation we can probably group the codes accordingly like
> HTTP status code such that we can do:
> 
> boolean Error.isOK() {
>   return code < 300 && code >= 200;
> }
> 
> Guozhang
> 
> On Mon, Mar 16, 2015 at 10:24 PM, Ewen Cheslack-Postava <e...@confluent.io>
> wrote:
> 
> > Agreed that trying to shoehorn non-error codes into the error field is a
> > bad idea. It makes it *way* too easy to write code that looks (and should
> > be) correct but is actually incorrect. If necessary, I think it's much
> > better to to spend a couple of extra bytes to encode that information
> > separately (a "status" or "warning" section of the response). An indication
> > that throttling is occurring is something I'd expect to be indicated by a
> > bit flag in the response rather than as an error code.
> >
> > Gwen - I think an error code makes sense when the request actually failed.
> > Option B, which Jun was advocating, would have appended the messages
> > successfully. If the rate-limiting case you're talking about had
> > successfully committed the messages, I would say that's also a bad use of
> > error codes.
> >
> >
> > On Mon, Mar 16, 2015 at 10:16 PM, Gwen Shapira <gshap...@cloudera.com>
> > wrote:
> >
> > > We discussed an error code for rate-limiting (which I think made
> > > sense), isn't it a similar case?
> > >
> > > On Mon, Mar 16, 2015 at 10:10 PM, Jay Kreps <jay.kr...@gmail.com> wrote:
> > > > My concern is that as soon as you start encoding non-error response
> > > > information into error codes the next question is what to do if two
> > such
> > > > codes apply (i.e. you have a replica down and the response is
> > quota'd). I
> > > > think I am trying to argue that error should mean "why we failed your
> > > > request", for which there will really only be one reason, and any other
> > > > useful information we want to send back is just another field in the
> > > > response.
> > > >
> > > > -Jay
> > > >
> > > > On Mon, Mar 16, 2015 at 9:51 PM, Gwen Shapira <gshap...@cloudera.com>
> > > wrote:
> > > >
> > > >> I think its not too late to reserve a set of error codes (200-299?)
> > > >> for "non-error" codes.
> > > >>
> > > >> It won't be backward compatible (i.e. clients that currently do "else
> > > >> throw" will throw on non-errors), but perhaps its worthwhile.
> > > >>
> > > >> On Mon, Mar 16, 2015 at 9:42 PM, Jay Kreps <jay.kr...@gmail.com>
> > wrote:
> > > >> > Hey Jun,
> > > >> >
> > > >> > I'd really really really like to avoid that. Having just spent a
> > > bunch of
> > > >> > time on the clients, using the error codes to encode other
> > information
> > > >> > about the response is super dangerous. The error handling is one of
> > > the
> > > >> > hardest parts of the client (Guozhang chime in here).
> > > >> >
> > > >> > Generally the error handling looks like
> > > >> >   if(error == none)
> > > >> >      // good, process the request
> > > >> >   else if(error == KNOWN_ERROR_1)
> > > >> >      // handle known error 1
> > > >> >   else if(error == KNOWN_ERROR_2)
> > > >> >      // handle known error 2
> > > >> >   else
> > > >> >      throw Errors.forCode(error).exception(); // or some other
> > default
> > > >> > behavior
> > > >> >
> > > >> > This works because we have a convention that and error is something
> > > that
> > > >> > prevented your getting the response so the default handling case is
> > > sane
> > > >> > and forward compatible. It is tempting to use the error code to
> > convey
> > > >> > information in the success case. For example we could use error
> > codes
> > > to
> > > >> > encode whether quotas were enforced, whether the request was served
> > > out
> > > >> of
> > > >> > cache, whether the stock market is up today, or whatever. The
> > problem
> > > is
> > > >> > that since these are not errors as far as the client is concerned it
> > > >> should
> > > >> > not throw an exception but process the response, but now we created
> > an
> > > >> > explicit requirement that that error be handled explicitly since it
> > is
> > > >> > different. I really think that this kind of information is not an
> > > error,
> > > >> it
> > > >> > is just information, and if we want it in the response we should do
> > > the
> > > >> > right thing and add a new field to the response.
> > > >> >
> > > >> > I think you saw the Samza bug that was literally an example of this
> > > >> > happening and leading to an infinite retry loop.
> > > >> >
> > > >> > Further more I really want to emphasize that hitting your quota in
> > the
> > > >> > design that Adi has proposed is actually not an error condition at
> > > all.
> > > >> It
> > > >> > is totally reasonable in any bootstrap situation to intentionally
> > > want to
> > > >> > run at the limit the system imposes on you.
> > > >> >
> > > >> > -Jay
> > > >> >
> > > >> >
> > > >> >
> > > >> > On Mon, Mar 16, 2015 at 4:27 PM, Jun Rao <j...@confluent.io> wrote:
> > > >> >
> > > >> >> It's probably useful for a client to know whether its requests are
> > > >> >> throttled or not (e.g., for monitoring and alerting). From that
> > > >> >> perspective, option B (delay the requests and return an error)
> > seems
> > > >> >> better.
> > > >> >>
> > > >> >> Thanks,
> > > >> >>
> > > >> >> Jun
> > > >> >>
> > > >> >> On Wed, Mar 4, 2015 at 3:51 PM, Aditya Auradkar <
> > > >> >> aaurad...@linkedin.com.invalid> wrote:
> > > >> >>
> > > >> >> > Posted a KIP for quotas in kafka.
> > > >> >> >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-13+-+Quotas
> > > >> >> >
> > > >> >> > Appreciate any feedback.
> > > >> >> >
> > > >> >> > Aditya
> > > >> >> >
> > > >> >>
> > > >>
> > >
> >
> >
> >
> > --
> > Thanks,
> > Ewen
> >
> 
> 
> 
> -- 
> -- Guozhang

Reply via email to