Hi Becket,

2. Perhaps it is fine, though it may be better to limit throttle time to
connections.max.idle.ms, which is typically a rather large value.
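
For illustration, the cap I have in mind is roughly the following (a Java
sketch with made-up values, not the actual broker code path):

    // Cap the throttle time so a muted connection is never held longer
    // than the broker's idle-close limit.
    long connectionsMaxIdleMs = 600_000L;   // connections.max.idle.ms, default 10 minutes
    long computedThrottleMs = 1_200_000L;   // whatever the quota calculation produced
    long effectiveThrottleMs = Math.min(computedThrottleMs, connectionsMaxIdleMs);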

3. The maximum time a connection spends in CLOSE_WAIT is currently bounded
by connections.max.idle.ms. It is a large value, but we do close client
connections after that even if the connection has been muted. The KIP
proposes to disable the idle check for throttled clients, and that could be
an issue.

4. I agree that it is hard to come up with a limit for throttle time that
works for all cases, particularly since we need a set of timeouts, sizes and
quota limits that all make sense together. I think Jun's suggestion of basing
this limit on the total metric window size makes sense, because we currently
have no way of enforcing quotas beyond that window (you can always create a
new connection to bypass quotas for which we don't have any previous
metrics).
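
For illustration, the bound based on the total metric window would be
something like this (a sketch only; the two configs are the existing broker
quota window settings, shown with their default values):

    // Bound the throttle time by the total quota metric window, beyond
    // which we have no samples to enforce the quota against anyway.
    int quotaWindowNum = 11;              // quota.window.num
    int quotaWindowSizeSeconds = 1;       // quota.window.size.seconds
    long maxThrottleMs = (long) quotaWindowNum * quotaWindowSizeSeconds * 1000L;
    long computedThrottleMs = 1_200_000L; // produced by the quota calculation
    long effectiveThrottleMs = Math.min(computedThrottleMs, maxThrottleMs);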


4a) For request quotas, we already limit throttle time to the quota window size.


4b) For fetch requests, limiting throttle time to the total metric window
size and returning a smaller amount of data in each response could work well.
This reduces spikes in traffic and would allow clients to heartbeat,
rebalance, etc. The minimum achievable quota would then be bounded by the
maximum message size rather than the maximum request size. Perhaps that would
be sufficient?
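
As a rough illustration of returning less data per response (all numbers made
up; the idea is to size each response so that the implied throttle stays
within one metric window, but never below a single maximum-size message):

    long quotaBytesPerSec = 1_024;        // the client's byte-rate quota
    long totalWindowMs = 11_000;          // total metric window (see 4 above)
    long requestedBytes = 50_000_000;     // what the fetch asked for
    long maxMessageBytes = 1_000_000;     // largest single message we must still deliver
    // Budget for one window, floored at one max-size message.
    long windowBudgetBytes = quotaBytesPerSec * totalWindowMs / 1000;
    long bytesToReturn = Math.max(maxMessageBytes,
        Math.min(requestedBytes, windowBudgetBytes));

This is why the minimum achievable quota ends up tied to the maximum message
size: once the window budget drops below one message, we still have to return
that message.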


4c) For produce requests, I agree that just a bound on throttle time is not
sufficient.


   - We need a solution to improve flow control for well-behaved clients
   that currently rely entirely on the broker's throttling. The KIP addresses
   this with cooperative clients that sleep for an unbounded throttle time.
   I feel this is not ideal since the result is very spiky traffic. Feedback
   from brokers to enable flow control in the client is a good idea, but
   clients with excessive throttle times should really have been configured
   with smaller batch sizes (see the rough arithmetic after this list).
   - We need a solution to enforce smaller quotas to protect the broker
   from misbehaving clients. The KIP addresses this by muting channels for an
   unbounded time. That leaves channels stuck in CLOSE_WAIT, and it doesn't
   really solve all issues with misbehaving clients, since new connections
   can be created to bypass quotas.
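
To make the produce-side arithmetic concrete (made-up numbers, purely for
illustration):

    long batchBytes = 1_000_000;      // a single 1 MB produce request
    long quotaBytesPerSec = 1_024;    // a 1 KB/s producer quota
    long impliedThrottleMs = batchBytes * 1000L / quotaBytesPerSec;  // ~976,000 ms (~16 minutes)
    // Capping the throttle at an 11-second metric window would let this
    // client send roughly 90x its quota, so a bound on throttle time alone
    // neither enforces the quota nor removes the need to mute the channel.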

I feel that it is a bit difficult to get this right without addressing the
issue of new connections that bypass quotas (an issue in the current
implementation rather than in the KIP).

Regards,

Rajini


On Sat, Nov 4, 2017 at 3:11 AM, Becket Qin <becket....@gmail.com> wrote:

> Thanks Rajini.
>
> 1. Good point. We do need to bump up the protocol version so that the new
> clients do not wait for another throttle time when they are talking to old
> brokers. I'll update the KIP.
>
> 2. That is true. But the client was not supposed to send requests to the
> broker during that period anyway. So detecting the broker failure later
> seems fine?
>
> 3. Wouldn't the number of connections in CLOSE_WAIT be the same as in the
> current state?
> Currently the broker will still mute the socket until it sends the response
> back. If the clients disconnect while they are being throttled, the closed
> socket will not be detected until the throttle time has passed.
>
> Jun also suggested bounding the time by metric.sample.window.ms in the
> ticket. I am not sure about the bound on throttle time. It seems a little
> difficult to come up with a good bound. If the bound is too large, it does
> not really help solve the various timeout issues we may face. If the bound
> is too low, the quota is essentially not honored. We may potentially treat
> different requests differently, but that seems too complicated and
> error-prone.
>
> IMO, the key improvement we want to make is to tell the clients how long
> they will be throttled, so the clients know what happened and can act
> accordingly instead of waiting naively. Muting the socket on the broker
> side is just a safeguard against non-cooperating clients. For the existing
> clients, it seems this does not have much impact compared with what we have
> now.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
>
>
> On Fri, Nov 3, 2017 at 3:09 PM, Rajini Sivaram <rajinisiva...@gmail.com>
> wrote:
>
> > Hi Becket,
> >
> > Thank you for the KIP. A few comments:
> >
> > 1. KIP says: "*No public interface changes are needed. We only propose
> > behavior change on the broker side.*"
> >
> > But from the proposed changes, it sounds like clients will be updated to
> > wait for the throttle time before sending the next request, and also not
> > handle idle disconnections during that time. Doesn't that mean that
> > clients need to know that the broker has sent the response before
> > throttling, requiring a protocol/version change?
> >
> >
> > 2. At the moment, broker failures are detected by clients (and vice
> > versa) within connections.max.idle.ms. By removing this check for an
> > unlimited throttle time, failure detection could be delayed.
> >
> >
> > 3. KIP says: "*Since this subsequent request is not actually handled
> > until the broker unmutes the channel, the client can hit
> > request.timeout.ms and reconnect. However, this is no worse than the
> > current state.*"
> >
> > I think this could be worse than the current state, because the broker
> > doesn't detect and close the channel for an unlimited throttle time while
> > new connections keep getting accepted. As a result, a lot of connections
> > could be in the CLOSE_WAIT state when the throttle time is high.
> >
> >
> > Perhaps it is better to combine this KIP with a bound on throttle time?
> >
> >
> > Regards,
> >
> >
> > Rajini
> >
> >
> > On Fri, Nov 3, 2017 at 8:09 PM, Becket Qin <becket....@gmail.com> wrote:
> >
> > > Thanks for the comment, Jun,
> > >
> > > 1. Yes, you are right. This could also happen with the current quota
> > > mechanism because we are essentially muting the socket during the
> > > throttle time. There might be two ways to solve this:
> > > A) use another socket to send heartbeats.
> > > B) let the GroupCoordinator know that the client will not heartbeat for
> > > some time.
> > > It seems the first solution is cleaner.
> > >
> > > 2. For the consumer, it seems returning an empty response is a better
> > > option. In the producer case, if there is a spike in traffic, the
> > > brokers will see queued-up requests, but that is hard to avoid unless
> > > we have a connection-level quota, which is a bigger change and may be
> > > easier to discuss in a separate KIP.
> > >
> > > Thanks,
> > >
> > > Jiangjie (Becket) Qin
> > >
> > >
> > > On Fri, Nov 3, 2017 at 10:28 AM, Jun Rao <j...@confluent.io> wrote:
> > >
> > > > Hi, Jiangjie,
> > > >
> > > > Thanks for bringing this up. A couple of quick thoughts.
> > > >
> > > > 1. If the throttle time is large, what can happen is that a consumer
> > > > won't be able to heartbeat to the group coordinator frequently
> > > > enough. In that case, even with this KIP, it seems there could be
> > > > frequent consumer group rebalances.
> > > >
> > > > 2. If we return a response immediately, for the consumer, do we
> > > > return the requested data or an empty response? If we do the former,
> > > > it may not protect against the case when there are multiple consumer
> > > > instances associated with the same user/clientid.
> > > >
> > > > Jun
> > > >
> > > > On Wed, Nov 1, 2017 at 9:53 AM, Becket Qin <becket....@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > We would like to start the discussion on KIP-219.
> > > > >
> > > > > The KIP tries to improve the communication of quota throttle time
> > > > > between brokers and clients, to avoid client timeouts in the case
> > > > > of a long throttle time.
> > > > >
> > > > > The KIP link is the following:
> > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-219+-+Improve+quota+communication
> > > > >
> > > > > Comments are welcome.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jiangjie (Becket) Qin
> > > > >
> > > >
> > >
> >
>
