Hey Anna, thanks for the KIP. Will exceeding this limit be treated as a type of quota violation, which should be retriable on the client side? In the EOS model before 2.6, the Streams client creates one producer for each input partition, so it is actually possible to create thousands of producers when the service comes up. I just want to clarify what behavior we should expect to see on the client side.
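To make the question concrete, here is a rough back-of-the-envelope sketch. It is not Kafka code: the class and method names are hypothetical, the delay formula is just one simple reading of "delay accepting a new connection by an amount of time that brings the rate within the limit", and the 1 second cap is taken from the per-IP proposal quoted below.

```java
public class ConnectionRateSketch {
    // Assumption: cap from the "min(delay calculated based on the rate quota, 1 second)" proposal.
    static final long MAX_DELAY_MS = 1000L;

    // Delay (ms) after which `count` connections, observed over `windowMs`,
    // no longer exceed `limitPerSec`. Returns 0 if the rate is already within the limit.
    static long uncappedDelayMs(long count, long windowMs, double limitPerSec) {
        double neededMs = count / limitPerSec * 1000.0;
        return Math.max(0L, (long) Math.ceil(neededMs - windowMs));
    }

    // Same delay, capped at MAX_DELAY_MS as in the per-IP proposal.
    static long cappedDelayMs(long count, long windowMs, double limitPerSec) {
        return Math.min(uncappedDelayMs(count, windowMs, limitPerSec), MAX_DELAY_MS);
    }

    // Lower bound (seconds) for `producers` clients to all connect to one broker
    // when that broker accepts at most `limitPerSec` new connections per second.
    static double secondsToConnectAll(long producers, double limitPerSec) {
        return producers / limitPerSec;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: 3000 producers (one per input partition), limit of 100 connections/sec.
        System.out.println("seconds to connect all: " + secondsToConnectAll(3000, 100.0));
        System.out.println("capped delay (ms): " + cappedDelayMs(50, 1000, 10.0));
    }
}
```

With numbers like these, a pre-2.6 EOS Streams application would need a noticeable amount of time just to get all of its producers connected, which is why it matters whether the client sees a retriable error, a silent delay, or a dropped connection.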
On Mon, May 18, 2020 at 12:04 PM Anna Povzner <a...@confluent.io> wrote:

> Hi Alexandre,
>
> Thanks for your comments. My answers are below:
>
> 900. The KIP does not propose any new metrics because we already have
> metrics that will let us monitor connection attempts and the amount of
> time the broker delays accepting new connections:
> 1. We have a per-listener (and per-processor) metric for connection
> creation rate:
>
> kafka.server:type=socket-server-metrics,listener={listener_name},networkProcessor={#},name=connection-creation-rate
>
> 2. We have per-listener metrics that track the amount of time Acceptor is
> blocked from accepting connections:
>
> kafka.network:type=Acceptor,name=AcceptorBlockedPercent,listener={listener_name}
>
> Note that adding per IP JMX metrics may end up adding a lot of overhead,
> especially for clusters with a large number of clients and many different
> IP addresses. If we ever want to add the metric, perhaps we could propose
> a separate KIP, but that would require some more evaluation of potential
> overhead.
>
> 901. Yes, I updated the wiki with the approach for enforcing per IP limits
> (not dropping right away), as I described in my response to Rajini.
>
> 902. Any additional stress testing is always super useful. I am going to
> have PR with the first half of the KIP ready soon (broker-wide and
> per-listener limits). Perhaps it could be worthwhile to see if it makes
> sense to add stress testing to muckrake tests. Also, check out connection
> stress workloads in Trogdor and whether they are sufficient or could be
> extended:
>
> https://github.com/apache/kafka/tree/trunk/tools/src/main/java/org/apache/kafka/trogdor/workload
>
> Regards,
> Anna
>
> On Mon, May 18, 2020 at 8:57 AM Rajini Sivaram <rajinisiva...@gmail.com> wrote:
>
> > Hi Anna,
> >
> > Thanks for the response, sounds good.
> >
> > Regards,
> >
> > Rajini
> >
> >
> > On Sun, May 17, 2020 at 1:38 AM Anna Povzner <a...@confluent.io> wrote:
> >
> > > Hi Rajini,
> > >
> > > Thanks for reviewing the KIP!
> > >
> > > I agree with your suggestion to make per-IP connection rate quota a
> > > dynamic quota for entity name IP. This will allow configuring
> > > connection rate for a particular IP as well. I updated the wiki
> > > accordingly.
> > >
> > > Your second concern makes sense -- rejecting the connection right away
> > > will likely cause a new connection from the same client. I am concerned
> > > about delaying new connections for processing later, because if the
> > > connections keep coming with the high rate, there may be potentially a
> > > large backlog and connections may start timing out before the broker
> > > gets to processing them. For example, if clients come through proxy,
> > > there may be potentially a large number of incoming connections with
> > > the same IP.
> > >
> > > What do you think about the following option:
> > > * Once per-IP connection rate reaches the limit, accept or drop (clean
> > > up) the connection after a delay depending on whether the quota is
> > > still violated. We could re-use the mechanism implemented with KIP-306
> > > where the broker delays the response for failed client authentication.
> > > The delay will be set to min(delay calculated based on the rate quota,
> > > 1 second), which matches the max delay for request quota.
> > >
> > > I think this option is somewhat your suggestion with delaying accepting
> > > per IP connections that reached the rate limit, but with protection in
> > > place to make sure the number of delayed connections does not blow up.
> > > What do you think?
> > >
> > > Thanks,
> > > Anna
> > >
> > > On Sat, May 16, 2020 at 1:09 AM Alexandre Dupriez <alexandre.dupr...@gmail.com> wrote:
> > >
> > > > Hi Anna,
> > > >
> > > > Thank you for your answers and explanations.
> > > >
> > > > A couple of additional comments:
> > > >
> > > > 900. KIP-612 does not intend to dedicate a metric to the throttling
> > > > of incoming connections. I wonder if such a metric would be handy for
> > > > monitoring and help set-up metric-based alarming if one wishes to
> > > > capture this type of incident?
> > > >
> > > > 901. Following-up on Rajini's point 2 above - from my understanding,
> > > > this new quota should prevent excess CPU consumption in
> > > > Processor#run() method when a new connection has been accepted.
> > > > Through the throttling in place, connections will be delayed as
> > > > indicated by the KIP's specifications:
> > > >
> > > > " If connection creation rate on the broker exceeds the broker-wide
> > > > limit, the broker will delay accepting a new connection by an amount
> > > > of time that brings the rate within the limit."
> > > >
> > > > You may be referring to the following sentence:
> > > >
> > > > "A new broker configuration option will be added to limit the rate at
> > > > which connections will be accepted for each IP address. New
> > > > connections for the IP will be dropped once the limit is reached."?
> > > >
> > > > 902. It may be interesting to capture the data with and without
> > > > connection throttling under stress scenarios. You may have these data
> > > > already. If you need a pair of hands to do some stress tests once you
> > > > have a POC or a PR, I am happy to contribute :)
> > > >
> > > > Thanks,
> > > > Alexandre
> > > >
> > > > On Fri, May 15, 2020 at 12:22, Rajini Sivaram <rajinisiva...@gmail.com> wrote:
> > > >
> > > > > Hi Anna,
> > > > >
> > > > > Thanks for the KIP, looks good overall. A couple of comments about
> > > > > per-IP connection quotas:
> > > > >
> > > > > 1) Should we consider making per-IP quota similar to other quotas?
> > > > > Configured as a dynamic quota for entity type IP, with per-IP limit
> > > > > as well as defaults? Perhaps that would fit better rather than
> > > > > configs?
> > > > >
> > > > > 2) The current proposal drops connections after accepting
> > > > > connections for per-IP limit. We do this in other cases too, but in
> > > > > this case, should we throttle instead? My point is what is the
> > > > > quota protecting? If we want to limit rate of accepted connections,
> > > > > then accepting a connection and then dropping doesn't really help
> > > > > since that IP is going to reconnect. If we want to rate limit what
> > > > > happens next, i.e. authentication, then throttling the accepted
> > > > > connection so its processing is delayed would perhaps be better?
> > > > >
> > > > > Regards,
> > > > >
> > > > > Rajini
> > > > >
> > > > > On Thu, May 14, 2020 at 4:12 PM David Jacot <dja...@confluent.io> wrote:
> > > > >
> > > > > > Hi Anna,
> > > > > >
> > > > > > Thanks for your answers and the updated KIP. Looks good to me!
> > > > > >
> > > > > > Best,
> > > > > > David
> > > > > >
> > > > > > On Thu, May 14, 2020 at 12:54 AM Anna Povzner <a...@confluent.io> wrote:
> > > > > >
> > > > > > > I updated the KIP to add a new broker configuration to limit
> > > > > > > connection creation rate per IP:
> > > > > > > max.connection.creation.rate.per.ip. Once the limit is reached
> > > > > > > for a particular IP address, the broker will reject the
> > > > > > > connection from that IP (close the connection it accepted) and
> > > > > > > continue rejecting them until the rate is back within the rate
> > > > > > > limit.
> > > > > > >
> > > > > > > On Wed, May 13, 2020 at 11:46 AM Anna Povzner <a...@confluent.io> wrote:
> > > > > > >
> > > > > > > > Hi David and Alexandre,
> > > > > > > >
> > > > > > > > Thanks so much for your feedback! Here are my answers:
> > > > > > > >
> > > > > > > > 1. Yes, we have seen several cases of clients that create a
> > > > > > > > new connection per produce/consume request. One hypothesis is
> > > > > > > > someone who is used to connection pooling may accidentally
> > > > > > > > write a Kafka client that creates a new connection every
> > > > > > > > time.
> > > > > > > >
> > > > > > > > 2 & 4. That's a good point I haven't considered. I think it
> > > > > > > > makes sense to provide an ability to limit connection
> > > > > > > > creations per IP as well. This is not hard to implement --
> > > > > > > > the broker already keeps track of the number of connections
> > > > > > > > per IP, and immediately closes a new connection if it comes
> > > > > > > > from an IP that reached the connection limit. So, we could
> > > > > > > > additionally track the rate, and close the connection from IP
> > > > > > > > that exceeds the rate. One slight concern is whether keeping
> > > > > > > > track of per IP rates and quotas adds overhead (CPU and
> > > > > > > > memory). But perhaps it is not a problem if we use expiring
> > > > > > > > sensors.
> > > > > > > >
> > > > > > > > It would still make sense to limit the overall connection
> > > > > > > > creation rate for the Kafka clusters which are shared among
> > > > > > > > many different applications/clients, since they may spike at
> > > > > > > > the same time bringing the total rate too high.
> > > > > > > >
> > > > > > > > 3. Controlling connection queue sizes only controls the share
> > > > > > > > of time network threads use for creating new connections (and
> > > > > > > > accepting on Acceptor thread) vs. doing other work on each
> > > > > > > > Processor iteration. It does not directly control how
> > > > > > > > processing connection creations would be related to other
> > > > > > > > processing done by brokers like on request handler threads.
> > > > > > > > So, while controlling queue size may mitigate the issue for
> > > > > > > > some of the workloads, it does not guarantee that. Plus, if
> > > > > > > > we want to limit how many connections are created per IP, the
> > > > > > > > queue size approach would not work, unless we go with a
> > > > > > > > "share" of the queue, which I think even further obscures
> > > > > > > > what that setting means (and what we would achieve as an end
> > > > > > > > result). Does this answer the question?
> > > > > > > >
> > > > > > > > If there are no objections, I will update the KIP to add per
> > > > > > > > IP connection rate limits (config and enforcement).
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > > Anna
> > > > > > > >
> > > > > > > >
> > > > > > > > On Tue, May 12, 2020 at 11:25 AM Alexandre Dupriez <alexandre.dupr...@gmail.com> wrote:
> > > > > > > >
> > > > > > > >> Hello,
> > > > > > > >>
> > > > > > > >> Thank you for the KIP.
> > > > > > > >>
> > > > > > > >> I experienced in the past genuine broker brownouts due to
> > > > > > > >> connection storms consuming most of the CPU available on the
> > > > > > > >> server and I think it is useful to protect against it.
> > > > > > > >>
> > > > > > > >> I tend to share the questions asked in points 2 and 4 from
> > > > > > > >> David. Is there still a risk of denial of service if the
> > > > > > > >> limit applies at the listener-level without differentiating
> > > > > > > >> between (an) “offending” client(s) and the others?
> > > > > > > >>
> > > > > > > >> To rebound on point 3 - conceptually one difference between
> > > > > > > >> capping the queue size or throttling as presented in the KIP
> > > > > > > >> would come from the time it takes to accept a connection and
> > > > > > > >> how that time evolves with the connection rate.
> > > > > > > >> Assuming that that time increases monotonically with
> > > > > > > >> resource utilization, the admissible rate of connections
> > > > > > > >> would decrease as the server becomes more loaded, if the
> > > > > > > >> limit was set on queue size.
> > > > > > > >>
> > > > > > > >> Thanks,
> > > > > > > >> Alexandre
> > > > > > > >>
> > > > > > > >> On Tue, May 12, 2020 at 08:49, David Jacot <dja...@confluent.io> wrote:
> > > > > > > >> >
> > > > > > > >> > Hi Anna,
> > > > > > > >> >
> > > > > > > >> > Thanks for the KIP! I have few questions:
> > > > > > > >> >
> > > > > > > >> > 1. You mention that some clients may create a new
> > > > > > > >> > connections for each requests: "Another example is clients
> > > > > > > >> > that create a new connection for each produce/consume
> > > > > > > >> > request". I am curious here but do we know any clients
> > > > > > > >> > behaving like this?
> > > > > > > >> >
> > > > > > > >> > 2. I am a bit concerned by the impact of misbehaving
> > > > > > > >> > clients on the other ones. Let's say that we define a
> > > > > > > >> > quota of 10 connections / sec for a broker and that we
> > > > > > > >> > have a misbehaving application constantly trying to create
> > > > > > > >> > 20 connections on that broker. That application will
> > > > > > > >> > constantly hit the quota and always have many pending
> > > > > > > >> > connections in the queue waiting to be accepted. Regular
> > > > > > > >> > clients trying to connect would need to wait until all the
> > > > > > > >> > pending connections upfront in the queue are drained in
> > > > > > > >> > the best case scenario or won't be able to connect at all
> > > > > > > >> > in the worst case scenario if the queue is full.
> > > > > > > >> > Does it sound like a valid concern? How do you see this?
> > > > > > > >> >
> > > > > > > >> > 3. As you mention it in the KIP, we use bounded queues
> > > > > > > >> > which already limit the maximum number of connections that
> > > > > > > >> > can be accepted. I wonder if we could reach the same goal
> > > > > > > >> > by making the size of the queue configurable.
> > > > > > > >> >
> > > > > > > >> > 4. Did you consider doing something similar to the
> > > > > > > >> > connections quota which limits the number of connections
> > > > > > > >> > per IP? Instead of rate limiting all the creation, we
> > > > > > > >> > could perhaps rate limit the number of creation per IP as
> > > > > > > >> > well. That could perhaps reduce the effect on the other
> > > > > > > >> > clients. That may be harder to implement though.
> > > > > > > >> >
> > > > > > > >> > Best,
> > > > > > > >> > David
> > > > > > > >> >
> > > > > > > >> > On Mon, May 11, 2020 at 7:58 PM Anna Povzner <a...@confluent.io> wrote:
> > > > > > > >> >
> > > > > > > >> > > Hi,
> > > > > > > >> > >
> > > > > > > >> > > I just created KIP-612 to allow limiting connection
> > > > > > > >> > > creation rate on brokers, and would like to start a
> > > > > > > >> > > discussion.
> > > > > > > >> > >
> > > > > > > >> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-612%3A+Ability+to+Limit+Connection+Creation+Rate+on+Brokers
> > > > > > > >> > >
> > > > > > > >> > > Feedback and suggestions are welcome!
> > > > > > > >> > >
> > > > > > > >> > > Thanks,
> > > > > > > >> > > Anna