Thanks everyone for the feedback. I will start a voting thread tomorrow morning if there are no more comments.
Regards,
Anna

On Mon, May 18, 2020 at 2:06 PM Anna Povzner <a...@confluent.io> wrote:

Hi Boyang,

This KIP does not change the protocol with clients. The behavior is the same as with KIP-402, where the broker delays accepting new connections when the limit on the number of connections is reached. This KIP adds another reason for the delay (when the rate limit is reached). It is similar to dropping a connection when the per-IP limit is reached, except that this KIP delays the response or may still accept the connection. A client may time out waiting for the connection and retry.

Thanks,
Anna

On Mon, May 18, 2020 at 12:54 PM Boyang Chen <reluctanthero...@gmail.com> wrote:

Hey Anna,

Thanks for the KIP. Will this change be applied as one type of quota violation, which should be retriable on the client side? For the EOS model before 2.6, the Streams client creates one producer for each input partition, so it is actually possible to create thousands of producers when the service is up. I just want to clarify what the expected behavior seen on the client side would be.

On Mon, May 18, 2020 at 12:04 PM Anna Povzner <a...@confluent.io> wrote:

Hi Alexandre,

Thanks for your comments. My answers are below:

900. The KIP does not propose any new metrics because we already have metrics that let us monitor connection attempts and the amount of time the broker delays accepting new connections:
1. We have a per-listener (and per-processor) metric for connection creation rate:
kafka.server:type=socket-server-metrics,listener={listener_name},networkProcessor={#},name=connection-creation-rate
2. We have per-listener metrics that track the amount of time the Acceptor is blocked from accepting connections:
kafka.network:type=Acceptor,name=AcceptorBlockedPercent,listener={listener_name}
Note that adding per-IP JMX metrics may end up adding a lot of overhead, especially for clusters with a large number of clients and many different IP addresses. If we ever want to add the metric, perhaps we could propose a separate KIP, but that would require some more evaluation of the potential overhead.

901. Yes, I updated the wiki with the approach for enforcing per-IP limits (not dropping right away), as I described in my response to Rajini.

902. Any additional stress testing is always super useful. I am going to have a PR with the first half of the KIP ready soon (broker-wide and per-listener limits). Perhaps it could be worthwhile to see if it makes sense to add stress testing to muckrake tests. Also, check out the connection stress workloads in Trogdor and whether they are sufficient or could be extended:
https://github.com/apache/kafka/tree/trunk/tools/src/main/java/org/apache/kafka/trogdor/workload

Regards,
Anna
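For reference, a minimal sketch of how the two existing metrics above might be read over JMX for ad-hoc monitoring. The JMX port (9999), the listener name (PLAINTEXT), the processor id (0), and the OneMinuteRate attribute on the Acceptor meter are assumptions that depend on the broker's configuration and metrics reporter; they are not specified by the KIP:

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class ConnectionRateMetrics {
        public static void main(String[] args) throws Exception {
            // Assumes the broker exposes JMX on localhost:9999 and has a listener named PLAINTEXT.
            JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbsc = connector.getMBeanServerConnection();

                // Per-processor connection creation rate (processor id 0 here).
                ObjectName creationRate = new ObjectName(
                    "kafka.server:type=socket-server-metrics,listener=PLAINTEXT,networkProcessor=0");
                System.out.println("connection-creation-rate = "
                    + mbsc.getAttribute(creationRate, "connection-creation-rate"));

                // Share of time the Acceptor is blocked from accepting connections.
                ObjectName acceptorBlocked = new ObjectName(
                    "kafka.network:type=Acceptor,name=AcceptorBlockedPercent,listener=PLAINTEXT");
                // Attribute name is an assumption; Yammer meters typically expose OneMinuteRate.
                System.out.println("AcceptorBlockedPercent (1-min rate) = "
                    + mbsc.getAttribute(acceptorBlocked, "OneMinuteRate"));
            } finally {
                connector.close();
            }
        }
    }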
On Mon, May 18, 2020 at 8:57 AM Rajini Sivaram <rajinisiva...@gmail.com> wrote:

Hi Anna,

Thanks for the response, sounds good.

Regards,

Rajini

On Sun, May 17, 2020 at 1:38 AM Anna Povzner <a...@confluent.io> wrote:

Hi Rajini,

Thanks for reviewing the KIP!

I agree with your suggestion to make the per-IP connection rate quota a dynamic quota for entity type IP. This will also allow configuring the connection rate for a particular IP. I updated the wiki accordingly.

Your second concern makes sense -- rejecting the connection right away will likely cause a new connection from the same client. I am concerned about delaying new connections for processing later, because if the connections keep coming at a high rate, there may be a large backlog and connections may start timing out before the broker gets to processing them. For example, if clients come through a proxy, there may be a large number of incoming connections with the same IP.

What do you think about the following option:
* Once the per-IP connection rate reaches the limit, accept or drop (clean up) the connection after a delay, depending on whether the quota is still violated. We could re-use the mechanism implemented with KIP-306, where the broker delays the response for failed client authentication. The delay will be set to min(delay calculated based on the rate quota, 1 second), which matches the max delay for the request quota.

I think this option is close to your suggestion of delaying accepting per-IP connections that have reached the rate limit, but with protection in place to make sure the number of delayed connections does not blow up. What do you think?

Thanks,
Anna
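For illustration only, a small sketch of the delay calculation described above -- min(delay derived from the rate quota, 1 second) -- using a simple proportional formula and hypothetical names; this is not the KIP's actual implementation:

    public final class ConnectionRateThrottle {
        private static final long MAX_THROTTLE_MS = 1000; // matches the max delay for the request quota

        /**
         * Returns how long to delay the next connection so that the observed rate
         * comes back within the quota, capped at MAX_THROTTLE_MS.
         *
         * @param observedRate connections per second measured over the sample window
         * @param quota        allowed connections per second
         * @param windowMs     length of the measurement window in milliseconds
         */
        public static long throttleTimeMs(double observedRate, double quota, long windowMs) {
            if (observedRate <= quota) {
                return 0;
            }
            // Delay proportional to how far the rate is above the quota over the window,
            // i.e. roughly the time needed for the average rate to fall back to the quota.
            double delayMs = (observedRate - quota) / quota * windowMs;
            return Math.min((long) delayMs, MAX_THROTTLE_MS);
        }

        public static void main(String[] args) {
            // Example: quota of 10 connections/sec, observed 25/sec over a 1-second window.
            System.out.println(throttleTimeMs(25.0, 10.0, 1000)); // prints 1000 (capped)
            System.out.println(throttleTimeMs(12.0, 10.0, 1000)); // prints 200
        }
    }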
On Sat, May 16, 2020 at 1:09 AM Alexandre Dupriez <alexandre.dupr...@gmail.com> wrote:

Hi Anna,

Thank you for your answers and explanations.

A couple of additional comments:

900. KIP-612 does not intend to dedicate a metric to the throttling of incoming connections. I wonder if such a metric would be handy for monitoring and would help set up metric-based alarming if one wishes to capture this type of incident?

901. Following up on Rajini's point 2 above - from my understanding, this new quota should prevent excess CPU consumption in the Processor#run() method when a new connection has been accepted. Through the throttling in place, connections will be delayed as indicated by the KIP's specification:

"If connection creation rate on the broker exceeds the broker-wide limit, the broker will delay accepting a new connection by an amount of time that brings the rate within the limit."

You may be referring to the following sentence:

"A new broker configuration option will be added to limit the rate at which connections will be accepted for each IP address. New connections for the IP will be dropped once the limit is reached."?

902. It may be interesting to capture the data with and without connection throttling under stress scenarios. You may have these data already. If you need a pair of hands to do some stress tests once you have a POC or a PR, I am happy to contribute :)

Thanks,
Alexandre

On Fri, May 15, 2020 at 12:22, Rajini Sivaram <rajinisiva...@gmail.com> wrote:

Hi Anna,

Thanks for the KIP, looks good overall. A couple of comments about per-IP connection quotas:

1) Should we consider making the per-IP quota similar to other quotas? Configured as a dynamic quota for entity type IP, with per-IP limits as well as defaults? Perhaps that would fit better than configs?

2) The current proposal drops connections after accepting connections over the per-IP limit. We do this in other cases too, but in this case, should we throttle instead? My point is: what is the quota protecting? If we want to limit the rate of accepted connections, then accepting a connection and then dropping it doesn't really help, since that IP is going to reconnect. If we want to rate-limit what happens next, i.e. authentication, then throttling the accepted connection so its processing is delayed would perhaps be better?

Regards,

Rajini

On Thu, May 14, 2020 at 4:12 PM David Jacot <dja...@confluent.io> wrote:

Hi Anna,

Thanks for your answers and the updated KIP. Looks good to me!

Best,
David

On Thu, May 14, 2020 at 12:54 AM Anna Povzner <a...@confluent.io> wrote:

I updated the KIP to add a new broker configuration to limit the connection creation rate per IP: max.connection.creation.rate.per.ip. Once the limit is reached for a particular IP address, the broker will reject connections from that IP (close the connection it accepted) and continue rejecting them until the rate is back within the limit.
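As a rough sketch of the behavior described in this (earlier) version of the proposal -- rejecting, i.e. closing, a just-accepted connection while the per-IP rate is above max.connection.creation.rate.per.ip -- with hypothetical class names and a crude fixed-window rate. Note that the later messages above replace rejection with a bounded delay:

    import java.net.InetAddress;
    import java.net.InetSocketAddress;
    import java.nio.channels.SocketChannel;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    /**
     * Illustrative accept-path check: close a newly accepted connection when the
     * per-IP creation rate exceeds the configured limit. Names and structure are
     * hypothetical, not the KIP's implementation.
     */
    public class PerIpConnectionRateLimiter {
        private final double maxRatePerIp; // e.g. max.connection.creation.rate.per.ip
        private final Map<InetAddress, SimpleRate> rates = new ConcurrentHashMap<>();

        public PerIpConnectionRateLimiter(double maxRatePerIp) {
            this.maxRatePerIp = maxRatePerIp;
        }

        /** Returns true if the connection is kept, false if it was rejected and closed. */
        public boolean maybeAccept(SocketChannel channel) throws java.io.IOException {
            InetAddress ip = ((InetSocketAddress) channel.getRemoteAddress()).getAddress();
            SimpleRate rate = rates.computeIfAbsent(ip, k -> new SimpleRate());
            if (rate.record(System.currentTimeMillis()) > maxRatePerIp) {
                channel.close(); // reject: close the connection that was just accepted
                return false;
            }
            return true;
        }

        /** Crude fixed one-second window counter, purely for illustration. */
        private static final class SimpleRate {
            private long windowStartMs;
            private int count;

            synchronized double record(long nowMs) {
                if (nowMs - windowStartMs >= 1000) {
                    windowStartMs = nowMs;
                    count = 0;
                }
                count++;
                return count; // connections observed in the current one-second window
            }
        }
    }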
On Wed, May 13, 2020 at 11:46 AM Anna Povzner <a...@confluent.io> wrote:

Hi David and Alexandre,

Thanks so much for your feedback! Here are my answers:

1. Yes, we have seen several cases of clients that create a new connection per produce/consume request. One hypothesis is that someone who is used to connection pooling may accidentally write a Kafka client that creates a new connection every time.

2 & 4. That's a good point I hadn't considered. I think it makes sense to provide the ability to limit connection creation per IP as well. This is not hard to implement -- the broker already keeps track of the number of connections per IP, and immediately closes a new connection if it comes from an IP that has reached the connection limit. So, we could additionally track the rate, and close connections from an IP that exceeds the rate. One slight concern is whether keeping track of per-IP rates and quotas adds overhead (CPU and memory). But perhaps it is not a problem if we use expiring sensors.

It would still make sense to limit the overall connection creation rate for Kafka clusters that are shared among many different applications/clients, since they may spike at the same time, bringing the total rate too high.

3. Controlling connection queue sizes only controls the share of time network threads use for creating new connections (and accepting on the Acceptor thread) vs. doing other work on each Processor iteration. It does not directly control how processing connection creations relates to other processing done by brokers, such as on request handler threads. So, while controlling queue size may mitigate the issue for some workloads, it does not guarantee that. Plus, if we want to limit how many connections are created per IP, the queue size approach would not work, unless we go with a "share" of the queue, which I think further obscures what that setting means (and what we would achieve as an end result). Does this answer the question?

If there are no objections, I will update the KIP to add per-IP connection rate limits (config and enforcement).

Thanks,

Anna
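A minimal sketch of the "expiring sensors" idea mentioned above, using the org.apache.kafka.common.metrics package that the broker already uses for quotas: one rate sensor per IP with an upper-bound quota, expiring after a period of inactivity so that idle IPs do not accumulate memory. The sensor name, window settings, expiration time, and error handling are assumptions, and sensor expiration only runs if the Metrics instance was created with expiration enabled:

    import java.util.Collections;
    import java.util.concurrent.TimeUnit;
    import org.apache.kafka.common.metrics.MetricConfig;
    import org.apache.kafka.common.metrics.Metrics;
    import org.apache.kafka.common.metrics.Quota;
    import org.apache.kafka.common.metrics.QuotaViolationException;
    import org.apache.kafka.common.metrics.Sensor;
    import org.apache.kafka.common.metrics.stats.Rate;

    public class PerIpRateSensors {
        private final Metrics metrics;
        private final double maxRatePerIp;

        public PerIpRateSensors(Metrics metrics, double maxRatePerIp) {
            this.metrics = metrics;
            this.maxRatePerIp = maxRatePerIp;
        }

        /** Records one connection attempt for the IP; returns false if the rate quota is violated. */
        public boolean recordConnection(String ip) {
            Sensor sensor = metrics.getSensor("connection-creation-rate-" + ip);
            if (sensor == null) {
                // Sensor is garbage-collected after 3600 seconds of inactivity (assumed value).
                sensor = metrics.sensor(
                    "connection-creation-rate-" + ip,
                    new MetricConfig()
                        .timeWindow(1, TimeUnit.SECONDS)
                        .samples(10)
                        .quota(Quota.upperBound(maxRatePerIp)),
                    3600);
                sensor.add(
                    metrics.metricName("connection-creation-rate", "per-ip-metrics",
                        "Connections accepted per second", Collections.singletonMap("ip", ip)),
                    new Rate());
            }
            try {
                sensor.record();
                return true;
            } catch (QuotaViolationException e) {
                return false; // over the per-IP rate; caller decides whether to delay or close
            }
        }
    }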
On Tue, May 12, 2020 at 11:25 AM Alexandre Dupriez <alexandre.dupr...@gmail.com> wrote:

Hello,

Thank you for the KIP.

I have experienced genuine broker brownouts in the past due to connection storms consuming most of the CPU available on the server, and I think it is useful to protect against them.

I tend to share the questions asked in points 2 and 4 from David. Is there still a risk of denial of service if the limit applies at the listener level without differentiating between the "offending" client(s) and the others?

To rebound on point 3 - conceptually, one difference between capping the queue size and throttling as presented in the KIP would come from the time it takes to accept a connection and how that time evolves with the connection rate. Assuming that that time increases monotonically with resource utilization, the admissible rate of connections would decrease as the server becomes more loaded if the limit were set on queue size.

Thanks,
Alexandre

On Tue, May 12, 2020 at 08:49, David Jacot <dja...@confluent.io> wrote:

Hi Anna,

Thanks for the KIP! I have a few questions:

1. You mention that some clients may create a new connection for each request: "Another example is clients that create a new connection for each produce/consume request". I am curious here, but do we know of any clients behaving like this?

2. I am a bit concerned by the impact of misbehaving clients on the other ones. Let's say that we define a quota of 10 connections/sec for a broker and that we have a misbehaving application constantly trying to create 20 connections per second on that broker. That application will constantly hit the quota and always have many pending connections in the queue waiting to be accepted. Regular clients trying to connect would need to wait until all the pending connections ahead of them in the queue are drained in the best case, or won't be able to connect at all in the worst case if the queue is full. Does it sound like a valid concern? How do you see this?

3. As you mention in the KIP, we use bounded queues which already limit the maximum number of connections that can be accepted. I wonder if we could reach the same goal by making the size of the queue configurable.

4. Did you consider doing something similar to the connection quota which limits the number of connections per IP? Instead of rate-limiting all connection creation, we could perhaps rate-limit the number of creations per IP as well. That could perhaps reduce the effect on the other clients. That may be harder to implement though.

Best,
David
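To make the concern in point 2 concrete, a small worked example using David's hypothetical rates (10 connections/sec quota, 20 connections/sec attempted) and an assumed accept-queue capacity of 50:

    public class BacklogGrowth {
        public static void main(String[] args) {
            double acceptRate = 10.0;   // broker-wide quota, connections accepted per second
            double arrivalRate = 20.0;  // misbehaving client's sustained attempts per second
            int queueCapacity = 50;     // assumed bounded accept queue size

            double growthPerSec = arrivalRate - acceptRate; // net backlog growth: 10 conn/sec
            double secondsUntilFull = queueCapacity / growthPerSec;

            // With these numbers the queue fills in 5 seconds; after that, new connections
            // (including those from well-behaved clients) cannot be queued at all, and any
            // queued client waits behind up to 50 pending connections.
            System.out.printf("Queue fills in %.1f seconds; max wait behind a full queue: %.1f seconds%n",
                secondsUntilFull, queueCapacity / acceptRate);
        }
    }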
On Mon, May 11, 2020 at 7:58 PM Anna Povzner <a...@confluent.io> wrote:

Hi,

I just created KIP-612 to allow limiting the connection creation rate on brokers, and would like to start a discussion.

https://cwiki.apache.org/confluence/display/KAFKA/KIP-612%3A+Ability+to+Limit+Connection+Creation+Rate+on+Brokers

Feedback and suggestions are welcome!

Thanks,
Anna