Assuming the intent was to migrate the Google Doc to the CEP, I took another look. I think there are some ambitious ideas here, and I appreciate any effort to improve Cassandra's stability. I think CASSANDRA-19534 <https://issues.apache.org/jira/browse/CASSANDRA-19534> was a massive step in the right direction, and a bit of what's in the doc complements the work that Alex did.
I think there's a case to be made here for less is more. The doc lists several signals the system could use to apply a rate limiter, but I think the strongest are the queue depth of the individual thread pools (NTR, Mutation, Read) and the rate of timeouts. I'd like to suggest the first iteration only look at those things, rather than trying to utilize system metrics. You can have high CPU usage with a low queue depth, and you can have low CPU usage with a deep queue. In the former case there's no need to throttle; the system is just busy. In the latter, there is. This is a common issue with, say, LWTs, where the system hasn't been configured with a high enough concurrent_writes.

I also think the CEP should be simplified further, by enforcing queue depth limits instead of a rate limit. Pending requests piling up has been a huge source of system instability, which is why CASSANDRA-19534 <https://issues.apache.org/jira/browse/CASSANDRA-19534> is such a significant improvement - we're able to shed timed-out requests much more aggressively than before, which leads to faster recovery times. So my suggestion is to simplify things, where we:

* use the rate of timeouts to limit the depth of the queues for each of the thread pools
* reject requests with an OverloadedException when a queue is full (a rough sketch of this shape follows below)

Ideally the drivers would back off and rate limit, because when you start rejecting requests that retry, you increase load on other parts of the system, which affects stability as well.

If you want to follow this up with the ability to dynamically resize thread pools, that could be interesting. I think that would be the right time to look at system resources, because if you have 100K pending reads and you're at 20% CPU w/ low disk latency, you can probably increase concurrent_reads.

Jon
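To make the shape of that suggestion concrete, here is a minimal standalone sketch of a per-pool queue-depth limiter. Everything in it is an illustrative assumption rather than actual Cassandra internals or anything specified in CASSANDRA-19534 or the CEP: the class and method names are hypothetical, and the AIMD-style rule and 5% timeout threshold are made-up tuning values.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch only; not Cassandra's actual thread pool machinery.
public final class QueueDepthLimiter
{
    public static final class OverloadedException extends RuntimeException
    {
        OverloadedException(String message) { super(message); }
    }

    // Semaphore.reducePermits() is protected, so expose it via a subclass.
    private static final class ResizableSemaphore extends Semaphore
    {
        ResizableSemaphore(int permits) { super(permits); }
        void reduce(int permits) { reducePermits(permits); }
    }

    private final int minDepth;
    private final int maxDepth;
    private volatile int depthLimit;
    private final ResizableSemaphore slots;

    private final AtomicLong completed = new AtomicLong();
    private final AtomicLong timedOut = new AtomicLong();

    public QueueDepthLimiter(int minDepth, int maxDepth)
    {
        this.minDepth = minDepth;
        this.maxDepth = maxDepth;
        this.depthLimit = maxDepth;
        this.slots = new ResizableSemaphore(maxDepth);
    }

    // Called before enqueueing work on the pool; sheds load up front
    // instead of letting requests pile up and time out in the queue.
    public void admit()
    {
        if (!slots.tryAcquire())
            throw new OverloadedException("queue full at depth limit " + depthLimit);
    }

    // Called when a request finishes; timedOutServerSide marks requests
    // that blew their deadline while queued or executing.
    public void release(boolean timedOutServerSide)
    {
        slots.release();
        completed.incrementAndGet();
        if (timedOutServerSide)
            timedOut.incrementAndGet();
    }

    // Run periodically from a single timer thread (e.g. once a second).
    // AIMD-style: halve the depth limit when more than 5% of completions
    // timed out, otherwise creep back toward the configured maximum.
    public void adjust()
    {
        long done = completed.getAndSet(0);
        long late = timedOut.getAndSet(0);
        double timeoutRate = done == 0 ? 0.0 : (double) late / done;

        int target = timeoutRate > 0.05
                   ? Math.max(minDepth, depthLimit / 2)
                   : Math.min(maxDepth, depthLimit + Math.max(1, maxDepth / 100));

        int delta = target - depthLimit;
        if (delta > 0)
            slots.release(delta);   // grow: add permits
        else if (delta < 0)
            slots.reduce(-delta);   // shrink: permits may go briefly negative
        depthLimit = target;
    }
}
```

Under these assumptions, a coordinator pool would call admit() before enqueueing, release() as requests complete, and adjust() on a timer; drivers that recognize OverloadedException could then back off client-side instead of retrying against an already saturated node.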
On Thu, Sep 19, 2024 at 2:38 PM Benedict Elliott Smith <bened...@apache.org> wrote:

> I just want to flag here that this is a topic I have strong opinions on, but the CEP is not really specific or detailed enough to understand precisely how it will be implemented. So, if a patch is already being produced, most of my feedback is likely to be provided some time after a patch appears, through the normal review process. I want to flag this now to avoid any surprise.
>
> I will say upfront that, ideally, this system should be designed to have ~zero overhead when disabled, and with minimal coupling (between its own components and C* itself), so that entirely orthogonal approaches can be integrated in future without polluting the codebase.
>
> On 19 Sep 2024, at 19:14, Patrick McFadin <pmcfa...@gmail.com> wrote:
>
> The work has begun but we don't have a VOTE thread for this CEP. Can one get started?
>
> On Mon, May 6, 2024 at 9:24 PM Jaydeep Chovatia <chovatia.jayd...@gmail.com> wrote:
>
>> Sure, Caleb. I will include the work as part of CASSANDRA-19534 <https://issues.apache.org/jira/browse/CASSANDRA-19534> in the CEP-41.
>>
>> Jaydeep
>>
>> On Fri, May 3, 2024 at 7:48 AM Caleb Rackliffe <calebrackli...@gmail.com> wrote:
>>
>>> FYI, there is some ongoing sort-of-related work going on in CASSANDRA-19534 <https://issues.apache.org/jira/browse/CASSANDRA-19534>
>>>
>>> On Wed, Apr 10, 2024 at 6:35 PM Jaydeep Chovatia <chovatia.jayd...@gmail.com> wrote:
>>>
>>>> Just created an official CEP-41 <https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-41+%28DRAFT%29+Apache+Cassandra+Unified+Rate+Limiter> incorporating the feedback from this discussion. Feel free to let me know if I may have missed some important feedback in this thread that is not captured in the CEP-41.
>>>>
>>>> Jaydeep
>>>>
>>>> On Thu, Feb 22, 2024 at 11:36 AM Jaydeep Chovatia <chovatia.jayd...@gmail.com> wrote:
>>>>
>>>>> Thanks, Josh. I will file an official CEP with all the details in a few days and update this thread with that CEP number. Thanks a lot, everyone, for providing valuable insights!
>>>>>
>>>>> Jaydeep
>>>>>
>>>>> On Thu, Feb 22, 2024 at 9:24 AM Josh McKenzie <jmcken...@apache.org> wrote:
>>>>>
>>>>>> Do folks think we should file an official CEP and take it there?
>>>>>>
>>>>>> +1 here.
>>>>>>
>>>>>> Synthesizing your gdoc, Caleb's work, and the feedback from this thread into a draft seems like a solid next step.
>>>>>>
>>>>>> On Wed, Feb 7, 2024, at 12:31 PM, Jaydeep Chovatia wrote:
>>>>>>
>>>>>> I see a lot of great ideas being discussed or proposed in the past to cover the most common rate limiter candidate use cases. Do folks think we should file an official CEP and take it there?
>>>>>>
>>>>>> Jaydeep
>>>>>>
>>>>>> On Fri, Feb 2, 2024 at 8:30 AM Caleb Rackliffe <calebrackli...@gmail.com> wrote:
>>>>>>
>>>>>> I just remembered the other day that I had done a quick writeup on the state of compaction stress-related throttling in the project:
>>>>>>
>>>>>> https://docs.google.com/document/d/1dfTEcKVidRKC1EWu3SO1kE1iVLMdaJ9uY1WMpS3P_hs/edit?usp=sharing
>>>>>>
>>>>>> I'm sure most of it is old news to the people on this thread, but I figured I'd post it just in case :)
>>>>>>
>>>>>> On Tue, Jan 30, 2024 at 11:58 AM Josh McKenzie <jmcken...@apache.org> wrote:
>>>>>>
>>>>>> 2.) We should make sure the links between the "known" root causes of cascading failures and the mechanisms we introduce to avoid them remain very strong.
>>>>>>
>>>>>> Seems to me that our historical strategy was to address individual known cases one-by-one rather than looking for a more holistic load-balancing and load-shedding solution. While the engineer in me likes the elegance of a broad, more-inclusive *actual SEDA-like* approach, the pragmatist in me wonders how far we think we are today from a stable set-point.
>>>>>>
>>>>>> i.e. are we facing a handful of cases where nodes can still get pushed over and then cascade that we can surgically address, or are we facing a broader lack of back-pressure that rears its head in different domains (client -> coordinator, coordinator -> replica, internode with other operations, etc.) at surprising times and should be considered more holistically?
>>>>>>
>>>>>> On Tue, Jan 30, 2024, at 12:31 AM, Caleb Rackliffe wrote:
>>>>>>
>>>>>> I almost forgot CASSANDRA-15817, which introduced reject_repair_compaction_threshold, which provides a mechanism to stop repairs while compaction is underwater.
>>>>>>
>>>>>> On Jan 26, 2024, at 6:22 PM, Caleb Rackliffe <calebrackli...@gmail.com> wrote:
>>>>>>
>>>>>> Hey all,
>>>>>>
>>>>>> I'm a bit late to the discussion. I see that we've already discussed CASSANDRA-15013 <https://issues.apache.org/jira/browse/CASSANDRA-15013> and CASSANDRA-16663 <https://issues.apache.org/jira/browse/CASSANDRA-16663> at least in passing. Having written the latter, I'd be the first to admit it's a crude tool, although it's been useful here and there, and provides a couple primitives that may be useful for future work. As Scott mentions, while it is configurable at runtime, it is not adaptive, although we did make configuration easier in CASSANDRA-17423 <https://issues.apache.org/jira/browse/CASSANDRA-17423>. It also is global to the node, although we've lightly discussed some ideas around making it more granular. (For example, keyspace-based limiting, or limiting "domains" tagged by the client in requests, could be interesting.) It also does not deal with inter-node traffic, of course.
>>>>>>
>>>>>> Something we've not yet mentioned (that does address internode traffic) is CASSANDRA-17324 <https://issues.apache.org/jira/browse/CASSANDRA-17324>, which I proposed shortly after working on the native request limiter (and have just not had much time to return to). The basic idea is this:
>>>>>>
>>>>>> When a node is struggling under the weight of a compaction backlog and becomes a cause of increased read latency for clients, we have two safety valves:
>>>>>>
>>>>>> 1.) Disabling the native protocol server, which stops the node from coordinating reads and writes.
>>>>>> 2.) Jacking up the severity on the node, which tells the dynamic snitch to avoid the node for reads from other coordinators.
>>>>>>
>>>>>> These are useful, but we don't appear to have any mechanism that would allow us to temporarily reject internode hint, batch, and mutation messages that could further delay resolution of the compaction backlog.
>>>>>>
>>>>>> Whether it's done as part of a larger framework or on its own, it still feels like a good idea.
>>>>>>
>>>>>> Thinking in terms of opportunity costs here (i.e. where we spend our finite engineering time to holistically improve the experience of operating this database) is healthy, but we probably haven't reached the point of diminishing returns on nodes being able to protect themselves from clients and from other nodes. I would just keep in mind two things:
>>>>>>
>>>>>> 1.) The effectiveness of rate-limiting in the system (which includes the database and all clients) as a whole necessarily decreases as we move from the application to the lowest-level database internals. Limiting correctly at the client will save more resources than limiting at the native protocol server, and limiting correctly at the native protocol server will save more resources than limiting after we've dispatched requests to some thread pool for processing.
>>>>>> 2.) We should make sure the links between the "known" root causes of cascading failures and the mechanisms we introduce to avoid them remain very strong.
>>>>>>
>>>>>> In any case, I'd be happy to help out in any way I can as this moves forward (especially as it relates to our past/current attempts to address this problem space).