I see a lot of great ideas that have been discussed or proposed in the past
to cover the most common candidate use cases for a rate limiter. Do folks
think we should file an official CEP and take it there?

Jaydeep

On Fri, Feb 2, 2024 at 8:30 AM Caleb Rackliffe <calebrackli...@gmail.com>
wrote:

> I just remembered the other day that I had done a quick writeup on the
> state of compaction stress-related throttling in the project:
>
>
> https://docs.google.com/document/d/1dfTEcKVidRKC1EWu3SO1kE1iVLMdaJ9uY1WMpS3P_hs/edit?usp=sharing
>
> I'm sure most of it is old news to the people on this thread, but I
> figured I'd post it just in case :)
>
> On Tue, Jan 30, 2024 at 11:58 AM Josh McKenzie <jmcken...@apache.org>
> wrote:
>
>> 2.) We should make sure the links between the "known" root causes of
>> cascading failures and the mechanisms we introduce to avoid them remain
>> very strong.
>>
>> Seems to me that our historical strategy was to address individual known
>> cases one-by-one rather than looking for a more holistic load-balancing and
>> load-shedding solution. While the engineer in me likes the elegance of a
>> broad, more-inclusive *actual SEDA-like* approach, the pragmatist in me
>> wonders how far we think we are today from a stable set-point.
>>
>> i.e. are we facing a handful of cases where nodes can still get pushed
>> over and then cascade that we can surgically address, or are we facing a
>> broader lack of back-pressure that rears its head in different domains
>> (client -> coordinator, coordinator -> replica, internode with other
>> operations, etc) at surprising times and should be considered more
>> holistically?
>>
>> On Tue, Jan 30, 2024, at 12:31 AM, Caleb Rackliffe wrote:
>>
>> I almost forgot CASSANDRA-15817, which introduced
>> reject_repair_compaction_threshold, a mechanism to stop repairs while
>> compaction is underwater.
>>
>> On Jan 26, 2024, at 6:22 PM, Caleb Rackliffe <calebrackli...@gmail.com>
>> wrote:
>>
>>
>> Hey all,
>>
>> I'm a bit late to the discussion. I see that we've already discussed
>> CASSANDRA-15013 <https://issues.apache.org/jira/browse/CASSANDRA-15013>
>>  and CASSANDRA-16663
>> <https://issues.apache.org/jira/browse/CASSANDRA-16663> at least in
>> passing. Having written the latter, I'd be the first to admit it's a crude
>> tool, although it's been useful here and there, and provides a couple of
>> primitives that may be useful for future work. As Scott mentions, while it
>> is configurable at runtime, it is not adaptive, although we did
>> make configuration easier in CASSANDRA-17423
>> <https://issues.apache.org/jira/browse/CASSANDRA-17423>. It is also
>> global to the node, although we've lightly discussed some ideas around
>> making it more granular. (For example, keyspace-based limiting, or limiting
>> "domains" tagged by the client in requests, could be interesting.) It also
>> does not deal with inter-node traffic, of course.
>>
>> Something we've not yet mentioned (that does address internode traffic)
>> is CASSANDRA-17324
>> <https://issues.apache.org/jira/browse/CASSANDRA-17324>, which I
>> proposed shortly after working on the native request limiter (and have just
>> not had much time to return to). The basic idea is this:
>>
>> When a node is struggling under the weight of a compaction backlog and
>> becomes a cause of increased read latency for clients, we have two safety
>> valves:
>>
>> 1.) Disabling the native protocol server, which stops the node from
>> coordinating reads and writes.
>> 2.) Jacking up the severity on the node, which tells the dynamic snitch
>> to avoid the node for reads from other coordinators.
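>> Concretely, the two valves look something like this today (the JMX bean
>> path for severity is from memory; treat it as illustrative and verify
>> against your version):

```shell
# Valve 1: stop this node from coordinating client reads and writes
nodetool disablebinary

# Valve 2: raise severity so other coordinators' dynamic snitches route
# reads away from this node. Severity is an attribute on the
# DynamicEndpointSnitch mbean, settable via any JMX client:
#   bean: org.apache.cassandra.db:type=DynamicEndpointSnitch
#   set Severity 5.0

# Once the compaction backlog clears:
nodetool enablebinary
```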
>>
>> These are useful, but we don’t appear to have any mechanism that would
>> allow us to temporarily reject internode hint, batch, and mutation messages
>> that could further delay resolution of the compaction backlog.
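>> As a minimal sketch of that missing valve (all names here are
>> hypothetical, not actual Cassandra identifiers): gate inbound write-path
>> verbs on a pending-compaction threshold, while leaving reads untouched.

```python
# Hypothetical sketch only: shed write-path internode messages while a
# node's compaction backlog is above a configured threshold. The verb
# names, field names, and threshold are illustrative, not Cassandra code.

REJECTABLE_VERBS = {"HINT", "BATCH_STORE", "MUTATION"}

class InternodeBackpressure:
    def __init__(self, pending_compaction_limit=100):
        self.pending_compaction_limit = pending_compaction_limit
        self.pending_compactions = 0  # would be fed by the compaction manager

    def should_reject(self, verb: str) -> bool:
        """Reject only write-path verbs, and only while underwater."""
        return (verb in REJECTABLE_VERBS
                and self.pending_compactions > self.pending_compaction_limit)

bp = InternodeBackpressure(pending_compaction_limit=100)
bp.pending_compactions = 250            # node is underwater
assert bp.should_reject("MUTATION")     # write-path traffic is shed
assert not bp.should_reject("READ")     # reads are untouched by this valve
bp.pending_compactions = 10             # backlog cleared
assert not bp.should_reject("MUTATION") # writes flow again
```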
>>
>>
>> Whether it's done as part of a larger framework or on its own, it still
>> feels like a good idea.
>>
>> Thinking in terms of opportunity costs here (i.e. where we spend our
>> finite engineering time to holistically improve the experience of operating
>> this database) is healthy, but we probably haven't reached the point of
>> diminishing returns on nodes being able to protect themselves from clients
>> and from other nodes. I would just keep in mind two things:
>>
>> 1.) The effectiveness of rate-limiting in the system (which includes the
>> database and all clients) as a whole necessarily decreases as we move from
>> the application to the lowest-level database internals. Limiting correctly
>> at the client will save more resources than limiting at the native protocol
>> server, and limiting correctly at the native protocol server will save more
>> resources than limiting after we've dispatched requests to some thread pool
>> for processing.
>> 2.) We should make sure the links between the "known" root causes of
>> cascading failures and the mechanisms we introduce to avoid them remain
>> very strong.
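>> To illustrate point 1, here's a toy client-side token bucket (a sketch,
>> not code from any driver): a request refused here never consumes a
>> socket, a native protocol thread, or a server-side queue slot at all.

```python
import time

class TokenBucket:
    """Toy client-side token bucket: refuse work before it ever reaches
    the server, which is cheaper than shedding it server-side."""
    def __init__(self, rate: float, burst: float):
        self.rate = rate          # tokens added per second
        self.capacity = burst     # maximum stored tokens
        self.tokens = burst
        self.last = time.monotonic()

    def try_acquire(self, n: float = 1.0) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

bucket = TokenBucket(rate=100.0, burst=10.0)
accepted = sum(bucket.try_acquire() for _ in range(50))
assert 10 <= accepted <= 12  # roughly the burst; the rest never hit the wire
```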
>>
>> In any case, I'd be happy to help out in any way I can as this moves
>> forward (especially as it relates to our past/current attempts to address
>> this problem space).
>>
>>
>>