> 2.) We should make sure the links between the "known" root causes of 
> cascading failures and the mechanisms we introduce to avoid them remain very 
> strong.
Seems to me that our historical strategy was to address individual known cases 
one-by-one rather than looking for a more holistic load-balancing and 
load-shedding solution. While the engineer in me likes the elegance of a broad, 
more-inclusive *actual SEDA-like* approach, the pragmatist in me wonders how 
far we think we are today from a stable set-point. 

In other words: are we facing a handful of cases, ones we can address 
surgically, where nodes can still get pushed over and then cascade? Or are we 
facing a broader lack of back-pressure that rears its head in different domains 
(client -> coordinator, coordinator -> replica, internode with other 
operations, etc.) at surprising times and should be considered more 
holistically?
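
To make the second option a bit more concrete, here is a rough Python sketch of 
what a single node-local admission gate shared across domains might look like. 
Everything in it is hypothetical: `AdmissionGate`, the domain names, and the 
thresholds are made up for illustration and do not correspond to any existing 
Cassandra class or setting. It only shows the shape of the idea: a per-domain 
in-flight limit, plus shedding of deferrable internode work (hints, batches, 
mutations) while the compaction backlog is above a threshold.

```python
from dataclasses import dataclass, field


@dataclass
class AdmissionGate:
    """Hypothetical node-local gate; not an existing Cassandra API."""

    # Per-domain cap on concurrently admitted requests (assumed names/values).
    max_in_flight: dict
    # Backlog size above which deferrable internode work is rejected.
    max_pending_compactions: int
    # Domains considered safe to shed while compaction is behind.
    sheddable: frozenset = frozenset({"hint", "batch", "mutation"})
    in_flight: dict = field(default_factory=dict)

    def try_admit(self, domain: str, pending_compactions: int) -> bool:
        # Shed deferrable internode work while the backlog is too large.
        if (domain in self.sheddable
                and pending_compactions > self.max_pending_compactions):
            return False
        # Otherwise apply the per-domain in-flight limit as back-pressure.
        # Domains with no configured limit are rejected (limit defaults to 0).
        current = self.in_flight.get(domain, 0)
        if current >= self.max_in_flight.get(domain, 0):
            return False
        self.in_flight[domain] = current + 1
        return True

    def release(self, domain: str) -> None:
        # Call when an admitted request completes, successfully or not.
        self.in_flight[domain] = max(0, self.in_flight.get(domain, 0) - 1)


gate = AdmissionGate(
    max_in_flight={"client_read": 2, "mutation": 1},
    max_pending_compactions=100,
)
if gate.try_admit("mutation", pending_compactions=500):
    try:
        pass  # apply the mutation
    finally:
        gate.release("mutation")
else:
    pass  # reject, so the sender can back off or retry elsewhere
```

The point of the sketch is only that every domain goes through the same 
decision point, so a new overload case becomes a new signal or limit rather 
than a new one-off mechanism.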

On Tue, Jan 30, 2024, at 12:31 AM, Caleb Rackliffe wrote:
> I almost forgot CASSANDRA-15817, which introduced 
> reject_repair_compaction_threshold, which provides a mechanism to stop 
> repairs while compaction is underwater.
> 
>> On Jan 26, 2024, at 6:22 PM, Caleb Rackliffe <calebrackli...@gmail.com> 
>> wrote:
>> 
>> Hey all,
>> 
>> I'm a bit late to the discussion. I see that we've already discussed 
>> CASSANDRA-15013 <https://issues.apache.org/jira/browse/CASSANDRA-15013> and 
>> CASSANDRA-16663 <https://issues.apache.org/jira/browse/CASSANDRA-16663> at 
>> least in passing. Having written the latter, I'd be the first to admit it's 
>> a crude tool, although it's been useful here and there, and provides a 
>> couple primitives that may be useful for future work. As Scott mentions, 
>> while it is configurable at runtime, it is not adaptive, although we did 
>> make configuration easier in CASSANDRA-17423 
>> <https://issues.apache.org/jira/browse/CASSANDRA-17423>. It also is global 
>> to the node, although we've lightly discussed some ideas around making it 
>> more granular. (For example, keyspace-based limiting, or limiting "domains" 
>> tagged by the client in requests, could be interesting.) It also does not 
>> deal with inter-node traffic, of course.
>> 
>> Something we've not yet mentioned (that does address internode traffic) is 
>> CASSANDRA-17324 <https://issues.apache.org/jira/browse/CASSANDRA-17324>, 
>> which I proposed shortly after working on the native request limiter (and 
>> have just not had much time to return to). The basic idea is this:
>> 
>>> When a node is struggling under the weight of a compaction backlog and 
>>> becomes a cause of increased read latency for clients, we have two safety 
>>> valves:
>>> 
>>> 1.) Disabling the native protocol server, which stops the node from 
>>> coordinating reads and writes.
>>> 2.) Jacking up the severity on the node, which tells the dynamic snitch to 
>>> avoid the node for reads from other coordinators.
>>> 
>>> These are useful, but we don’t appear to have any mechanism that would 
>>> allow us to temporarily reject internode hint, batch, and mutation messages 
>>> that could further delay resolution of the compaction backlog.
>>> 
>> 
>> Whether it's done as part of a larger framework or on its own, it still 
>> feels like a good idea.
>> 
>> Thinking in terms of opportunity costs here (i.e. where we spend our finite 
>> engineering time to holistically improve the experience of operating this 
>> database) is healthy, but we probably haven't reached the point of 
>> diminishing returns on nodes being able to protect themselves from clients 
>> and from other nodes. I would just keep in mind two things:
>> 
>> 1.) The effectiveness of rate-limiting in the system (which includes the 
>> database and all clients) as a whole necessarily decreases as we move from 
>> the application to the lowest-level database internals. Limiting correctly 
>> at the client will save more resources than limiting at the native protocol 
>> server, and limiting correctly at the native protocol server will save more 
>> resources than limiting after we've dispatched requests to some thread pool 
>> for processing.
>> 2.) We should make sure the links between the "known" root causes of 
>> cascading failures and the mechanisms we introduce to avoid them remain very 
>> strong.
>> 
>> In any case, I'd be happy to help out in any way I can as this moves forward 
>> (especially as it relates to our past/current attempts to address this 
>> problem space).
