Hi Jaydeep,

That seems quite interesting. A couple of points, though:

1) It would be nice if there were a way to "subscribe" to the decisions your
detection framework comes up with. Integration with, e.g., the diagnostics
subsystem would be beneficial. This should be pluggable: I would just code up
an interface to dump / react to the decisions however I want. This might also
act as a notifier to other systems, e-mail, Slack channels ...
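
To make this a bit more concrete, below is a rough sketch of the kind of
plug-in point I have in mind. Everything in it (DecisionBus, Decision,
DecisionListener, the field names) is made up for illustration; none of it
is an existing Cassandra API, and the real thing would presumably hook into
whatever the detection framework already produces:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical sketch: none of these types exist in Cassandra today.
class DecisionBus {

    // Illustrative decision event; the fields are assumptions.
    static final class Decision {
        final String signal;  // e.g. "cpu" or "thread-pool"
        final String action;  // e.g. "THROTTLE" or "REJECT"
        final String reason;

        Decision(String signal, String action, String reason) {
            this.signal = signal;
            this.action = action;
            this.reason = reason;
        }
    }

    // The pluggable part: implement this to log, e-mail, post to Slack, ...
    interface DecisionListener {
        void onDecision(Decision decision);
    }

    private final List<DecisionListener> listeners = new CopyOnWriteArrayList<>();

    void subscribe(DecisionListener listener) {
        listeners.add(listener);
    }

    // Returns how many listeners handled the decision without throwing.
    int publish(Decision decision) {
        int delivered = 0;
        for (DecisionListener listener : listeners) {
            try {
                listener.onDecision(decision);
                delivered++;
            } catch (RuntimeException e) {
                // A broken subscriber must not disturb the detection path.
            }
        }
        return delivered;
    }

    public static void main(String[] args) {
        DecisionBus bus = new DecisionBus();
        List<String> log = new ArrayList<>();
        bus.subscribe(d -> log.add(d.signal + " -> " + d.action + " (" + d.reason + ")"));
        bus.publish(new Decision("cpu", "THROTTLE", "cpu above threshold"));
        System.out.println(log);
    }
}
```

The point of swallowing listener exceptions in publish() is that a slow or
broken notifier should never feed back into the rate limiter itself.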

2) Have you tried to integrate this with the Guardrails framework? I
think that if something is detected to be throttled or rejected (e.g.
writing to a table), there might be a guardrail which would be triggered
dynamically at runtime. Guardrails are useful as such, but here we might
reuse them so we do not need to code it twice.

3) I am curious how complex this detection framework would be; it can get
complicated pretty fast, I guess. What would be desirable is to act on it in
such a way that you do not put that node under even more pressure. In
other words, your detection system should work in such a way that there
is no "doom loop" whereby, merely by throttling various parts of
Cassandra, you make things even worse for other nodes in the cluster. For
example, if a particular node starts to be overwhelmed and you detect this
and requests start to be rejected, is it not possible that the Java driver
would start to see this node as "erroneous", with delayed response times etc.,
and would start to prefer other nodes in the cluster when deciding which
node to contact for query coordination? So you would put more load on the
other nodes, making them more susceptible to being throttled as well ...

Regards

Stefan Miklosovic

On Tue, Jan 16, 2024 at 6:41 PM Jaydeep Chovatia <chovatia.jayd...@gmail.com>
wrote:

> Hi,
>
> Happy New Year!
>
> I would like to discuss the following idea:
>
> Open-source Cassandra (CASSANDRA-15013
> <https://issues.apache.org/jira/browse/CASSANDRA-15013>) has an
> elementary built-in memory rate limiter based on the incoming payload from
> user requests. This rate limiter activates if any incoming user request’s
> payload exceeds certain thresholds. However, the existing rate limiter only
> solves limited-scope issues. Cassandra's server-side meltdown due to
> overload is a known problem. Often we see that a couple of busy nodes take
> down the entire Cassandra ring due to the ripple effect. The following
> document proposes a generic purpose comprehensive rate limiter that works
> considering system signals, such as CPU, and internal signals, such as
> thread pools. The rate limiter will have knobs to filter out internal
> traffic, system traffic, replication traffic, and furthermore based on the
> types of queries.
>
> More design details are in this doc: [OSS] Cassandra Generic Purpose Rate
> Limiter - Google Docs
> <https://docs.google.com/document/d/1w-A3fnoeBS6tS1ffBda_R0QR90olzFoMqLE7znFEUrQ/edit>
>
> Please let me know your thoughts.
>
> Jaydeep
>
