Firstly, glad to see the support and enthusiasm here and in the recent Slack discussion. I think there is enough for me to start drafting a CEP.
Stefan, global configuration and capabilities overlap, but not fully. For example, you may want to set globally that a cluster enables feature X, or control the threshold for a guardrail, but you still need to know whether all nodes support feature X or have that guardrail; the latter is what capabilities target. I do think capabilities are a step towards supporting global configuration, and the work you described is another step (one we could do after capabilities, or in parallel with them in mind). I am also supportive of exploring global configuration for the reasons you mentioned. In terms of how capabilities get propagated across the cluster, I haven't put much thought into it yet beyond likely using TCM, since this will be a new feature that lands after TCM. In Riak we had gossip (more mature than C*'s -- this was an area I contributed to a lot, so I'm very familiar with it) to disseminate less critical information such as capabilities, and a separate layer that did TCM. Since we don't have this in C*, I don't think we would want to build a separate distribution channel for capabilities metadata when we already have TCM in place. But I plan to explore this more as I draft the CEP.

Jordan

On Thu, Dec 19, 2024 at 1:48 PM Štefan Miklošovič <smikloso...@apache.org> wrote:
> Hi Jordan,
>
> what would this look like from the implementation perspective? I was
> experimenting with transactional guardrails where an operator would control
> the content of a virtual table which would be backed by TCM, so whatever
> guardrail we changed would be automatically and transparently propagated to
> every node in the cluster. The POC worked quite nicely. TCM is just a
> vehicle to commit a change which would spread around, and all these
> settings would survive restarts. We would have the same configuration
> everywhere, which is not currently the case because guardrails are
> configured per node and, if not persisted to yaml, their values would be
> forgotten on restart.
>
> Guardrails are just an example; what is quite obvious is to expand this
> idea to the whole configuration in yaml. Of course, not all properties in
> yaml make sense to be the same cluster-wide (ip addresses etc ...), but the
> ones which do would again be set everywhere the same way.
>
> The approach I described above is that we make sure the configuration is
> the same everywhere, hence there can be no misunderstanding about what
> features this or that node has: if we say that all nodes have to have a
> particular feature because we said so in the TCM log, then on restart /
> replay a node will "catch up" with whatever features it is asked to turn
> on.
>
> Your approach seems to be that we distribute which capabilities / features
> a cluster supports and that each individual node configures itself in some
> way or not to comply?
>
> Is there any intersection in these approaches? At first sight it seems
> somehow related. How is one different from the other from your point of
> view?
>
> Regards
>
> (1) https://issues.apache.org/jira/browse/CASSANDRA-19593
>
> On Thu, Dec 19, 2024 at 12:00 AM Jordan West <jw...@apache.org> wrote:
>
>> In a recent discussion on the pains of upgrading, one topic that came up
>> is a feature that Riak had called Capabilities [1]. A major pain with
>> upgrades is that each node independently decides when to start using new
>> or modified functionality.
>> Even when we put this behind a config (like storage compatibility mode),
>> each node immediately enables the feature when the config is changed and
>> the node is restarted. This causes various types of upgrade pain, such as
>> failed streams and schema disagreement. A recent example of this is
>> CASSANDRA-20118 [2]. In some cases operators can prevent this from
>> happening through careful coordination (e.g. ensuring upgradesstables only
>> runs after the whole cluster is upgraded), but that typically requires
>> custom code in whatever control plane the operator is using. A
>> capabilities framework would distribute the state of which features each
>> node has (and their status, e.g. enabled or not) so that the cluster can
>> choose to opt in to new features once the whole cluster has them
>> available. From experience, having this in Riak made upgrades a
>> significantly less risky process and also paved a path towards repeatable
>> downgrades. I think Cassandra would benefit from it as well.
>>
>> Further, other tools like analytics could benefit from having this
>> information, since currently it's up to the operator to manually determine
>> the state of the cluster in some cases.
>>
>> I am considering drafting a CEP proposal for this feature but wanted to
>> take the general temperature of the community and get some early thoughts
>> while working on the draft.
>>
>> Looking forward to hearing y'all's thoughts,
>> Jordan
>>
>> [1]
>> https://github.com/basho/riak_core/blob/25d9a6fa917eb8a2e95795d64eb88d7ad384ed88/src/riak_core_capability.erl#L23-L72
>>
>> [2] https://issues.apache.org/jira/browse/CASSANDRA-20118
>
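To make the capabilities idea a bit more concrete, below is a minimal, self-contained sketch of the cluster-wide opt-in check described in the original message. It is purely illustrative: the names (CapabilityRegistry, Capability, advertise, clusterSupports) and the listed capabilities are hypothetical, not existing Cassandra or TCM APIs, and in the actual proposal the per-node state would be disseminated cluster-wide (likely via TCM) rather than registered locally.

// Purely illustrative sketch -- names and capabilities are hypothetical,
// not actual Cassandra APIs.
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class CapabilityRegistry
{
    // Features a node can advertise; the specific entries here are made up.
    public enum Capability { NEW_SSTABLE_FORMAT, NEW_STREAMING_PROTOCOL }

    // Per-node view of advertised capabilities. In the proposal this state
    // would be disseminated via TCM rather than populated locally.
    private final Map<String, Set<Capability>> byNode = new ConcurrentHashMap<>();

    // Record the capabilities a node advertises (e.g. on join, upgrade, or restart).
    public void advertise(String nodeId, Set<Capability> caps)
    {
        byNode.put(nodeId, Set.copyOf(caps));
    }

    // A feature is only safe to enable cluster-wide once every known node
    // advertises it: the "opt in once the whole cluster has it available" check.
    public boolean clusterSupports(Capability cap)
    {
        return !byNode.isEmpty() && byNode.values().stream().allMatch(caps -> caps.contains(cap));
    }

    public static void main(String[] args)
    {
        CapabilityRegistry registry = new CapabilityRegistry();
        registry.advertise("node1", EnumSet.of(Capability.NEW_SSTABLE_FORMAT));
        registry.advertise("node2", EnumSet.of(Capability.NEW_SSTABLE_FORMAT,
                                               Capability.NEW_STREAMING_PROTOCOL));

        // Safe: every node advertises the new sstable format.
        System.out.println(registry.clusterSupports(Capability.NEW_SSTABLE_FORMAT));     // true
        // Not yet safe: node1 has not advertised the new streaming protocol.
        System.out.println(registry.clusterSupports(Capability.NEW_STREAMING_PROTOCOL)); // false
    }
}

A real registry would also need to track per-feature status (advertised vs. enabled) and handle nodes joining and leaving, but the all-nodes check above is the core of the cluster-wide opt-in described in the original message.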