Firstly, glad to see the support and enthusiasm here and in the recent Slack discussion. I think there is enough for me to start drafting a CEP.
Stefan, global configuration and capabilities overlap, but not fully. For example, you may want to set globally that a cluster enables feature X, or control the threshold for a guardrail, but you still need to know whether all nodes support feature X or have that guardrail; the latter is what capabilities target. I do think capabilities are a step towards supporting global configuration, and the work you described is another step (one we could do after capabilities, or in parallel with them in mind). I am also supportive of exploring global configuration for the reasons you mentioned. In terms of how capabilities get propagated across the cluster, I haven't put much thought into it yet beyond likely using TCM, since this will be a new feature that lands after TCM. In Riak we had gossip (more mature than C*'s -- this was an area I contributed to a lot, so I'm very familiar with it) to disseminate less critical information such as capabilities, and a separate layer that did TCM. Since we don't have this in C*, I don't think we would want to build a separate distribution channel for capabilities metadata when we already have TCM in place. But I plan to explore this more as I draft the CEP.

Jordan

On Thu, Dec 19, 2024 at 1:48 PM Štefan Miklošovič <smikloso...@apache.org> wrote:
> Hi Jordan,
>
> what would this look like from the implementation perspective? I was
> experimenting with transactional guardrails where an operator would control
> the content of a virtual table which would be backed by TCM, so whatever
> guardrail we changed would be automatically and transparently propagated to
> every node in the cluster. The POC worked quite nicely. TCM is just a
> vehicle to commit a change which would spread around, and all these
> settings would survive restarts. We would have the same configuration
> everywhere, which is not currently the case because guardrails are
> configured per node and, if not persisted to yaml, their values would be
> forgotten on restart.
>
> Guardrails are just an example; what is quite obvious is to expand this
> idea to the whole configuration in yaml. Of course, not all properties in
> yaml make sense to be the same cluster-wide (ip addresses etc ...), but the
> ones which do would again be set everywhere the same way.
>
> The approach I described above is that we make sure the configuration is
> the same everywhere, hence there can be no misunderstanding about what
> features this or that node has: if we say that all nodes have to have a
> particular feature because we said so in the TCM log, then on restart /
> replay a node will "catch up" with whatever features it is asked to turn
> on.
>
> Your approach seems to be that we distribute which capabilities / features
> a cluster supports and that each individual node configures itself in some
> way or not to comply?
>
> Is there any intersection in these approaches? At first sight it seems
> somehow related. How is one different from the other from your point of
> view?
>
> Regards
>
> (1) https://issues.apache.org/jira/browse/CASSANDRA-19593
>
> On Thu, Dec 19, 2024 at 12:00 AM Jordan West <jw...@apache.org> wrote:
>
>> In a recent discussion on the pains of upgrading, one topic that came up
>> is a feature that Riak had called Capabilities [1]. A major pain with
>> upgrades is that each node independently decides when to start using new
>> or modified functionality.
>> Even when we put this behind a config (like storage compatibility mode),
>> each node immediately enables the feature when the config is changed and
>> the node is restarted. This causes various types of upgrade pain, such as
>> failed streams and schema disagreement. A recent example of this is
>> CASSANDRA-20118 [2]. In some cases operators can prevent this from
>> happening through careful coordination (e.g. ensuring upgradesstables only
>> runs after the whole cluster is upgraded), but that typically requires
>> custom code in whatever control plane the operator is using. A
>> capabilities framework would distribute the state of which features each
>> node has (and their status, e.g. enabled or not) so that the cluster can
>> choose to opt in to new features once the whole cluster has them
>> available. From experience, having this in Riak made upgrades a
>> significantly less risky process and also paved a path towards repeatable
>> downgrades. I think Cassandra would benefit from it as well.
>>
>> Further, other tools like analytics could benefit from having this
>> information, since currently it's up to the operator to manually determine
>> the state of the cluster in some cases.
>>
>> I am considering drafting a CEP proposal for this feature but wanted to
>> take the general temperature of the community and get some early thoughts
>> while working on the draft.
>>
>> Looking forward to hearing y'all's thoughts,
>> Jordan
>>
>> [1]
>> https://github.com/basho/riak_core/blob/25d9a6fa917eb8a2e95795d64eb88d7ad384ed88/src/riak_core_capability.erl#L23-L72
>>
>> [2] https://issues.apache.org/jira/browse/CASSANDRA-20118
>
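To make the capabilities idea a bit more concrete, below is a minimal, self-contained sketch of the cluster-wide opt-in check described in the original message. It is purely illustrative: the names (CapabilityRegistry, Capability, advertise, clusterSupports) and the listed capabilities are hypothetical, not existing Cassandra or TCM APIs, and in the actual proposal the per-node state would be disseminated cluster-wide (likely via TCM) rather than registered locally.

// Purely illustrative sketch -- names and capabilities are hypothetical,
// not actual Cassandra APIs.
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class CapabilityRegistry
{
    // Features a node can advertise; the specific entries here are made up.
    public enum Capability { NEW_SSTABLE_FORMAT, NEW_STREAMING_PROTOCOL }

    // Per-node view of advertised capabilities. In the proposal this state
    // would be disseminated via TCM rather than populated locally.
    private final Map<String, Set<Capability>> byNode = new ConcurrentHashMap<>();

    // Record the capabilities a node advertises (e.g. on join, upgrade, or restart).
    public void advertise(String nodeId, Set<Capability> caps)
    {
        byNode.put(nodeId, Set.copyOf(caps));
    }

    // A feature is only safe to enable cluster-wide once every known node
    // advertises it: the "opt in once the whole cluster has it available" check.
    public boolean clusterSupports(Capability cap)
    {
        return !byNode.isEmpty() && byNode.values().stream().allMatch(caps -> caps.contains(cap));
    }

    public static void main(String[] args)
    {
        CapabilityRegistry registry = new CapabilityRegistry();
        registry.advertise("node1", EnumSet.of(Capability.NEW_SSTABLE_FORMAT));
        registry.advertise("node2", EnumSet.of(Capability.NEW_SSTABLE_FORMAT,
                                               Capability.NEW_STREAMING_PROTOCOL));

        // Safe: every node advertises the new sstable format.
        System.out.println(registry.clusterSupports(Capability.NEW_SSTABLE_FORMAT));     // true
        // Not yet safe: node1 has not advertised the new streaming protocol.
        System.out.println(registry.clusterSupports(Capability.NEW_STREAMING_PROTOCOL)); // false
    }
}

A real registry would also need to track per-feature status (advertised vs. enabled) and handle nodes joining and leaving, but the all-nodes check above is the core of the cluster-wide opt-in described in the original message.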