Very well written, David. Good points. I am happy we are talking about all of this stuff and where the discussion is going. TCM or not, I find it important that we finally go through it all, regardless of what we eventually end up with.
I would like to add a few points, especially to your last two paragraphs, in the context of guardrails and TCM, as that is what I was pushing before (maybe some of that is relevant to capabilities too). You are right about your YouFancyHuh example. In guardrails, that would translate to this:

1) a 2-node cluster, both on version X
2) we upgrade node 1 to version X + 1, where we introduce a guardrail Y for the first time
3) when node 1 starts for the first time on X + 1, it gets TCM at some epoch, looks into it and sees: "huh, no config for Y yet? Let's add that."
4) step 3) results in a new TCM commit, which creates a new epoch, E + 1
5) epoch E + 1 gets propagated to node 2, which is still on version X, where guardrail Y does not exist yet - now what?

In this case, it would need to be coded up in such a way that if node 2 does not recognise what is pushed to it, it will just skip it. I would just ignore it. It would not fail on reading what Y is; it would simply not act on it at all. When node 2 is upgraded to X + 1 as well and is restarted, it would replay the TCM log up to E + 1. But now it knows how to act on Y, so it would apply that configuration to itself. There is a similar situation when we are going to deprecate / remove a guardrail / configuration, e.g. Y. What does it mean that we "removed functionality Y"? It just means that Y in TCM would not be acted on. The configuration for Y would still stay in TCM; it is just that nodes would be indifferent to it.

As for your point about overriding the configuration, I can speak only for guardrails here. Guardrails are configurable via JMX too, so you can override them at runtime. There would be two ways to configure guardrails: 1) via a virtual table, backed by TCM, as shown in CASSANDRA-19593, or 2) via JMX. While 1) would be cluster-wide, if you change a guardrail via 2), there is no need to commit to TCM at all.
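Going back to the upgrade walkthrough in steps 1)-5) above, the skip-unknown behaviour could look roughly like this (a minimal sketch, not Cassandra code - class and method names are purely illustrative):

```python
# Sketch of a node applying TCM commits: unknown guardrail keys are
# recorded in the log but not acted on, and are picked up later when
# the node replays the log after an upgrade.

class Node:
    def __init__(self, known_guardrails):
        self.known_guardrails = set(known_guardrails)
        self.applied = {}   # guardrail -> value the node actually acts on
        self.log = []       # full TCM log, kept even for unknown keys

    def receive_commit(self, epoch, guardrail, value):
        # Always record the commit so the local log stays complete...
        self.log.append((epoch, guardrail, value))
        # ...but only act on guardrails this version understands.
        if guardrail in self.known_guardrails:
            self.applied[guardrail] = value

    def upgrade_and_replay(self, new_guardrails):
        # After upgrading, replay the whole log; entries that were
        # previously skipped are now applied.
        self.known_guardrails |= set(new_guardrails)
        for epoch, guardrail, value in self.log:
            if guardrail in self.known_guardrails:
                self.applied[guardrail] = value


node2 = Node(known_guardrails=[])                      # still on version X
node2.receive_commit(epoch=2, guardrail="Y", value=10) # epoch E + 1 arrives
assert "Y" not in node2.applied                        # skipped, not failed

node2.upgrade_and_replay(new_guardrails=["Y"])         # now on X + 1
assert node2.applied["Y"] == 10                        # applied during replay
```

The key design point is that skipping is lossless: the commit stays in the log, so no out-of-band catch-up mechanism is needed once the node understands the key.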
Committing would be done only in the case where you interact with it via a vtable. JMX's way of configuring it would just override that locally, without committing. Now the interesting corner case is: what should happen when I have diverged from the global configuration for a guardrail Y via JMX and there is a new epoch E + 2 where that particular guardrail changed its value? I think that we could either ignore that or forcibly set it to what E + 2 has.

On Mon, Jan 6, 2025 at 11:32 PM David Capwell <dcapw...@apple.com> wrote: > Stefan, global configuration and capabilities do have some overlap but not > full overlap. For example, you may want to set globally that a cluster > enables feature X or control the threshold for a guardrail but you still > need to know if all nodes support feature X or have that guardrail, the > latter is what capabilities targets. I do think capabilities are a step > towards supporting global configuration and the work you described is > another step (that we could do after capabilities or in parallel with them > in mind). I am also supportive of exploring global configuration for the > reasons you mentioned. > > > I personally find this distinction really important when thinking about > this thread… > > A ticket on my plate is to have all Accord messages use their own > serialization version rather than rely on messaging’s version. This adds > the following problem to this ticket, “what versions do each node support”, > which to me feels like a capability. Just because a node supports V2 > (doesn't exist at the moment) doesn’t mean that V2 is enabled, or that it's > safe to enable across the cluster… it just means a node supports V2 (or has > this capability). > > In this example then I need to also answer how we allow v2 for the > cluster… is this an atomic all-at-once action, or is it staged (few nodes > at a time)? 
I kinda feel that staging is best as there could be a cluster > outage by enabling, so you want to limit and roll out as you see it's > working and safe… so maybe a local config per node and not in TCM, but the > fact v2 is supported is? > > I honestly don’t see why capabilities shouldn’t be in TCM, as it's really > just telling every node in the cluster what can be done, but also I think > we should be cautious and really ask “does X need to be global?”. For > example, the fact a node supports SAI impacts streaming, but if the files > are not understood do we just ignore them? So is it safe/good to avoid > defining SAI as a capability? What about BTI? If you stream a BTI file > over to a node that doesn’t know it, then there be dragons… so maybe BTI is > a capability? What about BIG versions? > > Now about the TCM side of things… let’s assume we are in a mixed version > case… 5.1 and 5.2 (5.2 adds a new file format called YFH (“YouFancyHuh” ™)). > The new 5.2 node starts up and reports its new and awesome capability of > YFH, this then propagates to the 5.1 nodes that have no idea what YFH is > (but MUST be able to parse this)…. So to me this feels like the only safe > way to really define capabilities is to define an enum (you exist or you > don’t), or map<string, string>, anything else seems like it's going to > cause issues with mixed mode. > > With regard to TCM configs (such as guard rails), I feel that it's still > best to be local… I have been involved with clusters where we make these > configs consistent across the cluster, then on a handful of nodes we > change the configs… this has the benefit of enabling features for moments > of time and on select nodes (2i can be blocked by default, but when ops > want to allow a table to have 2i they enable on a single node, do the ALTER > on that node…)… if you start to move configs to TCM then how do we do > staged or partial rollout? What about temporary configs (like enabling 2i > for a few seconds)? 
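David's point above - that a 5.1 node must be able to parse a capability name it has never heard of - is why opaque strings (an enum name or map<string, string>) are the safe encoding. A minimal sketch of what that buys you (capability names here are invented for illustration, not real Cassandra capabilities):

```python
# Sketch: capabilities advertised as opaque strings, so an older node can
# store and forward a name it does not understand, while the cluster can
# still answer "does every node support X?" before enabling X.

def cluster_supports(capability, node_capabilities):
    """True only if every node advertises the capability."""
    return all(capability in caps for caps in node_capabilities.values())

# node -> advertised capabilities (opaque strings; a map<string, string>
# variant could carry versions too, e.g. {"accord.msg_version": "2"})
node_capabilities = {
    "node1": {"sstable.bti", "sstable.yfh"},  # 5.2 node, knows YFH
    "node2": {"sstable.bti"},                 # 5.1 node, YFH unknown to it
}

# The 5.1 node can propagate "sstable.yfh" without understanding it, and
# the cluster correctly reports YFH as not yet safe to enable.
assert cluster_supports("sstable.bti", node_capabilities)
assert not cluster_supports("sstable.yfh", node_capabilities)
```

Because the values are never interpreted by nodes that don't recognise them, mixed-mode parsing can't fail - which is exactly the property a richer structured encoding would lose.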
> > > On Jan 6, 2025, at 2:09 PM, Jon Haddad <j...@rustyrazorblade.com> wrote: > > What about finally adding a much desired EverywhereStrategy? It wouldn't > just be useful for config - system_auth bites a lot of people today. > > As much as I don't like to suggest row cache, it might be a good fit here > as well. We could remove the custom code around auth cache in the process. > > Jon > > On Mon, Jan 6, 2025 at 12:48 PM Benedict Elliott Smith < > bened...@apache.org> wrote: > >> The more we talk about this, the more my position crystallises against >> this approach. The feature we’re discussing here should be easy to >> implement on top of user-facing functionality; we aren’t the only people >> who want functionality like this. We should be dogfooding our own UX for >> this kind of capability. >> >> TCM is unique in that it *cannot* dogfood the database. As a result it >> is not only critical for correctness, it’s also more complex - and >> inefficient - than a native database feature could be. It’s the worst of >> both worlds: we couple critical functionality to non-critical features, and >> couple those non-critical features to more complex logic than they need. >> >> My vote would be to introduce a new table feature that provides a >> node-local time-bounded cache, so that you can safely perform CL.ONE >> queries against it, and let the whole world use it. >> >> >> On 6 Jan 2025, at 18:23, Blake Eggleston <beggles...@apple.com> wrote: >> >> TCM was designed with a couple of very specific correctness-critical use >> cases in mind, not as a generic mechanism for everyone to extend. >> >> >> Its initial scope was for those use cases, but its potential for >> enabling more sophisticated functionality was one of its selling points and >> is listed in the CEP. >> >> Folks transitively breaking cluster membership by accidentally breaking >> the shared dependency of a non-critical feature is a risk I don’t like much. 
>> >> >> Having multiple distributed config systems operating independently is >> going to create its own set of problems, especially if the distributed >> config has any level of interaction with schema or topology. >> >> I lean towards distributed config going into TCM, although a more >> friendly api for extension that offers some guardrails would be a good idea. >> >> On Jan 6, 2025, at 9:21 AM, Aleksey Yeshchenko <alek...@apple.com> wrote: >> >> Would you mind elaborating on what makes it unsuitable? I don’t have a >> good mental model on its properties, so I assumed that it could be used to >> disseminate arbitrary key value pairs like config fairly easily. >> >> >> It’s more than *capable* of disseminating arbitrary-ish key-value pairs - >> it can deal with schema after all. >> >> I claim it to be *unsuitable* because of the coupling it would introduce >> between components of different levels of criticality. You can derisk it >> partially by having separate logs (which might not be trivial to >> implement). But unless you also duplicate all the TCM logic in some other >> package, the shared code dependency coupling persists. Folks transitively >> breaking cluster membership by accidentally breaking the shared dependency >> of a non-critical feature is a risk I don’t like much. Keep it tight, >> single-purpose, let it harden over time without being disrupted. >> >> On 6 Jan 2025, at 16:54, Aleksey Yeshchenko <alek...@apple.com> wrote: >> >> I agree that this would be useful, yes. >> >> An LWT/Accord variant plus a plain writes eventually consistent variant. >> A generic-by-design internal-only per-table mechanism with optional caching >> + optional write notifications issued to non-replicas. >> >> On 6 Jan 2025, at 14:26, Josh McKenzie <jmcken...@apache.org> wrote: >> >> I think if we go down the route of pushing configs around with LWT + >> caching instead, we should have that be a generic system that is designed >> for everyone to use. >> >> Agreed. 
Otherwise we end up with the same problem Aleksey's speaking >> about above, where we build something for a specific purpose and then >> maintainers in the future with a reasonable need extend or bend it to fit >> their new need, risking destabilizing the original implementation. >> >> Better to have a solid shared primitive other features can build upon. >> >> On Mon, Jan 6, 2025, at 8:33 AM, Jon Haddad wrote: >> >> Would you mind elaborating on what makes it unsuitable? I don’t have a >> good mental model on its properties, so I assumed that it could be used to >> disseminate arbitrary key value pairs like config fairly easily. >> >> Somewhat humorously, I think that same assumption was made when putting >> sai metadata into gossip which caused a cluster with 800 2i to break it. >> >> I think if we go down the route of pushing configs around with LWT + >> caching instead, we should have that be a generic system that is designed >> for everyone to use. Then we have a gossip replacement, reduce config >> clutter, and people have something that can be used without adding another >> bespoke system into the mix. >> >> Jon >> >> On Mon, Jan 6, 2025 at 6:48 AM Aleksey Yeshchenko <alek...@apple.com> >> wrote: >> >> TCM was designed with a couple of very specific correctness-critical use >> cases in mind, not as a generic mechanism for everyone to extend. >> >> It might be *convenient* to employ TCM for some other features, which >> makes it tempting to abuse TCM for an unintended purpose, but we shouldn’t >> do what's convenient over what is right. There are several ways this often >> goes wrong. >> >> For example, the subsystem gets used as is, without modification, by a >> new feature, but in ways that invalidate the assumptions behind the design >> of the subsystem - designed for particular use cases. 
>> >> For another example, the subsystem *almost* works as is for the new >> feature, but doesn't *quite* work as is, so changes are made to it, and >> reviewed, by someone not familiar enough with the subsystem design and >> implementation. One such change eventually introduces a bug to the >> shared critical subsystem, and now everyone is having a bad time. >> >> The risks are real, and I’d strongly prefer that we didn’t co-opt a >> critical subsystem for a non-critical use-case for this reason alone. >> >> On 21 Dec 2024, at 23:18, Jordan West <jorda...@gmail.com> wrote: >> >> I tend to lean towards Josh's perspective. Gossip was poorly tested and >> implemented. I don't think it's a good parallel or at least I hope it's not. >> Taken to the extreme we shouldn't touch the database at all otherwise, >> which isn't practical. That said, anything touching important subsystems >> needs more care, testing, and time to bake. I think we're mostly discussing >> "being careful" of which I am totally on board with. I don't think Benedict >> ever said "don't use TCM", in fact he's said the opposite, but emphasized >> the care that is required when we do, which is totally reasonable. >> >> Back to capabilities, Riak built them on an eventually consistent >> subsystem and they worked fine. If you have a split brain you likely don't >> want to communicate agreement as is (or have already learned about >> agreement and it's not an issue). That said, I don't think we have an EC >> layer in C* I would want to rely on outside of distributed tables. So in >> the context of what we have existing I think TCM is a better fit. I still >> need to dig a little more to be convinced and plan to do that as I draft >> the CEP. >> >> Jordan >> >> On Sat, Dec 21, 2024 at 5:51 AM Benedict <bened...@apache.org> wrote: >> >> >> I’m not saying we need to tease out bugs from TCM. 
I’m saying every time >> someone touches something this central to correctness we introduce a risk >> of breaking it, and that we should exercise that risk judiciously. This has >> zero to do with the amount of data we’re pushing through it, and 100% to do >> with writing bad code. >> >> We treated gossip carefully in part because it was hard to work with, but >> in part because getting it wrong was particularly bad. We should retain the >> latter reason for caution. >> >> We also absolutely do not need TCM for consistency. We have consistent >> database functionality for that. TCM is special because it cannot rely on >> the database mechanisms, as it underpins them. That is the whole point of >> why we should treat it carefully. >> >> On 21 Dec 2024, at 13:43, Josh McKenzie <jmcken...@apache.org> wrote: >> >> >> To play the devil's advocate - the more we exercise TCM the more bugs we >> suss out. To Jon's point, the volume of information we're talking about >> here in terms of capabilities dissemination shouldn't stress TCM at all. >> >> I think a reasonable heuristic for relying on TCM for something is >> whether there's a big difference in UX on something being eventually >> consistent vs. strongly consistent. Exposing features to clients based on >> whether the entire cluster supports them seems like the kind of thing that >> could cause pain if we're in a split-brain, >> cluster-is-settling-on-agreement kind of paradigm. >> >> On Fri, Dec 20, 2024, at 3:17 PM, Benedict wrote: >> >> >> Mostly conceptual; the problem with a linearizable history is that if you >> lose some of it (eg because some logic bug prevents you from processing >> some epoch) you stop the world until an operator can step in to perform >> surgery about what the history should be. >> >> I do know of one recent bug to schema changes in cep-15 that broke TCM in >> this way. That particular avenue will be hardened, but the fewer places we >> risk this the better IMO. 
>> >> Of course, there are steps we could take to expose a limited API >> targeting these use cases, as well as using a separate log for ancillary >> functionality, that might better balance risk:reward. But equally I’m not >> sure it makes sense to TCM all the things, and maybe dogfooding our own >> database features and developing functionality that enables our own use >> cases could be better where it isn’t necessary 🤷♀️ >> >> >> On 20 Dec 2024, at 19:22, Jordan West <jorda...@gmail.com> wrote: >> >> >> On Fri, Dec 20, 2024 at 11:06 AM Benedict <bened...@apache.org> wrote: >> >> >> If TCM breaks we all have a really bad time, much worse than if any one >> of these features individually has problems. If you break TCM in the right >> way the cluster could become inoperable, or operations like topology >> changes may be prevented. >> >> >> Benedict, when you say this are you speaking hypothetically (in the sense >> that by using TCM more we increase the probability of using it "wrong" and >> hitting an unknown edge case) or are there known ways today that TCM >> "breaks"? >> >> Jordan >> >> >> This means that even a parallel log has some risk if we end up modifying >> shared functionality. >> >> >> >> >> On 20 Dec 2024, at 18:47, Štefan Miklošovič <smikloso...@apache.org> >> wrote: >> >> >> I stand corrected. C in TCM is "cluster" :D Anyway. Configuration is >> super reasonable to be put there. >> >> On Fri, Dec 20, 2024 at 7:42 PM Štefan Miklošovič <smikloso...@apache.org> >> wrote: >> >> I am super hesitant to base distributed guardrails or any configuration >> for that matter on anything but TCM. Does not "C" in TCM stand for >> "configuration" anyway? So rename it to TSM like "schema" then if it is >> meant to be just for that. It seems to be quite ridiculous to code tables >> with caches on top when we have way more effective tooling thanks to CEP-21 >> to deal with that with clear advantages of getting rid of all of that old >> mechanism we have in place. 
>> >> I have not seen any concrete examples of risks explaining why TCM should be >> used just for what it is currently for. Why not put the configuration meant to >> be cluster-wide into that? >> >> What is it ... performance? What does the term "additional >> complexity" even mean? Complex in what? Do you think that putting 3 types >> of transformations in there, in the case of guardrails, which flip some booleans and >> numbers, would suddenly make TCM way more complex? Come on ... >> >> This has nothing to do with what Jordan is trying to introduce. I think >> we all agree he knows what he is doing and if he evaluates that TCM is too >> much for his use case (or it is not a good fit) that is perfectly fine. >> >> On Fri, Dec 20, 2024 at 7:22 PM Paulo Motta <pa...@apache.org> wrote: >> >> > It should be possible to use distributed system tables just fine for >> capabilities, config and guardrails. >> >> I have been thinking about this recently and I agree we should be wary >> about introducing new TCM states and creating additional complexity that can >> be serviced by existing data dissemination mechanisms (gossip/system >> tables). I would prefer that we take a more phased and incremental approach >> to introduce new TCM states. >> >> As a way to accomplish that, I have thought about introducing a new >> generic TCM state "In Maintenance", where schema or membership changes are >> "frozen/disallowed" while an external operation is taking place. This >> "external operation" could mean many things: >> - Upgrade >> - Downgrade >> - Migration >> - Capability Enablement/Disablement >> >> These could be sub-states of the "Maintenance" TCM state, that could be >> managed externally (via cache/gossip/system tables/sidecar). Once these >> sub-states are validated thoroughly and are mature enough, we could "promote" >> them to top-level TCM states. >> >> In the end what really matters is that cluster membership and schema >> changes do not happen while a miscellaneous operation is taking place. 
>> >> Would this make sense as an initial way to integrate TCM with >> the capabilities framework? >> >> On Fri, Dec 20, 2024 at 4:53 AM Benedict <bened...@apache.org> wrote: >> >> >> If you perform a read from a distributed table on startup you will find >> the latest information. What catchup are you thinking of? I don’t think any >> of the features we talked about need a log, only the latest information. >> >> We can (and should) probably introduce event listeners for distributed >> tables, as this is also a really great feature, but I don’t think this >> should be necessary here. >> >> Regarding disagreements: if you use LWTs then there are no consistency >> issues to worry about. >> >> Again, I’m not opposed to using TCM, although I am a little worried TCM >> is becoming our new hammer, with everything looking like a nail. It would be better IMO >> to keep TCM scoped to essential functionality as it’s critical to >> correctness. Perhaps we could extend its APIs to less critical services >> without intertwining them with membership, schema and epoch handling. >> >> On 20 Dec 2024, at 09:43, Štefan Miklošovič <smikloso...@apache.org> >> wrote: >> >> I find TCM way more comfortable to work with. The capability of the log being >> replayed on restart and catching up with everything else automatically is >> a godsend. If we had that on "good old distributed tables", then is it not >> true that we would need to take extra care of that, e.g. we would need to >> repair it etc ... It might be a source of discrepancies / >> disagreements etc. TCM is just "maintenance-free" and _just works_. >> >> I think I was also investigating distributed tables but was just pulled >> towards TCM naturally because of its goodies. >> >> On Fri, Dec 20, 2024 at 10:08 AM Benedict <bened...@apache.org> wrote: >> >> >> TCM is a perfectly valid basis for this, but TCM is only really >> *necessary* to solve meta config problems where we can’t rely on the rest >> of the database working. 
Particularly placement issues, which is why schema >> and membership need to live there. >> >> It should be possible to use distributed system tables just fine for >> capabilities, config and guardrails. >> >> That said, it’s possible config might be better represented as part of >> the schema (and we already store some relevant config there) in which case >> it would live in TCM automatically. Migrating existing configs to a >> distributed setup will be fun however we do it though. >> >> Capabilities also feel naturally related to other membership information, >> so TCM might be the most suitable place, particularly for handling >> downgrades after capabilities have been enabled (if we ever expect to >> support turning off capabilities and then downgrading - which today we >> mostly don’t). >> >> On 20 Dec 2024, at 08:42, Štefan Miklošovič <smikloso...@apache.org> >> wrote: >> >> Jordan, >> >> I also think that having it on TCM would be ideal and we should explore >> this path first before doing anything custom. >> >> Regarding my idea about the guardrails in TCM, when I prototyped that and >> wanted to make it happen, there was a little bit of a pushback (1) (even >> though a super reasonable one) that TCM is just too young at the moment and >> it would be desirable to go through some stabilisation period. >> >> Another idea was that we should not make just guardrails happen but the >> whole config should be in TCM. From what I put together, Sam / Alex do >> not seem to be opposed to this idea, rather the opposite, but having a CEP >> about that is way more involved than having just guardrails there. I >> consider guardrails to be kind of special and I do not think that having >> all configurations in TCM (which guardrails are part of) is an absolute >> must in order to deliver that. I may start with a guardrails CEP and you may >> explore a Capabilities CEP on TCM too, if that makes sense? 
>> >> I just wanted to raise the point about the time this would be delivered. >> If Capabilities are built on TCM and I wanted to do Guardrails on TCM too >> but was told it is probably too soon, I guess you would experience >> something similar. >> >> Sam's comment is from May and maybe a lot has changed since then and >> his comment is no longer applicable. It would be great to know if we >> could build on top of the current trunk already or whether we should wait until >> 5.1/6.0 is delivered. >> >> (1) >> https://issues.apache.org/jira/browse/CASSANDRA-19593?focusedCommentId=17844326&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17844326 >> >> On Fri, Dec 20, 2024 at 2:17 AM Jordan West <jorda...@gmail.com> wrote: >> >> Firstly, glad to see the support and enthusiasm here and in the recent >> Slack discussion. I think there is enough for me to start drafting a CEP. >> >> Stefan, global configuration and capabilities do have some overlap but >> not full overlap. For example, you may want to set globally that a cluster >> enables feature X or control the threshold for a guardrail but you still >> need to know if all nodes support feature X or have that guardrail, the >> latter is what capabilities targets. I do think capabilities are a step >> towards supporting global configuration and the work you described is >> another step (that we could do after capabilities or in parallel with them >> in mind). I am also supportive of exploring global configuration for the >> reasons you mentioned. >> >> In terms of how capabilities get propagated across the cluster, I hadn't >> put much thought into it yet past likely TCM since this will be a new >> feature that lands after TCM. In Riak, we had gossip (but more mature than >> C*s -- this was an area I contributed to a lot so very familiar) to >> disseminate less critical information such as capabilities and a separate >> layer that did TCM. 
Since we don't have this in C* I don't think we would >> want to build a separate distribution channel for capabilities metadata >> when we already have TCM in place. But I plan to explore this more as I >> draft the CEP. >> >> Jordan >> >> On Thu, Dec 19, 2024 at 1:48 PM Štefan Miklošovič <smikloso...@apache.org> >> wrote: >> >> Hi Jordan, >> >> what would this look like from the implementation perspective? I was >> experimenting with transactional guardrails where an operator would control >> the content of a virtual table which would be backed by TCM so whatever >> guardrail we would change, this would be automatically and transparently >> propagated to every node in a cluster. The POC worked quite nicely. TCM is >> just a vehicle to commit a change which would spread around and all these >> settings would survive restarts. We would have the same configuration >> everywhere which is not currently the case because guardrails are >> configured per node and if not persisted to yaml, on restart their values >> would be forgotten. >> >> Guardrails are just an example, what is quite obvious is to expand this >> idea to the whole configuration in yaml. Of course, not all properties in >> yaml make sense to be the same cluster-wide (ip addresses etc ...), but the >> ones which do would be again set everywhere the same way. >> >> The approach I described above is that we make sure that the >> configuration is the same everywhere, hence there can be no misunderstanding >> what features this or that node has, if we say that all nodes have to have >> a particular feature because we said so in the TCM log; on restart / replay, >> a node will "catch up" with whatever features it is asked to turn on. >> >> Your approach seems to be that we distribute what all capabilities / >> features a cluster supports and that each individual node configures itself >> (or not) to comply? >> >> Is there any intersection in these approaches? At first sight they seem >> somehow related. 
How is one different from the other from your point of view? >> >> Regards >> >> (1) https://issues.apache.org/jira/browse/CASSANDRA-19593 >> >> On Thu, Dec 19, 2024 at 12:00 AM Jordan West <jw...@apache.org> wrote: >> >> In a recent discussion on the pains of upgrading one topic that came up >> was a feature that Riak had called Capabilities [1]. A major pain with >> upgrades is that each node independently decides when to start using new or >> modified functionality. Even when we put this behind a config (like storage >> compatibility mode) each node immediately enables the feature when the >> config is changed and the node is restarted. This causes various types of >> upgrade pain such as failed streams and schema disagreement. A >> recent example of this is CASSANDRA-20118 [2]. In some cases operators can >> prevent this from happening through careful coordination (e.g. ensuring >> upgrade sstables only runs after the whole cluster is upgraded) but this >> typically requires custom code in whatever control plane the operator is >> using. A capabilities framework would distribute the state of what features >> each node has (and their status e.g. enabled or not) so that the cluster >> can choose to opt in to new features once the whole cluster has them >> available. From experience, having this in Riak made upgrades a >> significantly less risky process and also paved a path towards repeatable >> downgrades. I think Cassandra would benefit from it as well. >> >> Further, other tools like analytics could benefit from having this >> information since currently it's up to the operator to manually determine >> the state of the cluster in some cases. >> >> I am considering drafting a CEP proposal for this feature but wanted to >> take the general temperature of the community and get some early thoughts >> while working on the draft. 
>> >> Looking forward to hearing y'alls thoughts, >> Jordan >> >> [1] >> https://github.com/basho/riak_core/blob/25d9a6fa917eb8a2e95795d64eb88d7ad384ed88/src/riak_core_capability.erl#L23-L72 >> >> [2] https://issues.apache.org/jira/browse/CASSANDRA-20118 >> >> >> >> >> >> >