I agree that this would be useful, yes: an LWT/Accord variant plus a plain-writes, eventually consistent variant. A generic-by-design, internal-only, per-table mechanism with optional caching + optional write notifications issued to non-replicas.
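For concreteness, a minimal CQL sketch of the two variants (the keyspace, table, and threshold values below are purely illustrative, not an actual Cassandra schema):

  CREATE TABLE IF NOT EXISTS config_demo.shared_state (
      scope text,   -- e.g. 'guardrails', 'capabilities'
      key   text,
      value text,
      PRIMARY KEY (scope, key)
  );

  -- LWT/Accord variant: serial compare-and-set, so concurrent writers
  -- cannot silently overwrite each other
  UPDATE config_demo.shared_state
  SET value = '16MiB'
  WHERE scope = 'guardrails' AND key = 'column_value_size_fail_threshold'
  IF value = '8MiB';

  -- Plain-writes variant: eventually consistent, last-write-wins
  UPDATE config_demo.shared_state
  SET value = '16MiB'
  WHERE scope = 'guardrails' AND key = 'column_value_size_fail_threshold';

The optional cache would sit in front of reads on each node, with the optional write notifications invalidating it on non-replicas.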
> On 6 Jan 2025, at 14:26, Josh McKenzie <jmcken...@apache.org> wrote:
>
>> I think if we go down the route of pushing configs around with LWT + caching instead, we should have that be a generic system that is designed for everyone to use.
>
> Agreed. Otherwise we end up with the same problem Aleksey's speaking about above, where we build something for a specific purpose and then maintainers in the future with a reasonable need extend or bend it to fit their new need, risking destabilizing the original implementation.
>
> Better to have a solid shared primitive other features can build upon.
>
> On Mon, Jan 6, 2025, at 8:33 AM, Jon Haddad wrote:
>> Would you mind elaborating on what makes it unsuitable? I don't have a good mental model of its properties, so I assumed that it could be used to disseminate arbitrary key-value pairs like config fairly easily.
>>
>> Somewhat humorously, I think that same assumption was made when putting SAI metadata into gossip, which caused a cluster with 800 secondary indexes (2i) to break it.
>>
>> I think if we go down the route of pushing configs around with LWT + caching instead, we should have that be a generic system that is designed for everyone to use. Then we have a gossip replacement, reduce config clutter, and people have something that can be used without adding another bespoke system into the mix.
>>
>> Jon
>>
>> On Mon, Jan 6, 2025 at 6:48 AM Aleksey Yeshchenko <alek...@apple.com <mailto:alek...@apple.com>> wrote:
>> TCM was designed with a couple of very specific correctness-critical use cases in mind, not as a generic mechanism for everyone to extend.
>>
>> It might be *convenient* to employ TCM for some other features, which makes it tempting to abuse TCM for an unintended purpose, but we shouldn't do what's convenient over what is right. There are several ways this often goes wrong.
>>
>> For example, the subsystem gets used as is, without modification, by a new feature, but in ways that invalidate the assumptions behind the design of the subsystem, which was designed for particular use cases.
>>
>> For another example, the subsystem *almost* works as is for the new feature, but doesn't *quite* work as is, so changes are made to it, and reviewed, by someone not familiar enough with the subsystem's design and implementation. One such change eventually introduces a bug into the shared critical subsystem, and now everyone is having a bad time.
>>
>> The risks are real, and I'd strongly prefer that we didn't co-opt a critical subsystem for a non-critical use case for this reason alone.
>>
>>> On 21 Dec 2024, at 23:18, Jordan West <jorda...@gmail.com <mailto:jorda...@gmail.com>> wrote:
>>>
>>> I tend to lean towards Josh's perspective. Gossip was poorly tested and implemented; I don't think it's a good parallel, or at least I hope it's not. Taken to the extreme, that reasoning says we shouldn't touch the database at all, which isn't practical. That said, anything touching important subsystems needs more care, testing, and time to bake. I think we're mostly discussing "being careful", which I am totally on board with. I don't think Benedict ever said "don't use TCM", in fact he's said the opposite, but he emphasized the care that is required when we do, which is totally reasonable.
>>>
>>> Back to capabilities, Riak built them on an eventually consistent subsystem and they worked fine. If you have a split brain you likely don't want to communicate agreement as is (or you have already learned about agreement and it's not an issue). That said, I don't think we have an EC layer in C* I would want to rely on outside of distributed tables. So in the context of what we have today, I think TCM is a better fit. I still need to dig a little more to be convinced and plan to do that as I draft the CEP.
>>>
>>> Jordan
>>>
>>> On Sat, Dec 21, 2024 at 5:51 AM Benedict <bened...@apache.org <mailto:bened...@apache.org>> wrote:
>>>
>>> I'm not saying we need to tease out bugs from TCM. I'm saying every time someone touches something this central to correctness we introduce a risk of breaking it, and that we should take that risk judiciously. This has zero to do with the amount of data we're pushing through it, and 100% to do with writing bad code.
>>>
>>> We treated gossip carefully in part because it was hard to work with, but in part because getting it wrong was particularly bad. We should retain the latter reason for caution.
>>>
>>> We also absolutely do not need TCM for consistency. We have consistent database functionality for that. TCM is special because it cannot rely on the database mechanisms, as it underpins them. That is the whole point of why we should treat it carefully.
>>>
>>>> On 21 Dec 2024, at 13:43, Josh McKenzie <jmcken...@apache.org <mailto:jmcken...@apache.org>> wrote:
>>>>
>>>> To play the devil's advocate - the more we exercise TCM, the more bugs we suss out. To Jon's point, the volume of information we're talking about here in terms of capabilities dissemination shouldn't stress TCM at all.
>>>>
>>>> I think a reasonable heuristic for relying on TCM for something is whether there's a big difference in UX between something being eventually consistent vs. strongly consistent. Exposing features to clients based on whether the entire cluster supports them seems like the kind of thing that could cause pain if we're in a split-brain, cluster-is-settling-on-agreement kind of paradigm.
>>>>
>>>> On Fri, Dec 20, 2024, at 3:17 PM, Benedict wrote:
>>>>>
>>>>> Mostly conceptual; the problem with a linearizable history is that if you lose some of it (e.g. because some logic bug prevents you from processing some epoch) you stop the world until an operator can step in to perform surgery on what the history should be.
>>>>>
>>>>> I do know of one recent bug in schema changes in cep-15 that broke TCM in this way. That particular avenue will be hardened, but the fewer places we risk this the better IMO.
>>>>>
>>>>> Of course, there are steps we could take to expose a limited API targeting these use cases, as well as using a separate log for ancillary functionality, that might better balance risk:reward. But equally I'm not sure it makes sense to TCM all the things, and maybe dogfooding our own database features and developing functionality that enables our own use cases could be better where it isn't necessary 🤷‍♀️
>>>>>
>>>>>> On 20 Dec 2024, at 19:22, Jordan West <jorda...@gmail.com <mailto:jorda...@gmail.com>> wrote:
>>>>>>
>>>>>> On Fri, Dec 20, 2024 at 11:06 AM Benedict <bened...@apache.org <mailto:bened...@apache.org>> wrote:
>>>>>>
>>>>>> If TCM breaks we all have a really bad time, much worse than if any one of these features individually has problems. If you break TCM in the right way the cluster could become inoperable, or operations like topology changes may be prevented.
>>>>>>
>>>>>> Benedict, when you say this are you speaking hypothetically (in the sense that by using TCM more we increase the probability of using it "wrong" and hitting an unknown edge case) or are there known ways today that TCM "breaks"?
>>>>>>
>>>>>> Jordan
>>>>>>
>>>>>> This means that even a parallel log has some risk if we end up modifying shared functionality.
>>>>>>
>>>>>>> On 20 Dec 2024, at 18:47, Štefan Miklošovič <smikloso...@apache.org <mailto:smikloso...@apache.org>> wrote:
>>>>>>>
>>>>>>> I stand corrected. C in TCM is "cluster" :D Anyway. Configuration is super reasonable to be put there.
>>>>>>>
>>>>>>> On Fri, Dec 20, 2024 at 7:42 PM Štefan Miklošovič <smikloso...@apache.org <mailto:smikloso...@apache.org>> wrote:
>>>>>>> I am super hesitant to base distributed guardrails, or any configuration for that matter, on anything but TCM. Doesn't the "C" in TCM stand for "configuration" anyway? So rename it to TSM, as in "schema", if it is meant to be just for that. It seems quite ridiculous to code tables with caches on top when, thanks to CEP-21, we have far more effective tooling to deal with that, with the clear advantage of getting rid of all of that old mechanism we have in place.
>>>>>>>
>>>>>>> I have not seen any concrete examples of risks explaining why TCM should be used only for what it is currently for. Why not put configuration meant to be cluster-wide into it?
>>>>>>>
>>>>>>> What is it ... performance? What does the term "additional complexity" even mean? Complex in what way? Do you think that adding 3 types of transformations for guardrails, which flip some booleans and numbers, would suddenly make TCM way more complex? Come on ...
>>>>>>>
>>>>>>> This has nothing to do with what Jordan is trying to introduce. I think we all agree he knows what he is doing, and if he evaluates that TCM is too much for his use case (or is not a good fit) that is perfectly fine.
>>>>>>>
>>>>>>> On Fri, Dec 20, 2024 at 7:22 PM Paulo Motta <pa...@apache.org <mailto:pa...@apache.org>> wrote:
>>>>>>> > It should be possible to use distributed system tables just fine for capabilities, config and guardrails.
>>>>>>>
>>>>>>> I have been thinking about this recently and I agree we should be wary about introducing new TCM states and creating additional complexity that can be serviced by existing data dissemination mechanisms (gossip/system tables). I would prefer that we take a more phased and incremental approach to introducing new TCM states.
>>>>>>>
>>>>>>> As a way to accomplish that, I have thought about introducing a new generic TCM state "In Maintenance", where schema or membership changes are "frozen/disallowed" while an external operation is taking place. This "external operation" could mean many things:
>>>>>>> - Upgrade
>>>>>>> - Downgrade
>>>>>>> - Migration
>>>>>>> - Capability Enablement/Disablement
>>>>>>>
>>>>>>> These could be sub-states of the "Maintenance" TCM state, that could be managed externally (via cache/gossip/system tables/sidecar). Once these sub-states are validated thoroughly and mature enough, we could "promote" them to top-level TCM states.
>>>>>>>
>>>>>>> In the end what really matters is that cluster membership and schema changes do not happen while a miscellaneous operation is taking place.
>>>>>>>
>>>>>>> Would this make sense as an initial way to integrate TCM with the capabilities framework?
>>>>>>>
>>>>>>> On Fri, Dec 20, 2024 at 4:53 AM Benedict <bened...@apache.org <mailto:bened...@apache.org>> wrote:
>>>>>>>
>>>>>>> If you perform a read from a distributed table on startup you will find the latest information. What catchup are you thinking of? I don't think any of the features we talked about need a log, only the latest information.
>>>>>>>
>>>>>>> We can (and should) probably introduce event listeners for distributed tables, as this is also a really great feature, but I don't think it should be necessary here.
>>>>>>>
>>>>>>> Regarding disagreements: if you use LWTs then there are no consistency issues to worry about.
>>>>>>>
>>>>>>> Again, I'm not opposed to using TCM, although I am a little worried TCM is becoming our new hammer, with everything a nail. It would be better IMO to keep TCM scoped to essential functionality, as it's critical to correctness. Perhaps we could extend its APIs to less critical services without intertwining them with membership, schema and epoch handling.
>>>>>>>
>>>>>>>> On 20 Dec 2024, at 09:43, Štefan Miklošovič <smikloso...@apache.org <mailto:smikloso...@apache.org>> wrote:
>>>>>>>>
>>>>>>>> I find TCM way more comfortable to work with. The capability of the log being replayed on restart, catching up with everything else automatically, is a godsend. If we had that on "good old distributed tables", wouldn't we need to take extra care of them, e.g. repair them, etc.? They might be a source of discrepancies / disagreements. TCM is just "maintenance-free" and _just works_.
>>>>>>>>
>>>>>>>> I think I was also investigating distributed tables but was just pulled towards TCM naturally because of its goodies.
>>>>>>>>
>>>>>>>> On Fri, Dec 20, 2024 at 10:08 AM Benedict <bened...@apache.org <mailto:bened...@apache.org>> wrote:
>>>>>>>>
>>>>>>>> TCM is a perfectly valid basis for this, but TCM is only really *necessary* to solve meta-config problems where we can't rely on the rest of the database working. Particularly placement issues, which is why schema and membership need to live there.
>>>>>>>>
>>>>>>>> It should be possible to use distributed system tables just fine for capabilities, config and guardrails.
>>>>>>>>
>>>>>>>> That said, it's possible config might be better represented as part of the schema (and we already store some relevant config there), in which case it would live in TCM automatically. Migrating existing configs to a distributed setup will be fun however we do it, though.
>>>>>>>>
>>>>>>>> Capabilities also feel naturally related to other membership information, so TCM might be the most suitable place, particularly for handling downgrades after capabilities have been enabled (if we ever expect to support turning off capabilities and then downgrading - which today we mostly don't).
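To make the distributed-table route Benedict describes concrete, a rough CQL sketch (reusing the hypothetical config_demo.shared_state table from the sketch near the top of this thread): a serial read at startup observes the latest committed LWT write, and a conditional update cannot silently lose a concurrent change.

  -- in cqlsh: serial reads observe the latest committed LWT state
  CONSISTENCY SERIAL;
  SELECT value FROM config_demo.shared_state
  WHERE scope = 'guardrails' AND key = 'column_value_size_fail_threshold';

  -- concurrent CAS writers cannot clobber each other: the loser gets
  -- [applied] = False along with the currently stored value
  UPDATE config_demo.shared_state
  SET value = '32MiB'
  WHERE scope = 'guardrails' AND key = 'column_value_size_fail_threshold'
  IF value = '16MiB';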
>>>>>>>>
>>>>>>>>> On 20 Dec 2024, at 08:42, Štefan Miklošovič <smikloso...@apache.org <mailto:smikloso...@apache.org>> wrote:
>>>>>>>>>
>>>>>>>>> Jordan,
>>>>>>>>>
>>>>>>>>> I also think that having it on TCM would be ideal and we should explore this path first before doing anything custom.
>>>>>>>>>
>>>>>>>>> Regarding my idea about the guardrails in TCM, when I prototyped that and wanted to make it happen, there was a little bit of pushback (1) (even though a super reasonable one): that TCM is just too young at the moment and it would be desirable to go through some stabilisation period.
>>>>>>>>>
>>>>>>>>> Another idea was that we should not just make guardrails happen; the whole config should be in TCM. From what I put together, Sam / Alex do not seem to be opposed to this idea, rather the opposite, but having a CEP about that is way more involved than having just guardrails there. I consider guardrails to be kind of special, and I do not think that having all configuration in TCM (which guardrails are part of) is an absolute must in order to deliver that. I may start with a guardrails CEP and you may explore a Capabilities CEP on TCM too, if that makes sense?
>>>>>>>>>
>>>>>>>>> I just wanted to raise the point about the time this would be delivered. If Capabilities are built on TCM, and I wanted to do Guardrails on TCM too but was told it is probably too soon, I guess you would experience something similar.
>>>>>>>>>
>>>>>>>>> Sam's comment is from May, and maybe a lot has changed since then and his comment is not applicable anymore. It would be great to know whether we could build on top of the current trunk already or should wait until 5.1/6.0 is delivered.
>>>>>>>>>
>>>>>>>>> (1) https://issues.apache.org/jira/browse/CASSANDRA-19593?focusedCommentId=17844326&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17844326
>>>>>>>>>
>>>>>>>>> On Fri, Dec 20, 2024 at 2:17 AM Jordan West <jorda...@gmail.com <mailto:jorda...@gmail.com>> wrote:
>>>>>>>>> Firstly, glad to see the support and enthusiasm here and in the recent Slack discussion. I think there is enough for me to start drafting a CEP.
>>>>>>>>>
>>>>>>>>> Stefan, global configuration and capabilities do have some overlap, but not full overlap. For example, you may want to set globally that a cluster enables feature X, or control the threshold for a guardrail, but you still need to know whether all nodes support feature X or have that guardrail; the latter is what capabilities targets. I do think capabilities are a step towards supporting global configuration, and the work you described is another step (that we could do after capabilities, or in parallel with them in mind). I am also supportive of exploring global configuration for the reasons you mentioned.
>>>>>>>>>
>>>>>>>>> In terms of how capabilities get propagated across the cluster, I hadn't put much thought into it yet past likely TCM, since this will be a new feature that lands after TCM. In Riak, we had gossip (but more mature than C*'s -- this was an area I contributed to a lot, so very familiar) to disseminate less critical information such as capabilities, and a separate layer that did TCM. Since we don't have this in C*, I don't think we would want to build a separate distribution channel for capabilities metadata when we already have TCM in place. But I plan to explore this more as I draft the CEP.
>>>>>>>>>
>>>>>>>>> Jordan
>>>>>>>>>
>>>>>>>>> On Thu, Dec 19, 2024 at 1:48 PM Štefan Miklošovič <smikloso...@apache.org <mailto:smikloso...@apache.org>> wrote:
>>>>>>>>> Hi Jordan,
>>>>>>>>>
>>>>>>>>> What would this look like from the implementation perspective? I was experimenting with transactional guardrails where an operator would control the content of a virtual table backed by TCM, so whatever guardrail we changed would be automatically and transparently propagated to every node in the cluster. The POC worked quite nicely. TCM is just a vehicle to commit a change which would spread around, and all these settings would survive restarts. We would have the same configuration everywhere, which is not currently the case, because guardrails are configured per node and, if not persisted to yaml, their values are forgotten on restart.
>>>>>>>>>
>>>>>>>>> Guardrails are just an example; the obvious next step is to expand this idea to the whole configuration in yaml. Of course, not all properties in yaml make sense to be the same cluster-wide (ip addresses etc ...), but the ones which do would again be set everywhere the same way.
>>>>>>>>>
>>>>>>>>> The approach I described above makes sure that the configuration is the same everywhere, hence there can be no misunderstanding about what features this or that node has: if we say that all nodes have to have a particular feature, because we said so in the TCM log, then on restart / replay a node will "catch up" with whatever features it is asked to turn on.
>>>>>>>>>
>>>>>>>>> Your approach seems to be that we distribute which capabilities / features a cluster supports, and that each individual node configures itself in some way (or not) to comply?
>>>>>>>>>
>>>>>>>>> Is there any intersection between these approaches? At first sight they seem related. How is one different from the other from your point of view?
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>>
>>>>>>>>> (1) https://issues.apache.org/jira/browse/CASSANDRA-19593
>>>>>>>>>
>>>>>>>>> On Thu, Dec 19, 2024 at 12:00 AM Jordan West <jw...@apache.org <mailto:jw...@apache.org>> wrote:
>>>>>>>>> In a recent discussion on the pains of upgrading, one topic that came up was a feature that Riak had called Capabilities [1]. A major pain with upgrades is that each node independently decides when to start using new or modified functionality. Even when we put this behind a config (like storage compatibility mode), each node immediately enables the feature when the config is changed and the node is restarted. This causes various types of upgrade pain such as failed streams and schema disagreement. A recent example of this is CASSANDRA-20118 [2]. In some cases operators can prevent this from happening through careful coordination (e.g. ensuring upgradesstables only runs after the whole cluster is upgraded), but this typically requires custom code in whatever control plane the operator is using. A capabilities framework would distribute the state of what features each node has (and their status, e.g. enabled or not) so that the cluster can choose to opt in to new features once the whole cluster has them available. From experience, having this in Riak made upgrades a significantly less risky process and also paved a path towards repeatable downgrades. I think Cassandra would benefit from it as well.
>>>>>>>>>
>>>>>>>>> Further, other tools like analytics could benefit from having this information, since currently it's up to the operator to manually determine the state of the cluster in some cases.
>>>>>>>>>
>>>>>>>>> I am considering drafting a CEP proposal for this feature but wanted to take the general temperature of the community and get some early thoughts while working on the draft.
>>>>>>>>>
>>>>>>>>> Looking forward to hearing y'all's thoughts,
>>>>>>>>> Jordan
>>>>>>>>>
>>>>>>>>> [1] https://github.com/basho/riak_core/blob/25d9a6fa917eb8a2e95795d64eb88d7ad384ed88/src/riak_core_capability.erl#L23-L72
>>>>>>>>>
>>>>>>>>> [2] https://issues.apache.org/jira/browse/CASSANDRA-20118
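As a closing illustration of the capability model described in the original message (loosely after Riak's riak_core_capability): each node advertises the features it supports, and the cluster only opts in once every live node reports a capability. All names below are hypothetical, sketched as a plain distributed table rather than a committed design:

  CREATE TABLE IF NOT EXISTS config_demo.node_capabilities (
      capability text,
      node_id    uuid,
      status     text,   -- e.g. 'supported', 'enabled'
      PRIMARY KEY (capability, node_id)
  );

  -- Before enabling a feature cluster-wide, check that every live node
  -- reports it as supported:
  SELECT node_id, status
  FROM config_demo.node_capabilities
  WHERE capability = 'new_sstable_format';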