If we can get opt-in major format upgrades, as well as an offline sstabledowngrade tool, I think we have a good first step that would make downgrades possible.

Given Jacek’s work on the sstable format API, and the work from Yuki and Claude on old formats, I think we are pretty close to having both of those be viable?

I think with the opt-in major format upgrades, the main thing will be to ensure that all new features that were built around the new format either fail gracefully, or for a change in behavior opt to the old behavior until the new format is available? If a new feature is using a feature flag this could be a simple check to throw a configuration exception if the feature is enabled, but the new sstable format is not available.

No features have yet merged that bump the sstable major version, but a few are finishing up that will. Do we want to block merging those changes until discussions here finish? I don’t think that we need to? The ticket which brings in the ability to opt-in to the sstable format change can also fix up the existing code to check the flag?

-Jeremiah

> On Feb 21, 2023, at 10:29 AM, Benedict <bened...@apache.org> wrote:
>
> As always, Scott puts it much more eloquently than I can.
>
> The only thing I’d quibble with is that I think it is better to make changes
> backwards compatible, rather than make earlier releases forwards compatible -
> and where this is prohibitively costly to simply make a feature that depends
> on it unavailable until the switch to the new major format.
>
> This provides the greatest flexibility for users, as they can upgrade from
> and downgrade to the same versions. There’s no scrambling for a different
> downgrade target you haven’t qualified when finding out there’s an
> unacceptable bug.
> There’s also less delta between pre-upgrade and post-downgrade behaviour.
>
> We have plenty of practice doing this kind of thing. It’s not that hard.
>
> But, if we want to go the forward compatibility route that’s still far better
> than nothing.
>
>
>> On 21 Feb 2023, at 16:17, C. Scott Andreas <sc...@paradoxica.net> wrote:
>>
>>
>> I realize my feedback on this has been spread across tickets and older
>> mailing list / wiki discussions, so I'll offer a proposal here.
>>
>> Starting with goals -
>>
>> 1. Cassandra users must be able to abort and revert an upgrade to a new
>> version of the database that introduces a new major SSTable format.
>>
>> This reduces risk of upgrading to a build that also introduces a
>> non-data-format-related bug that is intolerable. This goal does not specify
>> a mechanism or downgrade target - just the "downgradability" goal.
>>
>> 2. Where possible, Cassandra users should be able to opt into writing of a
>> new major SSTable format.
>>
>> This reduces that risk further by allowing users to decouple data format
>> changes from the upgrade itself. There may be cases where new features or
>> bug fixes prevent this from being possible, but I'll offer it as a goal.
>>
>> 3. It should be possible for users to perform the downgrade in-place by
>> launching the database using a previous version's binary.
>>
>> This avoids the need for complex orchestration of offline commands like a
>> hypothetical `downgradesstables`.
>>
>>
>> The following approach would allow us to accomplish these goals:
>>
>> 1. Major SSTable changes should begin with forward-compatibility in a prior
>> release.
>>
>> In a release prior to one that revs major SSTable versions, we should
>> implement the ability to read the SSTables that we intend to write in the
>> next major version. This would allow someone to (eg.,) revert from 5.0 to
>> 4.2 if they encountered a regression that caused an outage without data
>> loss. This downgrade path should be well-specified and called out in
>> NEWS.txt.
>>
>> 2. Where possible, major SSTable format changes should be opt-in (if the
>> features / bugfixes introduced allow).
>>
>> This would be via a flag to enable writing the new format once an operator
>> has determined that post-upgrade their clusters are sufficiently stable.
>> This is an approach that HDFS has adopted. Following a rolling upgrade of
>> HDFS, downgrade remains possible until an operator executes a "finalize"
>> operation to migrate NameNode metadata to the new version's. An approach
>> like this would allow users to perform a staged upgrade in which they first
>> rev the version of the database, followed by opting into its new format to
>> derisk (eg.,) adoption of BTI-indexed SSTables.
>>
>> These approaches aren't meant to discourage SSTable format evolution - but
>> to make it safer, and ideally faster. They don't specify duplicative
>> serialization or a game of Twister to hide fields in locations where old
>> versions don't think to look. Forward compatibility in a prior release could
>> be landed at the same time as the major format revision itself, so long as
>> we cut releases from both branches.
>>
>> Ability to back out an upgrade until finalized would dramatically lower the
>> risk of adopting new releases of Apache Cassandra. For many users, the
>> qualification cycle for a new release is more than a year - and a *lot* of
>> work.
>>
>> Reducing the risk of upgrading to new releases repositions Cassandra as a
>> database that can be treated with greater trust -- especially for
>> multi-petabyte, mission critical systems. Our user community will advance to
>> newer releases more quickly and we'll be able to shorten the maintenance
>> cycles for older releases. In the same way that CI stability enables us to
>> move faster and more confidently in the project, safety features like this
>> will enable our users (and indeed ourselves) to move more confidently to
>> adopt them.
>>
>> – Scott
>>
>>
>>> On Feb 21, 2023, at 4:51 AM, "Claude Warren, Jr via dev"
>>> <dev@cassandra.apache.org> wrote:
>>>
>>>
>>> My goal in implementing CASSANDRA-8928
>>> <https://issues.apache.org/jira/plugins/servlet/mobile#issue/CASSANDRA-8928>
>>> was to be able to take the current version 4.x and write it as the
>>> earliest 3.x version possible. The reasoning being that if that was
>>> possible then whatever 3.x version was executed would be able to
>>> automatically read the early 3.x version. My thought was that each release
>>> version would have the ability to downgrade to the earliest previous
>>> version. In this way if users need to they could string together a number
>>> of downgrader versions to move from 5.x to 3.x.
>>>
>>> My testing has been pretty straightforward. I created 4 docker containers
>>> using the standard published Cassandra docker images for 3.1 and 4.0 with
>>> data mounted on an external drive. Two of the containers (one of each
>>> version) did not automatically start Cassandra. My process was then:
>>> 1. Start and stop Cassandra 4.0 to create the default data files.
>>> 2. Start the Cassandra 4.0 container that does not automatically run
>>>    Cassandra and execute the new downgrade functionality.
>>> 3. Start the Cassandra 3.1 container and dump the logs. If the system
>>>    started then I knew that I at least had a proof of concept. So far
>>>    no-go.
>>>
>>>
>>> On Tue, Feb 21, 2023 at 8:57 AM Branimir Lambov
>>> <branimir.lam...@datastax.com> wrote:
>>>> It appears to me that the first thing we need to start this feature off is
>>>> a definition of a suite of tests together with a set of rules to keep the
>>>> suite up to date with new features as they are introduced. The moment that
>>>> suite is in place, we can start having some confidence that we can enforce
>>>> downgradability.
>>>>
>>>> Something like this will definitely catch incompatibilities in SSTable
>>>> formats (such as the one in CASSANDRA-17698 that I managed to miss during
>>>> review), but will also be able to identify incompatible system schema
>>>> changes among others, and at the same time rightfully ignore non-breaking
>>>> changes such as modifications to the key cache serialization formats.
>>>>
>>>> Is downgradability in scope for 5.0? It is a feature like any other, and I
>>>> don't see any difficulty adding it (with support for downgrade to 4.x) a
>>>> little later in the 5.x timeline.
>>>>
>>>> Regards,
>>>> Branimir
>>>>
>>>>
>>>> On Tue, Feb 21, 2023 at 9:40 AM Jacek Lewandowski
>>>> <lewandowski.ja...@gmail.com> wrote:
>>>>> I'd like to mention CASSANDRA-17056 (CEP-17) here as it aims to introduce
>>>>> multiple sstable formats support. It allows for providing an
>>>>> implementation of SSTableFormat along with SSTableReader and
>>>>> SSTableWriter. That could be extended easily to support different
>>>>> implementations for certain version ranges, like one impl for ma-nz,
>>>>> other for oa+, etc. without having a confusing implementation with a lot
>>>>> of conditional blocks. Old formats in such case could be maintained
>>>>> separately from the main code and easily switched any time.
>>>>>
>>>>> thanks
>>>>> - - -- --- ----- -------- -------------
>>>>> Jacek Lewandowski
>>>>>
>>>>>
>>>>> On Tue, 21 Feb 2023 at 01:46, Yuki Morishita <yu...@apache.org> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> What I wanted to address in my comment in
>>>>>> CASSANDRA-8110(https://issues.apache.org/jira/browse/CASSANDRA-8110?focusedCommentId=17641705&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17641705)
>>>>>> is to focus on better upgrade experience.
>>>>>>
>>>>>> Upgrading the cluster can be painful for some orgs with mission critical
>>>>>> Cassandra clusters, where they cannot tolerate less availability because
>>>>>> of the inability to replace the downed node.
>>>>>> They also need to plan rolling back to the previous state when something
>>>>>> happens along the way.
>>>>>> The change I proposed in CASSANDRA-8110 is to achieve the goal of at
>>>>>> least enabling SSTable streaming during the upgrade by not upgrading the
>>>>>> SSTable version. This can allow the cluster to easily roll back to the
>>>>>> previous version.
>>>>>> Downgrading SSTable is not the primary focus (though Cassandra needs to
>>>>>> implement the way to write SSTable in older versions, so it is somewhat
>>>>>> related.)
>>>>>>
>>>>>> I'm preparing the design doc for the change.
>>>>>> Also, if I should create a separate ticket from CASSANDRA-8110 for the
>>>>>> clarity of the goal of the change, please let me know.
>>>>>>
>>>>>>
>>>>>> On Tue, Feb 21, 2023 at 5:31 AM Benedict <bened...@apache.org> wrote:
>>>>>>>
>>>>>>> FWIW I think 8110 is the right approach, even if it isn’t a panacea. We
>>>>>>> will have to eventually also tackle system schema changes (probably not
>>>>>>> hard), and may have to think a little carefully about other things, eg
>>>>>>> with TTLs the format change is only the contract about what values can
>>>>>>> be present, so we have to make sure the data validity checks are
>>>>>>> consistent with the format we write. It isn’t as simple as writing an
>>>>>>> earlier version in this case (unless we permit truncating the TTL,
>>>>>>> perhaps)
>>>>>>>
>>>>>>> On 20 Feb 2023, at 20:24, Benedict <bened...@apache.org> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> In a self-organising community, things that aren’t self-policed
>>>>>>>> naturally end up policed in an ad hoc manner, and with difficulty. I’m
>>>>>>>> not sure that’s the same as arbitrary enforcement. It seems to me the
>>>>>>>> real issue is nobody noticed this was agreed and/or forgot and didn’t
>>>>>>>> think about it much.
>>>>>>>>
>>>>>>>> But, even without any prior agreement, it’s perfectly reasonable to
>>>>>>>> request that things do not break compatibility if they do not need to,
>>>>>>>> as part of the normal patch integration process.
>>>>>>>>
>>>>>>>> Issues with 3.1->4.0 aren’t particularly relevant as they predate any
>>>>>>>> agreement to do this. But we can and should address the problem of new
>>>>>>>> columns in schema tables, as this happens often between versions. I’m
>>>>>>>> not sure it has in 4.1 though?
>>>>>>>>
>>>>>>>> Regarding downgrade versions, surely this should simply be the same as
>>>>>>>> upgrade versions we support?
>>>>>>>>
>>>>>>>>
>>>>>>>>> On 20 Feb 2023, at 20:02, Jeff Jirsa <jji...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> I'm not convinced even 8110 addresses this - just writing
>>>>>>>>> sstables in old versions won't help if we ever add things like new
>>>>>>>>> types or new types of collections without other control abilities.
>>>>>>>>> Claude's other email in another thread a few hours ago talks about
>>>>>>>>> some of these surprises - "Specifically during the 3.1 -> 4.0 changes
>>>>>>>>> a column broadcast_port was added to system/local. This means that
>>>>>>>>> 3.1 system can not read the table as it has no definition for it. I
>>>>>>>>> tried marking the column for deletion in the metadata and in the
>>>>>>>>> serialization header. The latter got past the column not found
>>>>>>>>> problem, but I suspect that it just means that data columns after
>>>>>>>>> broadcast_port shifted and so incorrectly read." - this is a harder
>>>>>>>>> problem to solve than just versioning sstables and network protocols.
>>>>>>>>>
>>>>>>>>> Stepping back a bit, we have downgrade ability listed as a goal, but
>>>>>>>>> it's not (as far as I can tell) universally enforced, nor is it clear
>>>>>>>>> at which point we will be able to concretely say "this release can be
>>>>>>>>> downgraded to X". Until we actually define and agree that this is a
>>>>>>>>> real goal with a concrete version where downgrade-ability becomes
>>>>>>>>> real, it feels like things are somewhat arbitrarily enforced, which
>>>>>>>>> is probably very frustrating for people trying to commit work/tickets.
>>>>>>>>>
>>>>>>>>> - Jeff
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Feb 20, 2023 at 11:48 AM Dinesh Joshi <djo...@apache.org> wrote:
>>>>>>>>>> I’m a big fan of maintaining backward compatibility. Downgradability
>>>>>>>>>> implies that we could potentially roll back an upgrade at any time.
>>>>>>>>>> While I don’t think we need to retain the ability to downgrade in
>>>>>>>>>> perpetuity it would be a good objective to maintain strict backward
>>>>>>>>>> compatibility and therefore downgradability until a certain point.
>>>>>>>>>> This would imply versioning metadata and extending it in such a way
>>>>>>>>>> that prior version(s) could continue functioning. This can certainly
>>>>>>>>>> be expensive to implement and might bloat on-disk storage. However,
>>>>>>>>>> we could always offer an option for the operator to optimize the
>>>>>>>>>> on-disk structures for the current version then we can rewrite them
>>>>>>>>>> in the latest version. This optimizes the storage and opens up new
>>>>>>>>>> functionality. This means new features that can work with old
>>>>>>>>>> on-disk structures will be available while others that strictly
>>>>>>>>>> require new versions of the data structures will be unavailable
>>>>>>>>>> until the operator migrates to the new version. This migration IMO
>>>>>>>>>> should be irreversible. Beyond this point the operator will lose the
>>>>>>>>>> ability to downgrade which is ok.
>>>>>>>>>>
>>>>>>>>>> Dinesh
>>>>>>>>>>
>>>>>>>>>>> On Feb 20, 2023, at 10:40 AM, Jake Luciani <jak...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> There has been progress on
>>>>>>>>>>> https://issues.apache.org/jira/plugins/servlet/mobile#issue/CASSANDRA-8928
>>>>>>>>>>>
>>>>>>>>>>> Which is similar to what datastax does for DSE. Would this be an
>>>>>>>>>>> acceptable solution?
>>>>>>>>>>>
>>>>>>>>>>> Jake
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Feb 20, 2023 at 11:17 AM guo Maxwell <cclive1...@gmail.com> wrote:
>>>>>>>>>>>> It seems “An alternative solution is to implement/complete
>>>>>>>>>>>> CASSANDRA-8110
>>>>>>>>>>>> <https://issues.apache.org/jira/browse/CASSANDRA-8110>” can give
>>>>>>>>>>>> us more options if it is finished😉
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Feb 20, 2023 at 11:03 PM Branimir Lambov <blam...@apache.org> wrote:
>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>
>>>>>>>>>>>>> There has been a discussion lately about changes to the sstable
>>>>>>>>>>>>> format in the context of being able to abort a cluster upgrade,
>>>>>>>>>>>>> and the fact that changes to sstables can prevent downgraded
>>>>>>>>>>>>> nodes from reading any data written during their temporary
>>>>>>>>>>>>> operation with the new version.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Most of the discussion is in CASSANDRA-18134
>>>>>>>>>>>>> <https://issues.apache.org/jira/browse/CASSANDRA-18134>, and is
>>>>>>>>>>>>> spreading into CASSANDRA-14227
>>>>>>>>>>>>> <https://issues.apache.org/jira/browse/CASSANDRA-14227> and
>>>>>>>>>>>>> CASSANDRA-17698
>>>>>>>>>>>>> <https://issues.apache.org/jira/browse/CASSANDRA-17698>, none of
>>>>>>>>>>>>> which is a good place to discuss the topic seriously.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Downgradability is a worthy goal and is listed in the current
>>>>>>>>>>>>> roadmap. I would like to open a discussion here on how it would
>>>>>>>>>>>>> be achieved.
>>>>>>>>>>>>>
>>>>>>>>>>>>> My understanding of what has been suggested so far translates to:
>>>>>>>>>>>>> - avoid changes to sstable formats;
>>>>>>>>>>>>> - if there are changes, implement them in a way that is
>>>>>>>>>>>>> backwards-compatible, e.g. by duplicating data, so that a new
>>>>>>>>>>>>> version is presented in a component or portion of a component
>>>>>>>>>>>>> that legacy nodes will not try to read;
>>>>>>>>>>>>> - if the latter is not feasible, make sure the changes are only
>>>>>>>>>>>>> applied if a feature flag has been enabled.
>>>>>>>>>>>>>
>>>>>>>>>>>>> To me this approach introduces several risks:
>>>>>>>>>>>>> - it bloats file and parsing complexity;
>>>>>>>>>>>>> - it discourages improvement (e.g. CASSANDRA-17698 is no longer a
>>>>>>>>>>>>> LHF ticket once this requirement is in place);
>>>>>>>>>>>>> - it needs care to avoid risky solutions to address technical
>>>>>>>>>>>>> issues with the format versioning (e.g. staying on n-versions for
>>>>>>>>>>>>> 5.0 and needing a bump for a 4.1 bugfix might require porting
>>>>>>>>>>>>> over support for new features);
>>>>>>>>>>>>> - it requires separate and uncoordinated solutions to the problem
>>>>>>>>>>>>> and switching mechanisms for each individual change.
>>>>>>>>>>>>>
>>>>>>>>>>>>> An alternative solution is to implement/complete CASSANDRA-8110
>>>>>>>>>>>>> <https://issues.apache.org/jira/browse/CASSANDRA-8110>, which
>>>>>>>>>>>>> provides a method of writing sstables for a target version.
>>>>>>>>>>>>> During upgrades, a node could be set to produce sstables
>>>>>>>>>>>>> corresponding to the older version, and there is a very
>>>>>>>>>>>>> straightforward way to implement modifications to formats like
>>>>>>>>>>>>> the tickets above to conform to its requirements.
>>>>>>>>>>>>>
>>>>>>>>>>>>> What do people think should be the way forward?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Branimir
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> you are the apple of my eye !
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> http://twitter.com/tjake
>>>>
>>>> --
>>>> Branimir Lambov
>>>> e. branimir.lam...@datastax.com
>>>> w. www.datastax.com