We have multiple tickets about to merge that introduce new on-disk format changes. I see no reason to block those indefinitely while we figure out how to do on-disk format downgrades.
-Jeremiah

> On Feb 22, 2023, at 3:12 PM, Benedict <bened...@apache.org> wrote:
>
> Ok I will be honest, I was fairly sure we hadn’t yet broken downgrade - but I
> was wrong. CASSANDRA-18061 introduced a new column to a system table, which
> is a breaking change.
>
> But that’s it, as far as I can tell. I have run a downgrade test successfully
> after reverting that ticket, using the one line patch below. This makes every
> in-jvm upgrade test also a downgrade test. I’m sure somebody more familiar
> with dtests can readily do the same there.
>
> While we look to fix 18061 and enable downgrade tests (and get a clean run of
> the full suite), can we all agree not to introduce new breaking changes?
>
>
> index e41444fe52..085b25f8af 100644
> --- a/test/distributed/org/apache/cassandra/distributed/upgrade/UpgradeTestBase.java
> +++ b/test/distributed/org/apache/cassandra/distributed/upgrade/UpgradeTestBase.java
> @@ -104,6 +104,7 @@ public class UpgradeTestBase extends DistributedTestBase
>              .addEdge(v40, v41)
>              .addEdge(v40, v42)
>              .addEdge(v41, v42)
> +            .addEdge(v42, v41)
>              .build();
>
>> On 22 Feb 2023, at 15:08, Jeff Jirsa <jji...@gmail.com> wrote:
>>
>> When people are serious about this requirement, they’ll build the downgrade
>> equivalents of the upgrade tests and run them automatically, often, so
>> people understand what the real gap is and when something new makes it break
>>
>> Until those tests exist, I think collectively we should all stop pretending
>> like this is dogma. Best effort is best effort.
>>
>>> On Feb 22, 2023, at 6:57 AM, Branimir Lambov <branimir.lam...@datastax.com>
>>> wrote:
>>>
>>> > 1. Major SSTable changes should begin with forward-compatibility in a
>>> > prior release.
>>>
>>> This requires "feature" changes, i.e. new non-trivial code for previous
>>> patch releases. It also entails porting over any further format
>>> modification.
>>>
>>> Instead of this, in combination with your second point, why not implement
>>> backwards write compatibility? The opt-in is then clearer to define (i.e.
>>> upgrades start with e.g. a "4.1-compatible" settings set that includes file
>>> format compatibility and disabling of new features, new nodes start with
>>> "current" settings set). When the upgrade completes and the user is happy
>>> with the result, the settings set can be replaced.
>>>
>>> Doesn't this achieve what you want (and we all agree is a worthy goal) with
>>> much less effort for everyone? Supporting backwards-compatible writing is
>>> trivial, and we even have a proof-of-concept in the stats metadata
>>> serializer. It also simplifies by a serious margin the amount of work and
>>> thinking one has to do when a format improvement is implemented -- e.g. the
>>> TTL patch can just address this in exactly the way the problem was
>>> addressed in earlier versions of the format, by capping to 2038, without
>>> any need to specify, obey or test any configuration flags.
>>>
>>> >> It’s a commitment, and it requires every contributor to consider it as
>>> >> part of work they produce.
>>>
>>> > But it shouldn't be a burden. Ability to downgrade is a testable problem,
>>> > so I see this work as a function of the suite of tests the project is
>>> > willing to agree on supporting.
>>>
>>> I fully agree with this sentiment, and I feel that the current "try to not
>>> introduce breaking changes" approach is adding the burden, but not the
>>> benefits -- because the latter cannot be proven, and are most likely
>>> already broken.
>>>
>>> Regards,
>>> Branimir
>>>
>>> On Wed, Feb 22, 2023 at 1:01 AM Abe Ratnofsky <a...@aber.io> wrote:
>>>> Some interesting existing work on this subject is "Understanding and
>>>> Detecting Software Upgrade Failures in Distributed Systems" -
>>>> https://dl.acm.org/doi/10.1145/3477132.3483577,
>>>> also summarized by Andrey Satarin here:
>>>> https://asatarin.github.io/talks/2022-09-upgrade-failures-in-distributed-systems/
>>>>
>>>> They specifically tested Cassandra upgrades, and have a solid list of
>>>> defects that they found. They also describe their testing mechanism
>>>> DUPTester, which includes a component that confirms that the leftover
>>>> state from one version can start up on the next version. There is a wider
>>>> scope of upgrade defects highlighted in the paper, beyond SSTable version
>>>> support.
>>>>
>>>> I believe the project would benefit from expanding our test suite
>>>> similarly, by parametrizing more tests on upgrade version pairs.
>>>>
>>>> Also, per Benedict's comment:
>>>>
>>>> > It’s a commitment, and it requires every contributor to consider it as
>>>> > part of work they produce.
>>>>
>>>> But it shouldn't be a burden. Ability to downgrade is a testable problem,
>>>> so I see this work as a function of the suite of tests the project is
>>>> willing to agree on supporting.
>>>>
>>>> Specifically - I agree with Scott's proposal to emulate the HDFS
>>>> upgrade-then-finalize approach.
>>>> I would also support automatic finalization based on a time threshold or
>>>> similar, to balance the priorities of safe and straightforward upgrades.
>>>> Users need to be aware of the range of SSTable formats supported by a
>>>> given version, and how to handle when their SSTables wouldn't be
>>>> supported by an upcoming upgrade.
>>>>
>>>> --
>>>> Abe
>>>
>>>
>>> --
>>> Branimir Lambov
>>> e. branimir.lam...@datastax.com
>>> w. www.datastax.com
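
For reference, the upgrade-then-finalize idea quoted above (with automatic finalization after a time threshold) could be sketched roughly as below. This is only an illustration under stated assumptions: the class UpgradeFinalizer, its methods, and the integer format versions are all hypothetical and do not exist in Cassandra. The point it shows is that a node keeps writing the previous format, and so stays downgradable, until the upgrade is finalized either explicitly or by the threshold.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch; none of these names are part of Cassandra.
final class UpgradeFinalizer {
    private final int previousFormat;         // format written before the upgrade (assumed int id)
    private final int currentFormat;          // newest format this version can write
    private final Instant upgradedAt;         // when the node started on the new version
    private final Duration autoFinalizeAfter; // time threshold for automatic finalization
    private boolean explicitlyFinalized;

    UpgradeFinalizer(int previousFormat, int currentFormat,
                     Instant upgradedAt, Duration autoFinalizeAfter) {
        this.previousFormat = previousFormat;
        this.currentFormat = currentFormat;
        this.upgradedAt = upgradedAt;
        this.autoFinalizeAfter = autoFinalizeAfter;
    }

    // Operator opts in to the new format once happy with the upgrade.
    void finalizeUpgrade() {
        explicitlyFinalized = true;
    }

    // Finalized either explicitly or because the threshold has elapsed.
    boolean isFinalized(Instant now) {
        return explicitlyFinalized || !now.isBefore(upgradedAt.plus(autoFinalizeAfter));
    }

    // Until finalization, keep writing the format the previous version can read.
    int writeFormat(Instant now) {
        return isFinalized(now) ? currentFormat : previousFormat;
    }

    // Downgrade stays possible only while no new-format files have been written.
    boolean canDowngrade(Instant now) {
        return !isFinalized(now);
    }
}
```

With a seven-day threshold, such a node would write the old format on day one and switch to the new format on day eight unless finalizeUpgrade() was called earlier.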