Ok ok, there are a number of strong arguments for keeping sstable formats around for much longer than the previous major Cassandra version. I will unset fixVersion on 18312 :-)

Taking a look at the history of sstable formats: they were first introduced in version 0.7, and minor versions were introduced in version 1.0.3 with hb.

Looking at when we have dropped support and cleaned up the code for past formats:
 - Versions before 1.2.5: formats <=ib were removed in CASSANDRA-5511 https://github.com/apache/cassandra/commit/7f2c3a8e40f97c626def5c510d77c1da3d9ae926
 - Version 1.2.5: format ic was removed in CASSANDRA-6869 https://github.com/apache/cassandra/commit/8e172c8563a995808a72a1a7e81a06f3c2a355ce
 - All pre-3.0 formats were removed in CASSANDRA-12716 https://github.com/apache/cassandra/commit/4a2464192e9e69457f5a5ecf26c094f9298bf069

That dropping the n* formats right now is only a small reduction in code, roughly double the size of 6869's patch, I agree with. But say that there is never any complexity and that we should keep formats in perpetuity, and I'm sitting here having a heart attack, srsly.

I can also appreciate that coming up with a good rule of thumb in advance is difficult when we just don't know how many formats there will be and what they will introduce.

From Aleksey:
> But it’s one thing to require a two rolling restarts (3.0 to 4.0, 4.0 to 5.0), it’s another to require the operator to upgrade every single m* sstable to n*.

Good point. Though I _always_ recommend users upgrade all sstables, before and after every major upgrade. But I recognise how easy it is to forget or err in that process, and we don't need to punish operators unnecessarily. Also worth noting that since 4.x we have `automatic_sstable_upgrade` (which is wisely false by default).

Question/Suggestion: should we improve gossip to include the oldest format a node has, and ensure a newer-versioned node fails or warns on joining if it does not support that older format?
That is, should we give a clear signal back to operators that their rolling upgrade is not going to work smoothly, that they are going to hit nodes they will need to stop and run upgradesstables on (leaving them in a state of mixed versions and nodes busy upgrading…)?

From Scott:
> To expand on the final point he makes re: requiring SSTables be fully
> rewritten prior to rev'ing from 4.x to 5.x (if the cluster previously ran
> 3.x) –
>
> This would also invalidate incremental backups. Operators would either be
> required to perform a full snapshot backup of each cluster to object
> storage prior to upgrading from 4.x to 5.x; or to enumerate the contents of
> all snapshots from an incremental backup series to ensure that no m*-series
> SSTables were present prior to upgrading.
>
> If one failed to take on the work to do so, incremental backup snapshots
> would not be restorable to a 5.x cluster if an m*-series SSTable were
> present.

Again, I would always recommend a backup before each major upgrade, and I would think this has become standard advice. On sstables residing in storage, and the need to do a full backup, that's another good point, but one which I think we might solve in a smarter way (see below).

From Aleksey:
> 2. It’s very stable and battle tested at this point

I beg to differ on this. We don't test it, and upgrade code gets limited production time. And I bet operators are less incentivised to file bug reports on upgrade issues so long as they get through the upgrade one way or another (and I bet many issues pop up way too late, like the numerous range tombstone issues over many 3.11.x versions). We could be testing it more, and IMHO we should…

> 5. There are third-party tools that I know of which benefit from a single
> C* jar that can read all relevant stable versions, and relevant here
> includes 3.0 ones

I suggest we should have a way to read/write from/to all sstable versions; I absolutely agree this is useful (e.g. backups in storage).
And we should be better at thorough testing. With such use-cases applying only to node-local and offline scenarios, we can tackle this cross-branch, i.e. take the best of both worlds: simpler _tested_ code, and forward (and hopefully backward) compatibility _into perpetuity_.

One example of this is if we could stream sstableupgrades, e.g.
```
# read from disk any l* sstables, write to disk latest m format
sstableupgrade-3.11 --stream-output -f jb-1-big-Data.db | sstableupgrade-5.0 --stream-input
```

Sure, this is no longer "single C* jar", but that seems a minor trade-off to get something better. The idea of cross-branch functionality and testing is nothing new to us (e.g. jvm dtests). Note, this approach would likely be slower unless you threw cpu+mem at it. And it is applicable regardless of what format compatibility policy we decide on…

The suggestion, even if it's only a strawman, raises some other questions…
 - Why doesn't sstableupgrade today upgrade sstables in parallel, or take a file argument so the operator can parallelise it? It seems that we're wasting a lot of time and resources (while the node is offline!), and quite possibly reducing cluster availability for longer-than-necessary periods.
 - And, if the recommended approach is online (`nodetool upgradesstables`), then shouldn't we move the sstableupgrade script to tools/bin/?
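
On the parallelism question, a minimal sketch of what operator-side fan-out could look like at table granularity, since `sstableupgrade` is invoked per keyspace and table. This is an illustration and not a recommended procedure: it assumes the node is stopped and that running several `sstableupgrade` processes concurrently is safe on the version in use, which I have not verified. `UPGRADE_CMD`, `PARALLELISM` and `upgrade_tables` are names made up for this sketch.

```shell
#!/usr/bin/env bash
# Sketch only: fan out the offline sstableupgrade tool across tables.
# Assumptions (not from the thread): the node is stopped, and running
# several sstableupgrade processes at once is safe on your version;
# verify both before using anything like this.
set -euo pipefail

# Hypothetical knobs for this sketch; neither is a Cassandra setting.
UPGRADE_CMD="${UPGRADE_CMD:-sstableupgrade}"
PARALLELISM="${PARALLELISM:-4}"

# Reads "keyspace table" pairs from stdin, one pair per line, and runs
# "$UPGRADE_CMD <keyspace> <table>" for each pair, with at most
# $PARALLELISM processes running at a time.
upgrade_tables() {
  xargs -n 2 -P "$PARALLELISM" "$UPGRADE_CMD"
}
```

Usage would be along the lines of `printf 'ks1 users\nks1 events\n' | upgrade_tables`. Each invocation is its own JVM, so memory use grows with `PARALLELISM`, and this still cannot split the work within a single large table, which is the gap the file-argument question points at.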