Ok ok, there are a number of strong arguments for keeping sstable formats
around for much longer than the previous major Cassandra version. I will
unset fixVersion on 18312  :-)


Taking a look at the history of sstable formats: they were first introduced
in version 0.7, and minor versions were introduced in version 1.0.3 with hb.

Looking at when we have dropped support and cleaned up the code for past
formats:

 - Versions before 1.2.5: formats <=ib were removed in CASSANDRA-5511
https://github.com/apache/cassandra/commit/7f2c3a8e40f97c626def5c510d77c1da3d9ae926

 - Version 1.2.5: format ic was removed in CASSANDRA-6869
https://github.com/apache/cassandra/commit/8e172c8563a995808a72a1a7e81a06f3c2a355ce

 - All pre-3.0 formats were removed in CASSANDRA-12716
https://github.com/apache/cassandra/commit/4a2464192e9e69457f5a5ecf26c094f9298bf069


That dropping the n* formats right now would be only a small reduction in
code, roughly double the size of 6869's patch, I agree with.  But say that
there is never any complexity and that we should keep formats in
perpetuity, and I'm sitting here having a heart attack, srsly.  I can also
appreciate that coming up with a good rule of thumb in advance is difficult
when we just don't know how many formats there will be and what they will
introduce.


From Aleksey:
> But it’s one thing to require a two rolling restarts (3.0 to 4.0, 4.0 to
> 5.0), it’s another to require the operator to upgrade every single m*
> sstable to n*.


Good point.

I _always_ recommend users upgrade all sstables, before and after every
major upgrade, though I recognise how easy it is to forget or err in that
process, and we don't need to punish operators unnecessarily.  Also worth
noting: since 4.x we have `automatic_sstable_upgrade` (which is wisely
false by default).

Question/Suggestion: should we improve gossip to include the oldest sstable
format each node has, and have a newer-versioned node fail/warn on joining
if it does not support that older format?  That is, should we give a clear
signal back to operators that their rolling upgrade is not going to work
smoothly, and that they are going to hit nodes they will need to stop and
run upgradesstables on (leaving them in a state of mixed versions and nodes
busy upgrading…)?
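
As a strawman of that check, here is a minimal Python sketch. It assumes
sstable format versions are two-letter strings that order lexicographically
(e.g. mc < na < nb), and that each node's oldest on-disk format has somehow
been made available per peer. The function names and the peer map are
hypothetical, not existing gossip state.

```python
from typing import Dict, Iterable, List, Optional


def oldest_format(sstable_versions: Iterable[str]) -> Optional[str]:
    """Return the oldest format version a node holds, or None if it has no sstables.

    Assumption: versions compare lexicographically, so 'mc' sorts before 'nb'.
    """
    return min(sstable_versions, default=None)


def blocked_peers(min_supported: str,
                  peers_oldest: Dict[str, Optional[str]]) -> List[str]:
    """Return peers whose oldest sstable format predates `min_supported`.

    `min_supported` is the oldest format the new release can still read.
    `peers_oldest` maps a peer address to the value that peer would publish
    (hypothetically via gossip); None means the peer has no sstables.
    """
    return sorted(
        peer
        for peer, oldest in peers_oldest.items()
        if oldest is not None and oldest < min_supported
    )
```

A non-empty result is the signal to warn (or refuse to proceed) before the
rolling upgrade reaches those nodes.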


From Scott:

> To expand on the final point he makes re: requiring SSTables be fully
> rewritten prior to rev'ing from 4.x to 5.x (if the cluster previously ran
> 3.x) –
>
> This would also invalidate incremental backups. Operators would either be
> required to perform a full snapshot backup of each cluster to object
> storage prior to upgrading from 4.x to 5.x; or to enumerate the contents of
> all snapshots from an incremental backup series to ensure that no m*-series
> SSTables were present prior to upgrading.
>
> If one failed to take on the work to do so, incremental backup snapshots
> would not be restorable to a 5.x cluster if an m*-series SSTable were
> present.
>
>
Again, I would always recommend a backup before each major upgrade, and I
would think this has become standard advice.  On sstables residing in
storage, and the need to do a full backup: that's another good point, but
one I think we might solve in a smarter way (see below).
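
The enumeration Scott describes could be scripted along these lines. This
is only a sketch: it assumes the 3.0+ file naming layout
`<version>-<generation>-<format>-Data.db`, that versions order
lexicographically, and that the backup series has been listed or synced
into a local directory. `min_supported` and the function names are made up
for illustration.

```python
import re
from pathlib import Path
from typing import List, Optional

# Assumed 3.0+ layout, e.g. "mc-1-big-Data.db"; older layouts are not matched.
_DATA_FILE = re.compile(r"^(?P<version>[a-z]{2})-[^-]+-[a-z]+-Data\.db$")


def sstable_version(filename: str) -> Optional[str]:
    """Extract the two-letter format version from a Data.db file name."""
    match = _DATA_FILE.match(filename)
    return match.group("version") if match else None


def find_old_sstables(root: str, min_supported: str = "na") -> List[Path]:
    """List Data.db files under `root` whose version sorts before `min_supported`."""
    old = []
    for path in Path(root).rglob("*-Data.db"):
        version = sstable_version(path.name)
        if version is not None and version < min_supported:
            old.append(path)
    return sorted(old)
```

An empty list means no sstable older than `min_supported` was found among
the files that match the assumed naming layout.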


From Aleksey:

> 2. It’s very stable and battle tested at this point
>
>

I beg to differ on this. We don't test it, and upgrade code gets limited
production time.  And I bet operators are less incentivised to file bug
reports on upgrade issues so long as they get through the upgrade one way
or another (and I bet many issues pop up way too late, like the numerous
range tombstone issues over many 3.11.x versions).

We could be testing it more, and IMHO we should…



> 5. There are third-party tools that I know of which benefit from a single
> C* jar that can read all relevant stable versions, and relevant here
> includes 3.0 ones
>
>

I suggest we should have a way to read/write from/to all sstable versions;
I absolutely agree this is useful (e.g. backups in storage). And we should
be better at thorough testing.

With such use-cases applying only to node-local and offline scenarios,
we can tackle this cross-branch, i.e. take the best of both worlds: simpler
_tested_ code, and forward (and hopefully backward) compatibility _into
perpetuity_.

One example of this is if we could stream sstableupgrades, e.g.
```
   # read from disk any l* sstables, write to disk latest m format
   sstableupgrade-3.11 --stream-output -f jb-1-big-Data.db |
     sstableupgrade-5.0 --stream-input
```
Sure, this is no longer "single C* jar", but that seems a minor trade-off
to get something better. The idea of cross-branch functionality and testing
is nothing new to us (e.g. jvm dtests). Note, this approach would likely be
slower unless you threw cpu+mem at it. And it is applicable regardless of
which format compatibility policy we decide on…

The suggestion, even if it's only a strawman, raises some other questions …

- Why doesn't sstableupgrade today upgrade sstables in parallel, or take a
file argument so the operator can parallelise it? It seems that we're
wasting a lot of time and resources (while the node is offline!), and quite
possibly reducing cluster availability for longer-than-necessary periods.

- And, if the recommended approach is online (`nodetool upgradesstables`)
then shouldn't we move the sstableupgrade script to tools/bin/ ?
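
On the first question, the coarsest parallelism an operator can get today
is one `sstableupgrade <keyspace> <table>` process per table. A rough
Python sketch, assuming concurrent runs on different tables don't interfere
with each other (worth verifying before relying on it); `upgrade_tables`
and its parameters are hypothetical:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List, Sequence, Tuple


def upgrade_tables(tables: Iterable[Tuple[str, str]],
                   workers: int = 4,
                   command: Sequence[str] = ("sstableupgrade",)
                   ) -> List[Tuple[str, str, int]]:
    """Run one upgrade process per (keyspace, table), `workers` at a time.

    Returns (keyspace, table, exit_code) tuples in the same order as `tables`.
    `command` is overridable so the wrapper can be exercised without Cassandra.
    """
    def run(ks_table: Tuple[str, str]) -> Tuple[str, str, int]:
        keyspace, table = ks_table
        result = subprocess.run([*command, keyspace, table],
                                capture_output=True, text=True)
        return (keyspace, table, result.returncode)

    # Threads are enough here: each worker only waits on a child process.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run, tables))
```

This still leaves a single large table as the bottleneck, which is where a
per-file argument would help.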
