> Saying that there is never any complexity and we should keep formats in 
> perpetuity, and I'm sitting here having a heart attack, srsly. 

Nobody is claiming that. Don’t let a straw man give you a heart attack.

> Though I _always_ recommend users upgrade all sstables, before and after 
> every major upgrade.  But I recognise how easy it is to forget or err in that 
> process, and we don't need to punish operators unnecessarily. 

Bit arrogant to call this an error. An operator might have good reasons to not 
do this. Clusters can be quite large, after all.

> Again, I would always recommend a backup before each major upgrade, and I 
> would think this has become standard advice.  On sstables residing in 
> storage, and the need to do a full backup, that's another good point, but 
> which I think we might solve in a smarter way (see below).

See above. Also, not seeing the follow up to “see below”.

> I beg to differ on this. We don't test it, and upgrade code gets limited 
> production time.

The code to read -m* sstables has been heavily battle-tested. Luckily for the 
project, there are users who test this *very* thoroughly before upgrading, who 
are incentivised to file bug reports, and are very much capable of fixing them 
while at it.

As for precedent - we (including me) have done a lot of stupid shit over the 
years on this project. Half the time “this is how we’ve historically done X” to 
me is a strong argument to start doing things differently. This is one such 
case.
 
—
AY

> On 17 Mar 2023, at 14:24, Mick Semb Wever <m...@apache.org> wrote:
> 
> 
> Ok ok, there's a number of strong arguments to keep sstable formats around 
> for much longer than the previous major Cassandra version, I will unset 
> fixVersion on 18312  :-)   
> 
> 
> Taking a look at the history of sstable formats. They were first introduced 
> in version 0.7, and minor versions introduced in version 1.0.3 with hb.
> 
> Looking at when we have dropped support and cleaned up the code for past 
> formats.
> 
>  - Versions before 1.2.5: formats <=ib; were removed in CASSANDRA-5511
> https://github.com/apache/cassandra/commit/7f2c3a8e40f97c626def5c510d77c1da3d9ae926
> 
>  - Version 1.2.5: format ic; was removed in CASSANDRA-6869
> https://github.com/apache/cassandra/commit/8e172c8563a995808a72a1a7e81a06f3c2a355ce
> 
>  - All pre-3.0 formats were removed in CASSANDRA-12716 
> https://github.com/apache/cassandra/commit/4a2464192e9e69457f5a5ecf26c094f9298bf069
>  
> 
> Saying that dropping the n* formats right now is such a small reduction in 
> code, roughly double the size of 6869's patch, I agree with.  Saying that 
> there is never any complexity and we should keep formats in perpetuity, and 
> I'm sitting here having a heart attack, srsly.  I can also appreciate coming 
> up with a good rule of thumb in advance is difficult when we just don't know 
> how many formats there will be and what they will introduce.
> 
> 
> From Aleksey:
> > But it’s one thing to require two rolling restarts (3.0 to 4.0, 4.0 to 
> > 5.0), it’s another to require the operator to upgrade every single m* 
> > sstable to n*. 
> 
> 
> Good point. 
> 
> Though I _always_ recommend users upgrade all sstables, before and after 
> every major upgrade.  But I recognise how easy it is to forget or err in that 
> process, and we don't need to punish operators unnecessarily.  Also worth 
> noting since 4.x we have `automatic_sstable_upgrade` (which is wisely false 
> by default).
> 
> Question/Suggestion: should we improve gossip to include the oldest format 
> a node has, and ensure a newer-versioned node fails or warns on joining if 
> it does not support that older format?  That is, should we give a clear signal 
> back to operators that their rolling upgrade is not going to work smoothly, 
> that they are going to hit nodes they will need to stop and do 
> upgradesstables on (leaving them in a state of mix-versions and nodes busy 
> upgrading…)
> 
> 
> From Scott:
>> To expand on the final point he makes re: requiring SSTables be fully 
>> rewritten prior to rev'ing from 4.x to 5.x (if the cluster previously ran 
>> 3.x) –
>> 
>> This would also invalidate incremental backups. Operators would either be 
>> required to perform a full snapshot backup of each cluster to object storage 
>> prior to upgrading from 4.x to 5.x; or to enumerate the contents of all 
>> snapshots from an incremental backup series to ensure that no m*-series 
>> SSTables were present prior to upgrading.
>> 
>> If one failed to take on the work to do so, incremental backup snapshots 
>> would not be restorable to a 5.x cluster if an m*-series SSTable were 
>> present.
> 
> Again, I would always recommend a backup before each major upgrade, and I 
> would think this has become standard advice.  On sstables residing in 
> storage, and the need to do a full backup, that's another good point, but 
> which I think we might solve in a smarter way (see below).
> 
>  
> From Aleksey:
>>> 2. It’s very stable and battle tested at this point
> 
> 
> I beg to differ on this. We don't test it, and upgrade code gets limited 
> production time.  And I bet operators are less incentivised to file bug 
> reports on upgrade issues so long as they get through the upgrade one way or 
> another (and I bet many issues pop up way too late, like the numerous range 
> tombstone issues over many 3.11.x versions).
> 
> We could be testing it more, and IMHO we should…
> 
>  
>>> 5. There are third-party tools that I know of which benefit from a single 
>>> C* jar that can read all relevant stable versions, and relevant here 
>>> includes 3.0 ones
> 
>  
> I suggest we should have a way to read/write from/to all sstable versions, I 
> absolutely agree this is useful (e.g. backups in storage). And we should be 
> better at thorough testing. 
> 
> With such use-cases applying only to node-local and offline scenarios, 
> we can tackle this cross-branch, i.e. take the best of both worlds: simpler 
> _tested_ code, and forward (and hopefully backward) compatibility _into 
> perpetuity_. 
> 
> One example of this is if we could stream sstableupgrades, e.g.
> ```
>    # read from disk any l* sstables, write to disk latest m format
>    sstableupgrade-3.11 --stream-output -f jb-1-big-Data.db  | 
> sstableupgrade-5.0 --stream-input 
> ```
> Sure, this is no longer "single C* jar", but that seems a minor trade-off to 
> get something better. The idea of cross-branch functionality and testing is 
> nothing new to us (e.g. jvm dtests). Note, this approach would likely be 
> slower unless you threw cpu+mem at it. And it is applicable regardless of 
> which format compatibility policy we decide on… 
> 
> The suggestion, even if it's only a strawman, raises some other questions …
> 
> - Why doesn't sstableupgrade today upgrade sstables in parallel, or take a 
> file argument so the operator can parallelise it? It seems that we're wasting 
> a lot of time and resources (while the node is offline!), and quite possibly 
> reducing cluster availability for longer-than-necessary periods.
> 
> - And, if the recommended approach is online (`nodetool upgradesstables`) 
> then shouldn't we move the sstableupgrade script to tools/bin/ ? 
> 
>  
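
On Scott's point above about enumerating snapshot contents to confirm that no m*-series SSTables are present: a minimal sketch of what that check could look like, not an existing tool. It assumes the 3.0+ file naming of `<version>-<generation>-<format>-<component>` (e.g. `md-1-big-Data.db`) and that the snapshots sit on a local filesystem; the function name and the directory layout in the demo are hypothetical.

```python
import re
import tempfile
from pathlib import Path

# Assumed 3.0+ naming: <version>-<generation>-<format>-Data.db, e.g. md-1-big-Data.db.
# Older layouts (keyspace-table-jb-1-Data.db) are deliberately not matched here.
SSTABLE_DATA = re.compile(
    r"^(?P<version>[a-z]{2})-(?P<generation>[^-]+)-(?P<format>[a-z]+)-Data\.db$"
)


def find_sstables_with_prefix(root, prefix="m"):
    """Return sorted paths of Data.db files under root whose format
    version starts with the given prefix (hypothetical helper)."""
    hits = []
    for path in Path(root).rglob("*-Data.db"):
        match = SSTABLE_DATA.match(path.name)
        if match and match.group("version").startswith(prefix):
            hits.append(path)
    return sorted(hits)


if __name__ == "__main__":
    # Demo against a throwaway directory standing in for a snapshot tree.
    with tempfile.TemporaryDirectory() as tmp:
        snap = Path(tmp) / "ks" / "tbl" / "snapshots" / "demo"
        snap.mkdir(parents=True)
        for name in ("md-1-big-Data.db", "nb-2-big-Data.db", "nb-2-big-Index.db"):
            (snap / name).touch()
        for found in find_sstables_with_prefix(tmp):
            print(found.name)
```

An empty result would mean nothing in that tree matches the assumed naming with an m* version. It says nothing about files stored under a different naming scheme or in object storage, which would need their own listing step.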
