Just a heads-up: at least one issue has been reported when upgrading a multi-DC cluster from 3.x to 4.x with node-to-node SSL/TLS encryption enabled. It has largely been attributed to the secure port changing to 9142 in 4.x, whereas in 3.x it stays on 9042 (the same port as non-SSL/TLS).
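
If you want to see which of those two ports each node actually accepts connections on before and during the upgrade, a rough sketch along these lines might help. The function name and the default port tuple are my own choices for illustration; it only checks plain TCP reachability (no TLS handshake), and the ports your nodes use depend on your cassandra.yaml.

```python
import socket

# Ports mentioned in this thread; adjust to whatever your configuration sets.
CANDIDATE_PORTS = (9042, 9142)


def open_ports(host, ports=CANDIDATE_PORTS, timeout=2.0):
    """Return the subset of `ports` that accept a TCP connection on `host`.

    This is only a reachability probe: it does not attempt a TLS handshake,
    so it cannot tell an encrypted listener from an unencrypted one.
    """
    reachable = []
    for port in ports:
        try:
            # create_connection raises OSError on refusal or timeout.
            with socket.create_connection((host, port), timeout=timeout):
                reachable.append(port)
        except OSError:
            pass
    return reachable
```

Calling `open_ports("10.0.0.1")` (a made-up address) on a 3.x node and on an upgraded 4.x node should show whether the two really listen on different ports in your setup.
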
On Thu, Oct 26, 2023 at 2:03 PM Sebastian Marsching <sebast...@marsching.com> wrote:

> Hi,
>
> as we are currently facing the same challenge (upgrading an existing
> cluster from C* 3 to C* 4), I wanted to share our strategy with you. It
> largely is what Scott already suggested, but I have some extra details, so
> I thought it might still be useful.
>
> We duplicated our cluster using the strategy described at
> http://adamhutson.com/cloning-cassandra-clusters-the-fast-way/. Of course
> it is possible to figure out all the steps on your own, but I feel like
> this detailed guide saved me at least a few hours, if not days. Instead of
> restoring from a backup, we chose to create a snapshot on the live nodes
> and copy the data from there, but this does not really change the overall
> process.
>
> We only run a single data-center cluster, but I think that this process
> easily translates to a multi data-center setup. In this case, you can
> choose to only clone a single data center or you can clone a few or all of
> them, if you deem this to be necessary for your tests. The only
> “limitation” is that for each data center that you clone, you need exactly
> the same number of nodes in your test cluster that you have in the
> respective data center of your production cluster.
>
> Once the cluster is cloned, you can test whatever you like (e.g. upgrade
> to C* 4, test operations in a mixed-version cluster, etc.).
>
> Our experience with the upgrade from C* 3.11 to C* 4.1 on the test cluster
> was quite smooth. The only problem that we saw was that when later adding a
> second data center to the test cluster, we got a lot of
> CorruptSSTableExceptions on one of the nodes in the existing data center.
> We first attributed this to the upgrade, but later we found out that this
> also happens when running on C* 3.11.
>
> We now believe that the hardware of one of the nodes that we used for the
> test cluster has a defect, because the exceptions were limited to this
> exact node, even after moving data around. It just took us a while to
> figure this out, because the hardware for the test cluster was brand new,
> so “broken hardware” wasn’t our first guess. We are still in the process of
> definitively proving that this specific piece of hardware is broken, but we
> are now sufficiently confident in the stability of C* 4 that we are soon
> going to move forward with upgrading the production cluster.
>
> -Sebastian
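
On the node-count requirement Sebastian mentions above (each cloned data center needs exactly as many test nodes as the production data center has), a quick sanity check before copying snapshots could look roughly like this. The function name and the dict-of-lists layout are hypothetical; you would fill the dicts from your own inventory.

```python
def node_count_mismatches(prod, test):
    """Return {dc: (prod_count, test_count)} for cloned DCs whose counts differ.

    `prod` and `test` map a data-center name to a list of node addresses.
    Only data centers present in `test` are checked, since cloning a subset
    of the production data centers is allowed.
    """
    return {
        dc: (len(prod.get(dc, [])), len(nodes))
        for dc, nodes in test.items()
        if len(prod.get(dc, [])) != len(nodes)
    }
```

An empty result means every cloned data center matches production; anything else lists the data centers to fix, with the production count first.
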