Just a heads-up: at least one issue has been reported when upgrading a
multi-DC cluster from 3.x to 4.x where the cluster uses node-to-node
SSL/TLS encryption. This is largely attributed to the fact that the secure
port changes to 9142 in 4.x, whereas in 3.x it runs on 9042 (the same port
as non-SSL/TLS).
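
In case it helps, here is a minimal sketch (Python, standard library only)
for checking which of those ports a node actually accepts TCP connections on,
before and after upgrading it. The function names are my own placeholders,
not anything that ships with Cassandra, and the port numbers are simply the
ones mentioned above, so adjust them to whatever your cassandra.yaml says.
It only tests TCP reachability; it does not perform a TLS handshake.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (refused, unreachable, timed out).
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def probe(hosts, ports, timeout: float = 2.0):
    """Map each host to {port: reachable?} for every port in `ports`."""
    return {
        host: {port: port_is_open(host, port, timeout) for port in ports}
        for host in hosts
    }
```

Calling something like probe(["10.0.0.1", "10.0.0.2"], [9042, 9142]) (the
addresses are hypothetical) on a mixed-version test cluster should show
quickly whether the upgraded and non-upgraded nodes listen on different
ports.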

On Thu, Oct 26, 2023 at 2:03 PM Sebastian Marsching <sebast...@marsching.com>
wrote:

> Hi,
>
> as we are currently facing the same challenge (upgrading an existing
> cluster from C* 3 to C* 4), I wanted to share our strategy with you. It
> largely is what Scott already suggested, but I have some extra details, so
> I thought it might still be useful.
>
> We duplicated our cluster using the strategy described at
> http://adamhutson.com/cloning-cassandra-clusters-the-fast-way/. Of course
> it is possible to figure out all the steps on your own, but I feel like
> this detailed guide saved me at least a few hours, if not days. Instead of
> restoring from a backup, we chose to create a snapshot on the live nodes
> and copy the data from there, but this does not really change the overall
> process.
>
> We run only a single-data-center cluster, but I think that this process
> easily translates to a multi-data-center setup. In this case, you can
> choose to only clone a single data center or you can clone a few or all of
> them, if you deem this to be necessary for your tests. The only
> “limitation” is that for each data center that you clone, you need exactly
> the same number of nodes in your test cluster that you have in the
> respective data center of your production cluster.
>
> Once the cluster is cloned, you can test whatever you like (e.g. upgrade
> to C* 4, test operations in a mixed-version cluster, etc.).
>
> Our experience with the upgrade from C* 3.11 to C* 4.1 on the test cluster
> was quite smooth. The only problem that we saw was that when later adding a
> second data center to the test cluster, we got a lot of
> CorruptSSTableExceptions on one of the nodes in the existing data center.
> We first attributed this to the upgrade, but later we found out that this
> also happens when running on C* 3.11.
>
> We now believe that the hardware of one of the nodes that we used for the
> test cluster has a defect, because the exceptions were limited to this
> exact node, even after moving data around. It just took us a while to
> figure this out, because the hardware for the test cluster was brand new,
> so “broken hardware” wasn’t our first guess. We are still in the process of
> definitively proving that this specific piece of hardware is broken, but we
> are now sufficiently confident in the stability of C* 4 that we are soon
> going to move forward with upgrading the production cluster.
>
> -Sebastian
>
>
