There is no need to shut down the application because you should be able to
carry out the operating system upgrade without an outage to the database,
particularly since you have a lot of nodes in your cluster.
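
A minimal per-node sequence might look something like this (a sketch
assuming a systemd-managed service and a yum-based OS; adjust the service
name and package manager for your environment):

    # drain the node so it flushes memtables and stops accepting traffic
    nodetool drain
    sudo systemctl stop cassandra
    # apply the OS patches and reboot if a new kernel was installed
    sudo yum update -y
    sudo reboot
    # once it's back, confirm the node shows UN (Up/Normal) before moving on
    nodetool status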

Provided your cluster has sufficient capacity, you may even be able to
upgrade multiple nodes in parallel to reduce the upgrade window. If you
decide to do nodes in parallel and you fully understand the token
allocations and where the nodes are positioned in the ring in each DC, make
sure you only upgrade nodes which are at least 5 nodes "away" from each
other to the right, so you know none of the nodes have overlapping token
ranges and they are not replicas of each other.
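
Before taking two nodes down together, verify they don't own replicas of
the same data. For example (the keyspace, table, and partition key below
are placeholders):

    # list the replicas that own a sample partition key
    nodetool getendpoints my_keyspace my_table 'some_key'
    # or review token ownership across the ring for a keyspace
    nodetool ring my_keyspace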

Other points to consider are:

   - If a node goes down (for whatever reason), I suggest you upgrade the
   OS on the node before bringing it back up. It's already down, so you
   might as well take advantage of the outage since you have so many nodes
   to upgrade.
   - Resist the urge to run nodetool decommission or nodetool removenode if
   you encounter an issue while upgrading a node. This is a common knee-jerk
   reaction that can prove costly because the cluster will rebalance
   automatically, adding more time to your upgrade window. Either fix the
   problem on the server or replace the node using the "replace_address"
   flag (see the sketch after this list).
   - Test, test, and test again. Familiarity with the process is your
   friend when the unexpected happens.
   - Plan ahead and rehearse your recovery method (i.e., replacing the
   node) should you run into unexpected issues.
   - Stick to the plan and be prepared to implement it -- don't deviate.
   Don't spend 4 hours or more investigating why a server won't start.
   - Be decisive. Activate your recovery/remediation plan immediately.
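
For reference, here is a sketch of the node replacement mentioned above,
assuming a package install where JVM options are set in cassandra-env.sh
(the path varies by distribution):

    # on the replacement node, before the first start, point it at the
    # dead node's IP address:
    echo 'JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=<dead_node_ip>"' \
        | sudo tee -a /etc/cassandra/cassandra-env.sh
    sudo systemctl start cassandra
    # the node bootstraps into the dead node's token ranges; remove the
    # flag from cassandra-env.sh once it has fully joined the ring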

I'm sure others will chime in with their recommendations. Let us know how
you go as I'm sure others would be interested in hearing about your
experience. Not a lot of shops have a deployment as large as yours so you
are in an enviable position. Good luck!

On Thu, Jan 30, 2020 at 3:45 PM Anshu Vajpayee <anshu.vajpa...@gmail.com>
wrote:

> Hi Team,
> What is the best way to patch the OS of a 1000-node multi-DC Cassandra
> cluster where we cannot suspend application traffic (we can redirect
> traffic to one DC)?
>
> Please suggest if anyone has any best practice around it.
>
> --
> Cheers,
> Anshu V
