In AWS we do OS upgrades by standing up an equivalent set of nodes and then, in a rolling fashion, moving the EBS mounts, selectively syncing the necessary Cassandra settings, and starting up the new node.
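Per node, the volume move itself is roughly along these lines (a minimal sketch, assuming boto3 and placeholder volume/instance IDs; the real tooling wraps this in the drain/stop/start and gossip checks mentioned below):

# Minimal sketch of moving one node's data volume, assuming boto3 and
# placeholder volume/instance IDs. Cassandra on the old node should already
# be drained and stopped before this runs.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

def move_data_volume(volume_id: str, new_instance_id: str, device: str = "/dev/xvdf") -> None:
    # Detach from the old node and wait until the volume is free.
    ec2.detach_volume(VolumeId=volume_id)
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])

    # Attach to the replacement node and wait until it shows as in-use.
    ec2.attach_volume(VolumeId=volume_id, InstanceId=new_instance_id, Device=device)
    ec2.get_waiter("volume_in_use").wait(VolumeIds=[volume_id])
    # Next: mount the device on the new node, sync settings, start Cassandra.

# move_data_volume("vol-0123456789abcdef0", "i-0123456789abcdef0")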
The downtime is fairly minimal since the EBS detach/attach is pretty quick, and you don't need to sit through a long, exhaustive in-place OS upgrade (we are hopping from Ubuntu 16 LTS to 18 LTS, so that would be a very long upgrade). We already have a lot of code from our Cassandra version upgrade scripts to do checks, Cassandra shutdown/startup, waits, gossip stability checks, and the like.

We are getting bitten by the us-east-1e availability zone being "deprecated" and not having m5 instances, so those nodes have to be rsynced instead. We do a pre-rsync of the node while it is still live, another rsync right before the shutdown, then shut down and do a final incremental rsync, which also keeps the downtime low, though the initial rsync can take a while depending on the node size (a rough sketch of the staged copy is at the bottom of this message). Remember that you don't have to rsync the snapshots.

On Tue, Feb 4, 2020 at 12:03 PM Jeff Jirsa <jji...@gmail.com> wrote:

> The original question was OS (and/or JDK) upgrades, and for those: "do it in QA first", bounce non-replicas, and let them come up before proceeding.
>
> If you're doing Cassandra itself, it's a REALLY REALLY REALLY good idea to try the upgrade on a backup of your production cluster first. While it's uncommon, there were certain things allowed in very old versions of Cassandra that are occasionally missed in upgrade tests and can be found the hard way. Sometimes those incompatible schema changes should have been disallowed but weren't; sometimes they leave data on disk that Cassandra 3.0/3.11/etc. knows is illegal or invalid and will throw exceptions on when trying to read it or run upgradesstables. If you don't have a backup of prod for whatever reason (high churn, ephemeral data, whatever), at least try a backup of one host.
>
> This is especially important if you did a lot of schema changes in 1.0/1.2/2.0-era clusters: take a snapshot, fire it up in a new cluster, and try upgrading offline first.
>
> On Tue, Feb 4, 2020 at 9:29 AM Reid Pinchback <rpinchb...@tripadvisor.com> wrote:
>
>> Another thing I'll add, since I don't think any of the other responders brought it up.
>>
>> This all assumes that you already believe that the update is safe. If you have any kind of test cluster, I'd evaluate the change there first.
>>
>> While I haven't hit it with C* specifically, I have seen database problems from the O/S related to shared library updates. With database technologies mostly built from HA pairs, your blast radius for data corruption can be limited, so by the time you realize the problem you haven't risked more than you intended. C* is a bit more biased to propagate a lot of data across a common infrastructure. Just keep that in mind when reviewing what the changes are in the O/S updates. When the updates are related to external utilities and services, but not to shared libraries, usually your only degradation risks relate to performance and availability, so you are in a safer position to forge ahead and then it only comes down to proper C* hygiene. Performance and connectivity risks can be mitigated by having one DC you update first, then letting it stew awhile as you evaluate the results. Plan in advance what you want to evaluate before continuing on.
>>
>> R
>>
>> On 1/30/20, 11:09 AM, "Michael Shuler" <mich...@pbandjelly.org> wrote:
>>
>> That is some good info. To add just a little more, knowing what the pending security updates are for your nodes helps in knowing what to do afterwards. Read the security update notes from your vendor.
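As an illustration of checking what's pending, here is a minimal sketch assuming a Debian/Ubuntu node, where security updates come from a "-security" pocket (RHEL-family hosts would use yum/dnf updateinfo instead):

# Minimal sketch: list pending packages that come from a "-security" pocket.
# Assumes a Debian/Ubuntu host with apt; adjust for your distribution.
import subprocess

def pending_security_updates():
    out = subprocess.run(
        ["apt", "list", "--upgradable"],
        capture_output=True, text=True, check=True,
    ).stdout
    # Typical line: "openssh-server/bionic-security 1:7.6p1-4ubuntu0.7 amd64 [upgradable from: ...]"
    return sorted(line.split("/")[0] for line in out.splitlines() if "-security" in line)

if __name__ == "__main__":
    for pkg in pending_security_updates():
        print(pkg)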
>> Java or Cassandra update? Of course the service needs to be restarted: rolling upgrade and restart the `cassandra` service as usual.
>>
>> Linux kernel update? The node needs a full reboot, so follow a rolling reboot plan.
>>
>> Other OS updates? Most can be done without affecting Cassandra at all. For instance, an OpenSSH security update to patch some vulnerability should most certainly be applied as soon as possible, and those node updates can even be done in parallel without causing any problems for the JVM or the Cassandra service. Most intelligent package update systems will install the update and restart the affected service, in this hypothetical case `sshd`.
>>
>> Michael
>>
>> On 1/30/20 3:56 AM, Erick Ramirez wrote:
>> > There is no need to shut down the application, because you should be able to carry out the operating system upgrade without an outage to the database, particularly since you have a lot of nodes in your cluster.
>> >
>> > Provided your cluster has sufficient capacity, you might even have the ability to upgrade multiple nodes in parallel to reduce the upgrade window. If you decide to do nodes in parallel and you fully understand the token allocations and where the nodes are positioned in the ring in each DC, make sure you only upgrade nodes which are at least 5 nodes "away" to the right, so you know none of the nodes have overlapping token ranges and they're not replicas of each other.
>> >
>> > Other points to consider are:
>> >
>> > * If a node goes down (for whatever reason), I suggest you upgrade the OS on that node before bringing it back up. It's already down, so you might as well take advantage of it since you have so many nodes to upgrade.
>> > * Resist the urge to run nodetool decommission or nodetool removenode if you encounter an issue while upgrading a node. This is a common knee-jerk reaction which can prove costly, because the cluster will rebalance automatically, adding more time to your upgrade window. Either fix the problem on the server or replace the node using the "replace_address" flag.
>> > * Test, test, and test again. Familiarity with the process is your friend when the unexpected happens.
>> > * Plan ahead and rehearse your recovery method (i.e. replacing the node) should you run into unexpected issues.
>> > * Stick to the plan and be prepared to implement it; don't deviate. Don't spend 4 hours or more investigating why a server won't start.
>> > * Be decisive. Activate your recovery/remediation plan immediately.
>> >
>> > I'm sure others will chime in with their recommendations. Let us know how you go, as I'm sure others would be interested in hearing about your experience. Not a lot of shops have a deployment as large as yours, so you are in an enviable position. Good luck!
>> >
>> > On Thu, Jan 30, 2020 at 3:45 PM Anshu Vajpayee <anshu.vajpa...@gmail.com> wrote:
>> >
>> > Hi Team,
>> > What is the best way to patch the OS of a 1000-node multi-DC Cassandra cluster where we cannot suspend application traffic (we can redirect traffic to one DC)?
>> >
>> > Please suggest if anyone has any best practice around it.
>> >
>> > --
>> > Cheers,
>> > Anshu V
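And the staged copy referenced at the top of the thread, as a minimal sketch assuming ssh/rsync access from the new host and the default data directory (adjust the path to your data_file_directories):

# Minimal sketch of the staged rsync: each pass is incremental, so the passes
# while the old node is live do the bulk of the work and the final pass after
# shutdown only copies the tail. Assumes ssh access and the default data dir.
import subprocess

DATA_DIR = "/var/lib/cassandra/data"  # adjust to your data_file_directories

def rsync_pass(old_host):
    subprocess.run(
        [
            "rsync", "-a", "--delete",
            "--exclude", "snapshots/",   # snapshots don't need to be copied
            f"{old_host}:{DATA_DIR}/",
            f"{DATA_DIR}/",
        ],
        check=True,
    )

# 1. rsync_pass(old_host)  while the old node is still up (the long initial copy)
# 2. rsync_pass(old_host)  again right before stopping Cassandra on the old node
# 3. stop Cassandra on the old node, then one final rsync_pass(old_host)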