In AWS we do OS upgrades by standing up an equivalent set of nodes, and
then, in a rolling fashion, we move the EBS mounts, selectively sync the
necessary Cassandra settings, and start up the new node.
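
Roughly, the volume move is just "detach, wait, attach"; here is a minimal
boto3 sketch of that step (the volume/instance IDs, device name, and region
are placeholders, not our real values):

    # Hypothetical sketch of the "move the EBS mount" step; IDs and the
    # device name are placeholders looked up from inventory in practice.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    def move_data_volume(volume_id, old_instance_id, new_instance_id,
                         device="/dev/xvdf"):
        """Detach the Cassandra data volume from the old node, wait for it
        to become available, then attach it to the replacement node."""
        ec2.detach_volume(VolumeId=volume_id, InstanceId=old_instance_id)
        ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])

        ec2.attach_volume(VolumeId=volume_id,
                          InstanceId=new_instance_id,
                          Device=device)
        ec2.get_waiter("volume_in_use").wait(VolumeIds=[volume_id])
        # The new node still has to mount the filesystem and start
        # Cassandra; that part is handled by the rest of the scripts.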

The downtime is fairly minimal since the EBS detach/attach is pretty quick,
and you don't need to wait for a long, exhaustive in-place OS upgrade
process (we are hopping from Ubuntu 16 LTS to 18 LTS, so that would be a
long upgrade).

We already have lots of code from our Cassandra version upgrade scripts to
do checks, Cassandra shutdown/startup, waits, gossip stability checks, and
the like.
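
For illustration, the gossip-stability wait is roughly the sketch below
(simplified; it assumes nodetool is on the PATH and just polls
`nodetool status` until every node reports UN for a few consecutive checks):

    # Simplified example of a gossip-stability wait; the real scripts do
    # more checks than this.
    import re
    import subprocess
    import time

    STATUS_RE = re.compile(r"^([UD][NLJM])\s", re.MULTILINE)

    def wait_for_stable_ring(polls=3, interval=30, timeout=1800):
        """Return once several consecutive `nodetool status` polls show
        every node Up/Normal (UN); raise if the timeout is hit first."""
        deadline = time.time() + timeout
        good = 0
        while time.time() < deadline:
            out = subprocess.run(["nodetool", "status"], capture_output=True,
                                 text=True, check=True).stdout
            states = STATUS_RE.findall(out)
            good = good + 1 if states and all(s == "UN" for s in states) else 0
            if good >= polls:
                return
            time.sleep(interval)
        raise TimeoutError("ring did not stabilize within the timeout")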

We are getting bitten by the us-east-1e availability zone being
"deprecated" and not having m5 instances, so for those nodes we have to
rsync the data instead. We do a pre-rsync of the node while it is still
running, then another rsync right before the shutdown, then shut down and
do a final incremental rsync. That also keeps the downtime low, but the
initial rsync can take a while depending on the node size.

Remember, you don't have to rsync the snapshots.
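
The staged copy boils down to something like the sketch below (the paths,
service name, and rsync flags are placeholders for whatever your tooling
uses; the exclude is what skips the snapshot directories):

    # Hedged sketch of the staged rsync; run on the old node, pushing to
    # the replacement. Paths and the service name are assumptions.
    import subprocess

    DATA_DIR = "/var/lib/cassandra/data/"
    RSYNC = ["rsync", "-aH", "--delete",
             "--exclude", "snapshots/",   # snapshots don't need to move
             DATA_DIR]

    def sync_to(new_host):
        subprocess.run(RSYNC + [f"{new_host}:{DATA_DIR}"], check=True)

    def migrate(new_host):
        sync_to(new_host)   # 1. long initial copy while the node is still up
        sync_to(new_host)   # 2. catch-up pass right before the shutdown
        # flush and stop Cassandra (drain, then stop the service)
        subprocess.run(["nodetool", "drain"], check=True)
        subprocess.run(["sudo", "systemctl", "stop", "cassandra"], check=True)
        sync_to(new_host)   # 3. final incremental pass while Cassandra is down
        # ...then start Cassandra on the new node and run the usual checks.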

On Tue, Feb 4, 2020 at 12:03 PM Jeff Jirsa <jji...@gmail.com> wrote:

> The original question was about OS (and/or JDK) upgrades, and for those:
> "do it in QA first", bounce non-replicas, and let them come up before
> proceeding.
>
> If you're doing Cassandra itself, it's a REALLY REALLY REALLY good idea to
> try the upgrade on a backup of your production cluster first. While it's
> uncommon, there were certain things allowed in very old versions of
> Cassandra that are occasionally missed in upgrade tests and can be found
> the hard way. Sometimes those incompatible schema changes should have been
> disallowed but weren't - sometimes they leave data on disk that Cassandra
> 3.0/3.11/etc. knows is illegal or invalid and throws exceptions when trying
> to read it or run upgradesstables. If you don't have a backup of prod for
> whatever reason (high churn, ephemeral data, whatever), try at least a
> backup of one host.
>
> This is especially true if you may have done a lot of schema changes in
> 1.0/1.2/2.0-era clusters - take a snapshot, fire it up in a new cluster,
> and try upgrading offline first.
>
> On Tue, Feb 4, 2020 at 9:29 AM Reid Pinchback <rpinchb...@tripadvisor.com>
> wrote:
>
>> Another thing I'll add, since I don't think any of the other responders
>> brought it up.
>>
>> This all assumes that you already believe that the update is safe.  If
>> you have any kind of test cluster, I'd evaluate the change there first.
>>
>> While I haven't hit it with C* specifically, I have seen database
>> problems from the O/S related to shared library updates.  With database
>> technologies mostly formed from HA pairs, your blast radius on data
>> corruption can be limited, so by the time you realize the problem you
>> haven't risked more than you intended. C* is a bit more biased to propagate
>> a lot of data across a common infrastructure.  Just keep that in mind when
>> reviewing what the changes are in the O/S updates.  When the updates are
>> related to external utilities and services, but not to shared libraries,
>> usually your only degradation risks relate to performance and
>> availability, so you are in a safer position to forge ahead, and then it
>> only comes down
>> to proper C* hygiene. Performance and connectivity risks can be mitigated
>> by having one DC you update first, and then letting it stew awhile as you
>> evaluate the results. Plan in advance what you want to evaluate before
>> continuing on.
>>
>> R
>>
>> On 1/30/20, 11:09 AM, "Michael Shuler" <mich...@pbandjelly.org> wrote:
>>
>>     That is some good info. To add just a little more, knowing what the
>>     pending security updates are for your nodes helps in knowing what to
>>     do after. Read the security update notes from your vendor.
>>
>>     Java or Cassandra update? Of course the service needs to be restarted
>>     - do a rolling upgrade and restart the `cassandra` service as usual.
>>
>>     Linux kernel update? Node needs a full reboot, so follow a rolling
>>     reboot plan.
>>
>>     Other OS updates? Most can be done without affecting Cassandra. For
>>     instance, an OpenSSH security update to patch some vulnerability
>>     should most certainly be done as soon as possible, and the node
>>     updates can even be done in parallel without causing any problems
>>     with the JVM or Cassandra service. Most intelligent package update
>>     systems will install the update and restart the affected service, in
>>     this hypothetical case `sshd`.
>>
>>     Michael
>>
>>     On 1/30/20 3:56 AM, Erick Ramirez wrote:
>>     > There is no need to shut down the application because you should be
>>     > able to carry out the operating system upgrade without an outage to
>>     > the database, particularly since you have a lot of nodes in your
>>     > cluster.
>>     >
>>     > Provided your cluster has sufficient capacity, you might even have
>>     > the ability to upgrade multiple nodes in parallel to reduce the
>>     > upgrade window. If you decide to do nodes in parallel and you fully
>>     > understand the token allocations and where the nodes are positioned
>>     > in the ring in each DC, make sure you only upgrade nodes which are
>>     > at least 5 nodes "away" to the right so you know none of the nodes
>>     > would have overlapping token ranges and they're not replicas of
>>     > each other.
>>     >
>>     > Other points to consider are:
>>     >
>>     >   * If a node goes down (for whatever reason), I suggest you
>>     >     upgrade the OS on the node before bringing it back up. It's
>>     >     already down, so you might as well take advantage of it since
>>     >     you have so many nodes to upgrade.
>>     >   * Resist the urge to run nodetool decommission or nodetool
>>     >     removenode if you encounter an issue while upgrading a node.
>>     >     This is a common knee-jerk reaction which can prove costly
>>     >     because the cluster will rebalance automatically, adding more
>>     >     time to your upgrade window. Either fix the problem on the
>>     >     server or replace the node using the "replace_address" flag.
>>     >   * Test, test, and test again. Familiarity with the process is
>>     >     your friend when the unexpected happens.
>>     >   * Plan ahead and rehearse your recovery method (i.e. replace the
>>     >     node) should you run into unexpected issues.
>>     >   * Stick to the plan and be prepared to implement it -- don't
>>     >     deviate. Don't spend 4 hours or more investigating why a server
>>     >     won't start.
>>     >   * Be decisive. Activate your recovery/remediation plan
>>     >     immediately.
>>     >
>>     > I'm sure others will chime in with their recommendations. Let us
>>     > know how you go, as I'm sure others would be interested in hearing
>>     > about your experience. Not a lot of shops have a deployment as
>>     > large as yours, so you are in an enviable position. Good luck!
>>     >
>>     > On Thu, Jan 30, 2020 at 3:45 PM Anshu Vajpayee
>>     > <anshu.vajpa...@gmail.com> wrote:
>>     >
>>     >     Hi Team,
>>     >     What is the best way to patch the OS of a 1000-node multi-DC
>>     >     Cassandra cluster where we cannot suspend application traffic
>>     >     (we can redirect traffic to one DC)?
>>     >
>>     >     Please suggest if anyone has any best practices around it.
>>     >
>>     >     --
>>     >     Cheers,
>>     >     Anshu V
>>     >
>>
>>     ---------------------------------------------------------------------
>>     To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>>     For additional commands, e-mail: user-h...@cassandra.apache.org
>>
>>
>>
>>
