By retry logic, I’m going to guess you are doing some kind of version 
consistency trick where a non-key column manages a visibility horizon to 
simulate a transaction, and you poll for a horizon value >= some threshold 
that the app keeps track of.
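
If that guess is close, a minimal sketch of the pattern I mean is below, using 
the DataStax Java driver. The table (app.doc_state), the "horizon" column, and 
the backoff values are my assumptions, not anything from your schema:

    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.DefaultConsistencyLevel;
    import com.datastax.oss.driver.api.core.cql.Row;
    import com.datastax.oss.driver.api.core.cql.SimpleStatement;

    final class HorizonPoller {
        // Poll until the non-key "horizon" column reaches the value the app
        // expects, or give up after maxAttempts. Every attempt is another
        // LOCAL_ONE read, which is where the extra cluster load comes from.
        static boolean awaitHorizon(CqlSession session, String docId,
                                    long expectedHorizon, int maxAttempts)
                throws InterruptedException {
            SimpleStatement read = SimpleStatement
                .newInstance("SELECT horizon FROM app.doc_state WHERE doc_id = ?", docId)
                .setConsistencyLevel(DefaultConsistencyLevel.LOCAL_ONE);
            for (int attempt = 0; attempt < maxAttempts; attempt++) {
                Row row = session.execute(read).one();
                if (row != null && row.getLong("horizon") >= expectedHorizon) {
                    return true;              // a replica has caught up to the horizon
                }
                Thread.sleep(50L << attempt); // exponential backoff between polls
            }
            return false;                     // horizon never reached; caller retries or fails
        }
    }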

Note that these assorted variations on doing battle with eventual consistency 
can generate a lot of load on the cluster, unless there is enough latency in 
the logical flow at the app level that the optimistic concurrency hack almost 
always succeeds on the first try anyway.

If this generates the degree of Java garbage collection pressure that I 
suspect, then the advice to upgrade C* becomes even more significant. Repairs 
themselves can generate substantial memory load, and you could have a node or 
two drop out on you if they OOM. I’d definitely take Jeff’s advice about 
switching your reads to LOCAL_QUORUM until you’re done, to buffer yourself 
from that risk.
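
In case it helps, here is a rough sketch of that read-side change with the 
DataStax Java driver 4.x (keyspace, table, and bind values are placeholders). 
You can also apply it globally via basic.request.consistency = LOCAL_QUORUM in 
the driver config instead of per statement:

    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.DefaultConsistencyLevel;
    import com.datastax.oss.driver.api.core.cql.Row;
    import com.datastax.oss.driver.api.core.cql.SimpleStatement;

    // Read at LOCAL_QUORUM only for the duration of the RF change + repair,
    // then drop back to LOCAL_ONE once the new replicas are fully repaired.
    SimpleStatement read = SimpleStatement
        .newInstance("SELECT * FROM app.orders WHERE order_id = ?", orderId)
        .setConsistencyLevel(DefaultConsistencyLevel.LOCAL_QUORUM);
    Row row = session.execute(read).one();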


From: Leena Ghatpande <lghatpa...@hotmail.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Tuesday, May 26, 2020 at 1:20 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: any risks with changing replication factor on live production 
cluster without downtime and service interruption?

Thank you for the response. We will follow the recommendation for the update. 
So with reads at LOCAL_QUORUM we should see some added latency, but not 
failures, during the RF change, right?

We do mitigate the issue of not seeing writes when reading at LOCAL_ONE by 
having retry logic in the app.


________________________________
From: Leena Ghatpande <lghatpa...@hotmail.com>
Sent: Friday, May 22, 2020 11:51 AM
To: cassandra cassandra <user@cassandra.apache.org>
Subject: any risks with changing replication factor on live production cluster 
without downtime and service interruption?

We are on Cassandra 3.7 and have a 12-node cluster, 2 DCs, with 6 nodes in 
each DC. RF=3.
We have around 150M rows across tables.

We are planning to add more nodes to the cluster and are thinking of changing 
the replication factor to 5 for each DC.

Our application uses the following consistency levels:
 read-level: LOCAL_ONE
 write-level: LOCAL_QUORUM

If we change to RF=5 on the live cluster and run full repairs, would we see 
read/write errors while data is being replicated?
If so, this is not something we can afford in production, so how would we 
avoid it?
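
For reference, the change itself would look roughly like the sketch below 
(this assumes DC names dc1/dc2 and a keyspace called app; substitute your real 
names). Note the ALTER only changes metadata: the two new replicas per DC hold 
no data until repair streams it to them, which is why LOCAL_ONE reads can miss 
rows in the meantime.

    // Executed once from any client, e.g. via the Java driver:
    session.execute(SimpleStatement.newInstance(
        "ALTER KEYSPACE app WITH replication = "
        + "{'class': 'NetworkTopologyStrategy', 'dc1': 5, 'dc2': 5}"));

    // Then on each node, stream data to the new replicas:
    //   nodetool repair -full app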
