By retry logic, I’m going to guess you are doing some kind of version-consistency trick: a non-key column tracks a visibility horizon to simulate a transaction, and the app polls until it sees a horizon value >= some threshold it is keeping track of.
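If that’s roughly the pattern, here is a minimal sketch of what I’m picturing, assuming the Python cassandra-driver; the table, column, and function names are hypothetical, not something from your thread:

    import time

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    cluster = Cluster(["10.0.0.1"])
    session = cluster.connect("app_keyspace")

    # Poll a non-key column that records how far writes have become visible.
    poll = SimpleStatement(
        "SELECT visibility_horizon FROM doc_state WHERE doc_id = %s",
        consistency_level=ConsistencyLevel.LOCAL_ONE,  # your current read level
    )

    def wait_for_horizon(doc_id, threshold, timeout=5.0, interval=0.05):
        # Retry until the horizon column catches up with the threshold
        # the app is tracking, or give up at the deadline.
        deadline = time.time() + timeout
        while time.time() < deadline:
            row = session.execute(poll, (doc_id,)).one()
            if row is not None and row.visibility_horizon >= threshold:
                return True
            time.sleep(interval)
        return False

Every pass through that loop is another read against the cluster, which is the load I’m worried about below.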
Note that these assorted variations on doing battle with eventual consistency can generate a lot of load on the cluster, unless there is enough latency in the progress of the logical flow at the app level that the optimistic-concurrency hack almost always succeeds on the first try anyway. If this generates the degree of Java garbage collection that I suspect, then the advice to upgrade C* becomes even more significant. Repairs themselves can generate substantial memory load, and you could have a node or two drop out on you if they OOM. I’d definitely take Jeff’s advice about switching your reads to LOCAL_QUORUM until you’re done, to buffer yourself from that risk; there’s a sketch of the whole sequence at the bottom of this message.

From: Leena Ghatpande <lghatpa...@hotmail.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Tuesday, May 26, 2020 at 1:20 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: any risks with changing replication factor on live production cluster without downtime and service interruption?

Thank you for the response. We will follow the recommendation for the update. So with reads at LOCAL_QUORUM we should see some latency, but not failures, during the RF change, right? We mitigate the issue of not seeing writes at LOCAL_ONE by having retry logic in the app.

________________________________
From: Leena Ghatpande <lghatpa...@hotmail.com>
Sent: Friday, May 22, 2020 11:51 AM
To: cassandra cassandra <user@cassandra.apache.org>
Subject: any risks with changing replication factor on live production cluster without downtime and service interruption?

We are on Cassandra 3.7 and have a 12-node cluster: 2 DCs with 6 nodes in each DC, RF=3. We have around 150M rows across tables. We are planning to add more nodes to the cluster and are thinking of changing the replication factor to 5 in each DC.

Our application uses the consistency levels below:
read-level: LOCAL_ONE
write-level: LOCAL_QUORUM

If we change to RF=5 on the live cluster and run full repairs, would we see read/write errors while data is being replicated? If so, how would we avoid this? It is not something we can afford in production.
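For what it’s worth, the read switch and RF change being discussed would look something like the sketch below. This is a rough outline, not a tested procedure; it again assumes the Python cassandra-driver, and the keyspace and datacenter names are hypothetical:

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    cluster = Cluster(["10.0.0.1"])
    session = cluster.connect()

    # 1. Before touching the RF, move application reads from LOCAL_ONE to
    #    LOCAL_QUORUM so reads keep overlapping the replicas that actually
    #    hold the data while the new replicas are still being populated.
    read = SimpleStatement(
        "SELECT * FROM app_keyspace.some_table WHERE id = %s",
        consistency_level=ConsistencyLevel.LOCAL_QUORUM,
    )

    # 2. Raise the replication factor in both DCs.
    session.execute(
        "ALTER KEYSPACE app_keyspace WITH replication = "
        "{'class': 'NetworkTopologyStrategy', 'dc1': 5, 'dc2': 5}"
    )

    # 3. Stream the data to the new replicas; run on each node, outside
    #    this script:
    #      nodetool repair --full app_keyspace

    # 4. Only after full repair has completed on every node, consider
    #    moving reads back to LOCAL_ONE.

Expect some extra read latency at LOCAL_QUORUM while this is in flight, but that is the price of not serving stale or missing rows from the not-yet-repaired replicas.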