One clarification question, Jeff. AFAIK, *nodetool cleanup* also internally goes through the same compaction path as regular compaction. Then why do we have to wait for CEP-21 to clean up unowned data in the regular compaction path? Wouldn't it be as simple as having regular compaction invoke the code of *nodetool cleanup*? In other words, without CEP-21, why is *nodetool cleanup* a safe operation while doing the same thing in regular compaction isn't?
Jaydeep

On Fri, May 5, 2023 at 11:58 AM Jaydeep Chovatia <chovatia.jayd...@gmail.com> wrote:

> Thanks, Jeff, for the detailed steps and summary. We will keep the community (this thread) up to date on how it plays out in our fleet.
>
> Jaydeep
>
> On Fri, May 5, 2023 at 9:10 AM Jeff Jirsa <jji...@gmail.com> wrote:
>
>> Lots of caveats on these suggestions; let me try to hit most of them.
>>
>> Cleanup in parallel is good, fine, and common. Limit the number of threads in cleanup if you're using lots of vnodes, so each node runs one at a time and not all nodes use all your cores at the same time.
>>
>> If a host is fully offline, you can ALSO use replace address first boot. It'll stream data right to that host with the same token assignments you had before, and no cleanup is needed then. Strictly speaking, to avoid resurrection here, you'd want to run repair on the replicas of the down host (for vnodes, probably the whole cluster), but your current process doesn't guarantee that either (decom + bootstrap may resurrect, strictly speaking).
>>
>> Dropping vnodes will reduce the replicas that have to be cleaned up, but also potentially increase your imbalance on each replacement.
>>
>> Cassandra should still do this on its own, and I think once CEP-21 is committed, this should be one of the first enhancement tickets.
>>
>> Until then, LeveledCompactionStrategy really does make cleanup fast and cheap, at the cost of higher IO the rest of the time. If you can tolerate that higher IO, you'll probably appreciate LCS anyway (faster reads, faster data deletion than STCS). It's a lot of IO compared to STCS, though.
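For anyone following along in the archive, here is a rough sketch of what Jeff's suggestions look like as commands. The keyspace, table, IP address, and job count are placeholders, not values from this thread:

    # Run cleanup with a limited number of concurrent jobs so it doesn't
    # occupy every compaction thread on the node
    nodetool cleanup -j 2 my_keyspace

    # Replace a fully-offline node in place (same token assignments, so no
    # cleanup is needed afterwards): set this on the replacement node before
    # its first start, e.g. in cassandra-env.sh
    JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=10.0.0.5"

    # Switch a table to LCS if the extra background IO is acceptable
    cqlsh -e "ALTER TABLE my_keyspace.my_table WITH compaction = {'class': 'LeveledCompactionStrategy'};"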
>> On Fri, May 5, 2023 at 9:02 AM Jaydeep Chovatia <chovatia.jayd...@gmail.com> wrote:
>>
>>> Thanks all for your valuable inputs. We will try some of the suggested methods in this thread and see how it goes. We will keep you updated on our progress. Thanks a lot once again!
>>>
>>> Jaydeep
>>>
>>> On Fri, May 5, 2023 at 8:55 AM Bowen Song via user <user@cassandra.apache.org> wrote:
>>>
>>>> Depending on the number of vnodes per server, the probability and severity (i.e. the size of the affected token ranges) of an availability degradation due to a server failure during node replacement may be small. You also have the choice of increasing the RF if that's still not acceptable.
>>>>
>>>> Also, reducing the number of vnodes per server can limit the number of servers affected by replacing a single server, therefore reducing the amount of time required to run "nodetool cleanup" if it is run sequentially.
>>>>
>>>> Finally, you may choose to run "nodetool cleanup" concurrently on multiple nodes to reduce the amount of time required to complete it.
>>>>
>>>> On 05/05/2023 16:26, Runtian Liu wrote:
>>>>
>>>> We are doing the "adding a node then decommissioning a node" approach to achieve better availability. Replacing a node requires shutting down one node first; if another node is down during the node replacement period, we will see an availability drop, because most of our use cases are LOCAL_QUORUM with replication factor 3.
>>>>
>>>> On Fri, May 5, 2023 at 5:59 AM Bowen Song via user <user@cassandra.apache.org> wrote:
>>>>
>>>>> Have you thought of using "-Dcassandra.replace_address_first_boot=..." (or "-Dcassandra.replace_address=..." if you are using an older version)? This will not result in a topology change, which means "nodetool cleanup" is not needed after the operation is completed.
>>>>>
>>>>> On 05/05/2023 05:24, Jaydeep Chovatia wrote:
>>>>>
>>>>> Thanks, Jeff!
>>>>> But in our environment we replace nodes quite often for various optimization purposes, say almost 1 node per day (node *addition* followed by node *decommission*, which of course changes the topology), and we have a cluster of 100 nodes with 300GB per node. If we have to run cleanup on 100 nodes after every replacement, then it could take forever.
>>>>> What is the recommendation until we get this fixed in Cassandra itself as part of compaction (without externally triggering *cleanup*)?
>>>>>
>>>>> Jaydeep
>>>>>
>>>>> On Thu, May 4, 2023 at 8:14 PM Jeff Jirsa <jji...@gmail.com> wrote:
>>>>>
>>>>>> Cleanup is fast and cheap and basically a no-op if you haven't changed the ring.
>>>>>>
>>>>>> After Cassandra has transactional cluster metadata to make ring changes strongly consistent, Cassandra should do this in every compaction. But until then it's left for operators to run when they're sure the state of the ring is correct.
>>>>>>
>>>>>> On May 4, 2023, at 7:41 PM, Jaydeep Chovatia <chovatia.jayd...@gmail.com> wrote:
>>>>>>
>>>>>> Isn't this considered a kind of *bug* in Cassandra? Because, as we know, *cleanup* is a lengthy and unreliable operation, relying on *cleanup* means higher chances of data resurrection.
>>>>>> Do you think we should discard the unowned token ranges as part of the regular compaction itself? What are the pitfalls of doing this as part of compaction itself?
>>>>>>
>>>>>> Jaydeep
>>>>>>
>>>>>> On Thu, May 4, 2023 at 7:25 PM guo Maxwell <cclive1...@gmail.com> wrote:
>>>>>>
>>>>>>> Compaction will just merge duplicate data and remove deleted data on this node. If you add or remove a node from the cluster, I think cleanup is needed. If cleanup failed, I think we should look into the reason.
>>>>>>>
>>>>>>> Runtian Liu <curly...@gmail.com> wrote on Fri, May 5, 2023 at 06:37:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> Is cleanup the sole method to remove data that does not belong to a specific node? In a cluster where nodes are added or decommissioned from time to time, failure to run cleanup may lead to data resurrection issues, as deleted data may remain on a node that lost ownership of certain partitions. Or is it true that normal compactions can also handle data removal for nodes that no longer have ownership of certain data?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Runtian
>>>>>>>
>>>>>>> --
>>>>>>> you are the apple of my eye !
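Since running cleanup on every node after each replacement came up a few times, here is a minimal sketch of running it concurrently across nodes, along the lines Bowen suggested. The hostnames and job count are placeholders; it assumes SSH access to each node and nodetool on its PATH:

    # Start cleanup on several nodes at once, one job per node, so each node
    # keeps most of its compaction threads free for regular compactions
    for host in cass-node-01 cass-node-02 cass-node-03; do
        ssh "$host" "nodetool cleanup -j 1" &
    done
    wait    # return once cleanup has completed on every node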