Another request to the community to see if this is feasible or not: could we avoid waiting for CEP-21 and instead do the necessary cleanup as part of regular compaction itself, so that *cleanup* never has to be run manually? For now, we could control this through a flag that is *false* by default; whoever wants the cleanup to happen as part of compaction can turn it on. Once CEP-21 is addressed, we can remove the flag and enable this behavior unconditionally. Thoughts?
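To make the proposal concrete, here is a rough sketch of what I have in mind. The property name below is purely hypothetical (it does not exist in Cassandra today) and is only meant to illustrate the opt-in, off-by-default nature of the flag:

    # HYPOTHETICAL property, shown for illustration only; not an existing Cassandra option.
    # Defaults to false, i.e. compaction keeps today's behavior. Operators who accept
    # the risk (pre CEP-21) could opt in per node, e.g. via cassandra-env.sh:
    JVM_OPTS="$JVM_OPTS -Dcassandra.discard_unowned_ranges_during_compaction=true"

With the flag off, nothing changes; with it on, every regular compaction would also drop token ranges the node no longer owns, which is the same work *nodetool cleanup* performs today.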
Jaydeep

On Tue, May 9, 2023 at 3:58 AM Bowen Song via user <user@cassandra.apache.org> wrote:

> Because an operator will need to check and ensure the schema is consistent across the cluster before running "nodetool cleanup". At the moment, it's the operator's responsibility to ensure bad things don't happen.
>
> On 09/05/2023 06:20, Jaydeep Chovatia wrote:
>
> One clarification question, Jeff.
> AFAIK, the *nodetool cleanup* also internally goes through the same compaction path as the regular compaction. Then why do we have to wait for CEP-21 to clean up unowned data in the regular compaction path? Wouldn't it be as simple as regular compaction just invoking the code of *nodetool cleanup*?
> In other words, without CEP-21, why is *nodetool cleanup* a safe operation, but doing the same thing in regular compaction isn't?
>
> Jaydeep
>
> On Fri, May 5, 2023 at 11:58 AM Jaydeep Chovatia <chovatia.jayd...@gmail.com> wrote:
>
>> Thanks, Jeff, for the detailed steps and summary.
>> We will keep the community (this thread) up to date on how it plays out in our fleet.
>>
>> Jaydeep
>>
>> On Fri, May 5, 2023 at 9:10 AM Jeff Jirsa <jji...@gmail.com> wrote:
>>
>>> Lots of caveats on these suggestions, let me try to hit most of them.
>>>
>>> Cleanup in parallel is good and fine and common. Limit the number of threads in cleanup if you're using lots of vnodes, so each node runs one at a time and not all nodes use all your cores at the same time.
>>> If a host is fully offline, you can ALSO use replace address first boot. It'll stream data right to that host with the same token assignments you had before, and no cleanup is needed then. Strictly speaking, to avoid resurrection here, you'd want to run repair on the replicas of the down host (for vnodes, probably the whole cluster), but your current process doesn't guarantee that either (decom + bootstrap may resurrect, strictly speaking).
>>> Dropping vnodes will reduce the replicas that have to be cleaned up, but also potentially increase your imbalance on each replacement.
>>>
>>> Cassandra should still do this on its own, and I think once CEP-21 is committed, this should be one of the first enhancement tickets.
>>>
>>> Until then, LeveledCompactionStrategy really does make cleanup fast and cheap, at the cost of higher IO the rest of the time. If you can tolerate that higher IO, you'll probably appreciate LCS anyway (faster reads, faster data deletion than STCS). It's a lot of IO compared to STCS, though.
>>>
>>> On Fri, May 5, 2023 at 9:02 AM Jaydeep Chovatia <chovatia.jayd...@gmail.com> wrote:
>>>
>>>> Thanks all for your valuable inputs. We will try some of the suggested methods in this thread and see how it goes. We will keep you updated on our progress.
>>>> Thanks a lot once again!
>>>>
>>>> Jaydeep
>>>>
>>>> On Fri, May 5, 2023 at 8:55 AM Bowen Song via user <user@cassandra.apache.org> wrote:
>>>>
>>>>> Depending on the number of vnodes per server, the probability and severity (i.e. the size of the affected token ranges) of an availability degradation due to a server failure during node replacement may be small. You also have the choice of increasing the RF if that's still not acceptable.
>>>>>
>>>>> Also, reducing the number of vnodes per server can limit the number of servers affected by replacing a single server, therefore reducing the amount of time required to run "nodetool cleanup" if it is run sequentially.
>>>>>
>>>>> Finally, you may choose to run "nodetool cleanup" concurrently on multiple nodes to reduce the amount of time required to complete it.
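In case it is useful to anyone following this thread, this is roughly the sequence the advice above boils down to on our side; the keyspace name is a placeholder, and the -j option (available in recent Cassandra versions) limits how many SSTables are cleaned up at once on a node:

    # 1. Confirm all nodes report the same schema version before cleaning up
    nodetool describecluster

    # 2. Run cleanup on one node (repeat per node, or on a few nodes in parallel),
    #    keeping the per-node job count low so it doesn't eat all compaction threads
    nodetool cleanup -j 1 my_keyspace

A sketch of the replace-address alternative mentioned further down in this thread is at the very bottom of this mail.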
>>>>> On 05/05/2023 16:26, Runtian Liu wrote:
>>>>>
>>>>> We are doing the "adding a node, then decommissioning a node" approach to achieve better availability. Replacing a node needs to shut down one node first; if another node is down during the node replacement period, we will get an availability drop, because most of our use cases are local_quorum with replication factor 3.
>>>>>
>>>>> On Fri, May 5, 2023 at 5:59 AM Bowen Song via user <user@cassandra.apache.org> wrote:
>>>>>
>>>>>> Have you thought of using "-Dcassandra.replace_address_first_boot=..." (or "-Dcassandra.replace_address=..." if you are using an older version)? This will not result in a topology change, which means "nodetool cleanup" is not needed after the operation is completed.
>>>>>>
>>>>>> On 05/05/2023 05:24, Jaydeep Chovatia wrote:
>>>>>>
>>>>>> Thanks, Jeff!
>>>>>> But in our environment we replace nodes quite often for various optimization purposes, say almost one node per day (a node *addition* followed by a node *decommission*, which of course changes the topology), and we have a cluster of 100 nodes with 300 GB per node. If we have to run cleanup on 100 nodes after every replacement, then it could take forever.
>>>>>> What is the recommendation until we get this fixed in Cassandra itself as part of compaction (w/o externally triggering *cleanup*)?
>>>>>>
>>>>>> Jaydeep
>>>>>>
>>>>>> On Thu, May 4, 2023 at 8:14 PM Jeff Jirsa <jji...@gmail.com> wrote:
>>>>>>
>>>>>>> Cleanup is fast and cheap and basically a no-op if you haven't changed the ring.
>>>>>>>
>>>>>>> After Cassandra has transactional cluster metadata to make ring changes strongly consistent, Cassandra should do this in every compaction. But until then it's left for operators to run when they're sure the state of the ring is correct.
>>>>>>>
>>>>>>> On May 4, 2023, at 7:41 PM, Jaydeep Chovatia <chovatia.jayd...@gmail.com> wrote:
>>>>>>>
>>>>>>> Isn't this considered a kind of *bug* in Cassandra? Because, as we know, *cleanup* is a lengthy and unreliable operation, so relying on *cleanup* means higher chances of data resurrection.
>>>>>>> Do you think we should discard the unowned token-ranges as part of the regular compaction itself? What are the pitfalls of doing this as part of compaction itself?
>>>>>>>
>>>>>>> Jaydeep
>>>>>>>
>>>>>>> On Thu, May 4, 2023 at 7:25 PM guo Maxwell <cclive1...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Compaction will just merge duplicate data and remove deleted data on this node. If you add or remove a node from the cluster, I think cleanup is needed. If cleanup failed, I think we should look into the reason.
>>>>>>>>
>>>>>>>> Runtian Liu <curly...@gmail.com> wrote on Friday, May 5, 2023 at 06:37:
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> Is cleanup the sole method to remove data that does not belong to a specific node? In a cluster where nodes are added or decommissioned from time to time, failure to run cleanup may lead to data resurrection issues, as deleted data may remain on the node that lost ownership of certain partitions.
>>>>>>>>> Or is it true that normal compactions can also handle data removal for nodes that no longer have ownership of certain data?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Runtian
>>>>>>>>
>>>>>>>> --
>>>>>>>> you are the apple of my eye !
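For completeness, since the replace-address route came up a couple of times above, below is a rough sketch of how we understand that flow. The IP is a placeholder, and the exact file to edit (cassandra-env.sh vs. a JVM options file) depends on your version and packaging:

    # On the replacement host, before its very first start, set the system property
    # (shown here appended to cassandra-env.sh; adjust for your packaging):
    JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=<ip.of.dead.node>"
    #
    # Then start Cassandra as you normally would. The new node streams data for the
    # dead node's exact token assignments, so the topology does not change and no
    # "nodetool cleanup" is needed afterwards. The *_first_boot variant is only
    # acted on during the node's first startup.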