Another request to the community to see if this is feasible or not: could we avoid waiting for CEP-21 and instead do the necessary cleanup as part of regular compaction itself, so that *cleanup* never has to be run manually? For now, we could control this through a flag that is *false* by default; whoever wants the cleanup to happen as part of compaction can turn it on. Once CEP-21 is addressed, we can remove the flag and enable this behavior unconditionally. Thoughts?
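To make the proposal concrete, here is a rough sketch of what I have in mind. The property name below is purely hypothetical (it does not exist in Cassandra today) and is only meant to illustrate the opt-in, off-by-default nature of the flag:

    # HYPOTHETICAL property, shown for illustration only; not an existing Cassandra option.
    # Defaults to false, i.e. compaction keeps today's behavior. Operators who accept
    # the risk (pre CEP-21) could opt in per node, e.g. via cassandra-env.sh:
    JVM_OPTS="$JVM_OPTS -Dcassandra.discard_unowned_ranges_during_compaction=true"

With the flag off, nothing changes; with it on, every regular compaction would also drop token ranges the node no longer owns, which is the same work *nodetool cleanup* performs today.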
Jaydeep

On Tue, May 9, 2023 at 3:58 AM Bowen Song via user <user@cassandra.apache.org> wrote:

> Because an operator will need to check and ensure the schema is consistent across the cluster before running "nodetool cleanup". At the moment, it's the operator's responsibility to ensure bad things don't happen.
>
> On 09/05/2023 06:20, Jaydeep Chovatia wrote:
>
> One clarification question, Jeff.
> AFAIK, the *nodetool cleanup* also internally goes through the same compaction path as the regular compaction. Then why do we have to wait for CEP-21 to clean up unowned data in the regular compaction path? Wouldn't it be as simple as regular compaction just invoking the code of *nodetool cleanup*?
> In other words, without CEP-21, why is *nodetool cleanup* a safe operation, but doing the same thing in regular compaction isn't?
>
> Jaydeep
>
> On Fri, May 5, 2023 at 11:58 AM Jaydeep Chovatia <chovatia.jayd...@gmail.com> wrote:
>
>> Thanks, Jeff, for the detailed steps and summary.
>> We will keep the community (this thread) up to date on how it plays out in our fleet.
>>
>> Jaydeep
>>
>> On Fri, May 5, 2023 at 9:10 AM Jeff Jirsa <jji...@gmail.com> wrote:
>>
>>> Lots of caveats on these suggestions, let me try to hit most of them.
>>>
>>> Cleanup in parallel is good and fine and common. Limit the number of threads in cleanup if you're using lots of vnodes, so each node runs one at a time and not all nodes use all your cores at the same time.
>>> If a host is fully offline, you can ALSO use replace address first boot. It'll stream data right to that host with the same token assignments you had before, and no cleanup is needed then. Strictly speaking, to avoid resurrection here, you'd want to run repair on the replicas of the down host (for vnodes, probably the whole cluster), but your current process doesn't guarantee that either (decom + bootstrap may resurrect, strictly speaking).
>>> Dropping vnodes will reduce the replicas that have to be cleaned up, but also potentially increase your imbalance on each replacement.
>>>
>>> Cassandra should still do this on its own, and I think once CEP-21 is committed, this should be one of the first enhancement tickets.
>>>
>>> Until then, LeveledCompactionStrategy really does make cleanup fast and cheap, at the cost of higher IO the rest of the time. If you can tolerate that higher IO, you'll probably appreciate LCS anyway (faster reads, faster data deletion than STCS). It's a lot of IO compared to STCS, though.
>>>
>>> On Fri, May 5, 2023 at 9:02 AM Jaydeep Chovatia <chovatia.jayd...@gmail.com> wrote:
>>>
>>>> Thanks all for your valuable inputs. We will try some of the suggested methods in this thread and see how it goes. We will keep you updated on our progress.
>>>> Thanks a lot once again!
>>>>
>>>> Jaydeep
>>>>
>>>> On Fri, May 5, 2023 at 8:55 AM Bowen Song via user <user@cassandra.apache.org> wrote:
>>>>
>>>>> Depending on the number of vnodes per server, the probability and severity (i.e. the size of the affected token ranges) of an availability degradation due to a server failure during node replacement may be small. You also have the choice of increasing the RF if that's still not acceptable.
>>>>>
>>>>> Also, reducing the number of vnodes per server can limit the number of servers affected by replacing a single server, therefore reducing the amount of time required to run "nodetool cleanup" if it is run sequentially.
>>>>>
>>>>> Finally, you may choose to run "nodetool cleanup" concurrently on multiple nodes to reduce the amount of time required to complete it.
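In case it is useful to anyone following this thread, this is roughly the sequence the advice above boils down to on our side; the keyspace name is a placeholder, and the -j option (available in recent Cassandra versions) limits how many SSTables are cleaned up at once on a node:

    # 1. Confirm all nodes report the same schema version before cleaning up
    nodetool describecluster

    # 2. Run cleanup on one node (repeat per node, or on a few nodes in parallel),
    #    keeping the per-node job count low so it doesn't eat all compaction threads
    nodetool cleanup -j 1 my_keyspace

A sketch of the replace-address alternative mentioned further down in this thread is at the very bottom of this mail.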
>>>>> On 05/05/2023 16:26, Runtian Liu wrote:
>>>>>
>>>>> We are doing the "adding a node, then decommissioning a node" approach to achieve better availability. Replacing a node needs to shut down one node first; if another node is down during the node replacement period, we will get an availability drop, because most of our use cases are local_quorum with replication factor 3.
>>>>>
>>>>> On Fri, May 5, 2023 at 5:59 AM Bowen Song via user <user@cassandra.apache.org> wrote:
>>>>>
>>>>>> Have you thought of using "-Dcassandra.replace_address_first_boot=..." (or "-Dcassandra.replace_address=..." if you are using an older version)? This will not result in a topology change, which means "nodetool cleanup" is not needed after the operation is completed.
>>>>>>
>>>>>> On 05/05/2023 05:24, Jaydeep Chovatia wrote:
>>>>>>
>>>>>> Thanks, Jeff!
>>>>>> But in our environment we replace nodes quite often for various optimization purposes, say almost one node per day (a node *addition* followed by a node *decommission*, which of course changes the topology), and we have a cluster of 100 nodes with 300 GB per node. If we have to run cleanup on 100 nodes after every replacement, then it could take forever.
>>>>>> What is the recommendation until we get this fixed in Cassandra itself as part of compaction (w/o externally triggering *cleanup*)?
>>>>>>
>>>>>> Jaydeep
>>>>>>
>>>>>> On Thu, May 4, 2023 at 8:14 PM Jeff Jirsa <jji...@gmail.com> wrote:
>>>>>>
>>>>>>> Cleanup is fast and cheap and basically a no-op if you haven't changed the ring.
>>>>>>>
>>>>>>> After Cassandra has transactional cluster metadata to make ring changes strongly consistent, Cassandra should do this in every compaction. But until then it's left for operators to run when they're sure the state of the ring is correct.
>>>>>>>
>>>>>>> On May 4, 2023, at 7:41 PM, Jaydeep Chovatia <chovatia.jayd...@gmail.com> wrote:
>>>>>>>
>>>>>>> Isn't this considered a kind of *bug* in Cassandra? Because, as we know, *cleanup* is a lengthy and unreliable operation, so relying on *cleanup* means higher chances of data resurrection.
>>>>>>> Do you think we should discard the unowned token-ranges as part of the regular compaction itself? What are the pitfalls of doing this as part of compaction itself?
>>>>>>>
>>>>>>> Jaydeep
>>>>>>>
>>>>>>> On Thu, May 4, 2023 at 7:25 PM guo Maxwell <cclive1...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Compaction will just merge duplicate data and remove deleted data on this node. If you add or remove a node from the cluster, I think cleanup is needed. If cleanup failed, I think we should look into the reason.
>>>>>>>>
>>>>>>>> Runtian Liu <curly...@gmail.com> wrote on Friday, May 5, 2023 at 06:37:
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> Is cleanup the sole method to remove data that does not belong to a specific node? In a cluster where nodes are added or decommissioned from time to time, failure to run cleanup may lead to data resurrection issues, as deleted data may remain on the node that lost ownership of certain partitions.
>>>>>>>>> Or is it true that normal compactions can also handle data removal for nodes that no longer have ownership of certain data?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Runtian
>>>>>>>>
>>>>>>>> --
>>>>>>>> you are the apple of my eye !
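For completeness, since the replace-address route came up a couple of times above, below is a rough sketch of how we understand that flow. The IP is a placeholder, and the exact file to edit (cassandra-env.sh vs. a JVM options file) depends on your version and packaging:

    # On the replacement host, before its very first start, set the system property
    # (shown here appended to cassandra-env.sh; adjust for your packaging):
    JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=<ip.of.dead.node>"
    #
    # Then start Cassandra as you normally would. The new node streams data for the
    # dead node's exact token assignments, so the topology does not change and no
    # "nodetool cleanup" is needed afterwards. The *_first_boot variant is only
    # acted on during the node's first startup.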