One clarification question, Jeff.
AFAIK, *nodetool cleanup* also internally goes through the same compaction
path as regular compaction, so why do we have to wait for CEP-21 to clean up
unowned data in the regular compaction path? Wouldn't it be as simple as
having regular compaction invoke the code of *nodetool cleanup*?
In other words, without CEP-21, why is it safe for *nodetool cleanup* to do
this, but not safe for regular compaction to do the same?

Jaydeep

On Fri, May 5, 2023 at 11:58 AM Jaydeep Chovatia <chovatia.jayd...@gmail.com>
wrote:

> Thanks, Jeff, for the detailed steps and summary.
> We will keep the community (this thread) up to date on how it plays out in
> our fleet.
>
> Jaydeep
>
> On Fri, May 5, 2023 at 9:10 AM Jeff Jirsa <jji...@gmail.com> wrote:
>
>> Lots of caveats on these suggestions, let me try to hit most of them.
>>
>> Cleanup in parallel is fine and common. Limit the number of threads cleanup
>> uses if you're running lots of vnodes, so each node runs one cleanup
>> compaction at a time instead of all nodes using all of your cores at once.
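>> A minimal sketch of that (the keyspace name is just a placeholder; -j caps
>> how many cleanup compactions run at once on the node):
>>   nodetool cleanup -j 1 my_keyspace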
>> If a host is fully offline, you can ALSO use replace address first boot.
>> It'll stream data right to that host with the same token assignments you
>> had before, and no cleanup is needed then. Strictly speaking, to avoid
>> resurrection here, you'd want to run repair on the replicas of the down
>> host (for vnodes, probably the whole cluster), but your current process
>> doesn't guarantee that either (decom + bootstrap may resurrect, strictly
>> speaking).
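>> One hedged sketch of that repair step, run on each replica of the dead host
>> (or, with vnodes, on every node); the keyspace name is just a placeholder:
>>   nodetool repair -pr my_keyspace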
>> Dropping the number of vnodes will reduce the number of replicas that have
>> to be cleaned up, but it will also potentially increase your imbalance on
>> each replacement.
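>> A hedged illustration of lowering the vnode count for new or replacement
>> nodes (the config path and the value 16 are assumptions; num_tokens only
>> takes effect when a node first bootstraps, it can't be changed in place):
>>   sed -i 's/^num_tokens:.*/num_tokens: 16/' /etc/cassandra/cassandra.yaml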
>>
>> Cassandra should still do this on its own, and I think once CEP-21 is
>> committed, this should be one of the first enhancement tickets.
>>
>> Until then, LeveledCompactionStrategy really does make cleanup fast and
>> cheap, at the cost of higher IO the rest of the time. If you can tolerate
>> that higher IO, you'll probably appreciate LCS anyway (faster reads, faster
>> data deletion than STCS). It's a lot of IO compared to STCS though.
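>> If you want to go that route, switching a table to LCS is a single
>> statement (the keyspace and table names here are placeholders):
>>   cqlsh -e "ALTER TABLE my_ks.my_table WITH compaction = {'class': 'LeveledCompactionStrategy'};"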
>>
>>
>>
>> On Fri, May 5, 2023 at 9:02 AM Jaydeep Chovatia <
>> chovatia.jayd...@gmail.com> wrote:
>>
>>> Thanks all for your valuable inputs. We will try some of the suggested
>>> methods in this thread, and see how it goes. We will keep you updated on
>>> our progress.
>>> Thanks a lot once again!
>>>
>>> Jaydeep
>>>
>>> On Fri, May 5, 2023 at 8:55 AM Bowen Song via user <
>>> user@cassandra.apache.org> wrote:
>>>
>>>> Depending on the number of vnodes per server, the probability and
>>>> severity (i.e. the size of the affected token ranges) of an availability
>>>> degradation due to a server failure during node replacement may be small.
>>>> You also have the choice of increasing the RF if that's still not
>>>> acceptable.
>>>>
>>>> Also, reducing the number of vnodes per server can limit the number of
>>>> servers affected by replacing a single server, therefore reducing the
>>>> amount of time required to run "nodetool cleanup" if it is run 
>>>> sequentially.
>>>>
>>>> Finally, you may choose to run "nodetool cleanup" concurrently on
>>>> multiple nodes to reduce the amount of time required to complete it.
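>>>> For example, a rough sketch of kicking it off on several nodes at once
>>>> (host names are placeholders; this assumes SSH access and nodetool on the
>>>> PATH of each node):
>>>>   for h in node1 node2 node3; do ssh "$h" nodetool cleanup -j 1 & done; wait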
>>>>
>>>>
>>>> On 05/05/2023 16:26, Runtian Liu wrote:
>>>>
>>>> We are doing the "adding a node then decommissioning a node" approach to
>>>> achieve better availability. Replacing a node requires shutting down one
>>>> node first; if another node goes down during the replacement period, we
>>>> will see an availability drop because most of our use cases are
>>>> local_quorum with replication factor 3.
>>>>
>>>> On Fri, May 5, 2023 at 5:59 AM Bowen Song via user <
>>>> user@cassandra.apache.org> wrote:
>>>>
>>>>> Have you thought of using "-Dcassandra.replace_address_first_boot=..."
>>>>> (or "-Dcassandra.replace_address=..." if you are using an older version)?
>>>>> This will not result in a topology change, which means "nodetool cleanup"
>>>>> is not needed after the operation is completed.
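>>>>> For instance (a sketch; the IP is a placeholder for the dead node's
>>>>> address, added to cassandra-env.sh on the replacement node before its
>>>>> first start):
>>>>>   JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=10.0.0.12"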
>>>>> On 05/05/2023 05:24, Jaydeep Chovatia wrote:
>>>>>
>>>>> Thanks, Jeff!
>>>>> But in our environment we replace nodes quite often for various
>>>>> optimization purposes, say almost 1 node per day (node *addition*
>>>>> followed by node *decommission*, which of course changes the topology),
>>>>> and we have a cluster of 100 nodes with 300GB per node. If we have to
>>>>> run cleanup on 100 nodes after every replacement, it could take forever.
>>>>> What is the recommendation until we get this fixed in Cassandra itself
>>>>> as part of compaction (w/o externally triggering *cleanup*)?
>>>>>
>>>>> Jaydeep
>>>>>
>>>>> On Thu, May 4, 2023 at 8:14 PM Jeff Jirsa <jji...@gmail.com> wrote:
>>>>>
>>>>>> Cleanup is fast and cheap and basically a no-op if you haven’t
>>>>>> changed the ring
>>>>>>
>>>>>> After cassandra has transactional cluster metadata to make ring
>>>>>> changes strongly consistent, cassandra should do this in every 
>>>>>> compaction.
>>>>>> But until then it’s left for operators to run when they’re sure the state
>>>>>> of the ring is correct.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On May 4, 2023, at 7:41 PM, Jaydeep Chovatia <
>>>>>> chovatia.jayd...@gmail.com> wrote:
>>>>>>
>>>>>> Isn't this considered a kind of *bug* in Cassandra? As we know,
>>>>>> *cleanup* is a lengthy and unreliable operation, so relying on
>>>>>> *cleanup* means higher chances of data resurrection.
>>>>>> Do you think we should discard the unowned token-ranges as part of
>>>>>> regular compaction itself? What are the pitfalls of doing this as part
>>>>>> of compaction?
>>>>>>
>>>>>> Jaydeep
>>>>>>
>>>>>> On Thu, May 4, 2023 at 7:25 PM guo Maxwell <cclive1...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Compaction will just merge duplicate data and remove deleted data on
>>>>>>> this node. If you add or remove a node from the cluster, I think
>>>>>>> cleanup is needed. If cleanup failed, I think we should look into the
>>>>>>> reason.
>>>>>>>
>>>>>>> Runtian Liu <curly...@gmail.com> wrote on Fri, May 5, 2023 at 06:37:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> Is cleanup the sole method to remove data that does not belong to a
>>>>>>>> specific node? In a cluster where nodes are added or decommissioned from
>>>>>>>> from
>>>>>>>> time to time, failure to run cleanup may lead to data resurrection 
>>>>>>>> issues,
>>>>>>>> as deleted data may remain on the node that lost ownership of certain
>>>>>>>> partitions. Or is it true that normal compactions can also handle data
>>>>>>>> removal for nodes that no longer have ownership of certain data?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Runtian
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> you are the apple of my eye !
>>>>>>>
>>>>>>
