Cleanup, by itself, uses all the available compactors, so first check that you
have the disk space for multiple large cleanup compactions running at the same
time. We have a utility that does cleanup more intelligently: it temporarily
doubles compaction throughput, operates on a single keyspace, sorts tables by
size ascending, and runs only one compaction thread at a time (-j 1) to avoid
having several large compactions in flight at once. It also verifies that
there is enough free disk space to handle the largest sstable of the table
about to be cleaned up. A sketch of the approach is below.
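Our tool isn't public, but a minimal bash sketch of the same approach might
look like this (paths, output parsing, and directory layout are illustrative
assumptions, not the exact tool):

    #!/usr/bin/env bash
    # Sketch of an "intelligent cleanup" wrapper; adjust paths and parsing
    # for your install and Cassandra version.
    KEYSPACE="$1"
    DATA_DIR="/var/lib/cassandra/data/$KEYSPACE"

    # Temporarily double compaction throughput; restore the old value on exit.
    ORIG=$(nodetool getcompactionthroughput | grep -oE '[0-9]+' | head -1)
    nodetool setcompactionthroughput $((ORIG * 2))
    trap 'nodetool setcompactionthroughput "$ORIG"' EXIT

    # Walk the keyspace's tables smallest-first, so early cleanups free
    # space that the later, larger ones can use.
    for table_dir in $(du -s "$DATA_DIR"/*/ | sort -n | awk '{print $2}'); do
      table=$(basename "$table_dir" | cut -d- -f1)

      # Cleanup rewrites sstables, so require free space at least as large
      # as this table's biggest sstable before proceeding.
      largest=$(find "$table_dir" -name '*-Data.db' -printf '%s\n' | sort -n | tail -1)
      free=$(df --output=avail -B1 "$DATA_DIR" | tail -1)
      if [ "${largest:-0}" -ge "$free" ]; then
        echo "Skipping $table: not enough free space for its largest sstable" >&2
        continue
      fi

      # -j 1: one compaction thread, so only one large compaction runs at a time.
      nodetool cleanup -j 1 -- "$KEYSPACE" "$table"
    done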

It works very well where table sizes form a stair-step arrangement: we recover
space from the smaller tables first and work up to the largest ones with
whatever extra space we have gained along the way.


Sean R. Durity

From: Dipan Shah <dipan.s...@anant.us>
Sent: Friday, February 17, 2023 2:50 AM
To: user@cassandra.apache.org
Subject: [EXTERNAL] Re: Cleanup

Hi Marc,

Changes made with "nodetool setcompactionthroughput" only last until the
Cassandra service restarts.

After a restart, the throughput reverts to the value set in cassandra.yaml.
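For example (the number is illustrative):

    # Raise the limit at runtime; this lasts only until the service restarts.
    nodetool setcompactionthroughput 128

    # Confirm the live value.
    nodetool getcompactionthroughput

    # To make it permanent, set it in cassandra.yaml instead
    # (compaction_throughput_mb_per_sec before 4.1, compaction_throughput in 4.1+):
    #   compaction_throughput_mb_per_sec: 128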

On Fri, Feb 17, 2023 at 1:04 PM Marc Hoppins <marc.hopp...@eset.com> wrote:
…and if it is altered via nodetool, is it altered until manually changed or
until a service restart, so that it must be manually put back?



From: Aaron Ploetz <aaronplo...@gmail.com>
Sent: Thursday, February 16, 2023 4:50 PM
To: user@cassandra.apache.org
Subject: Re: Cleanup

So if I remember right, setting compaction_throughput_mb_per_sec to zero
effectively disables throttling, which means cleanup and compaction will run
as fast as the instance will allow.  For normal use, I'd recommend capping
that at 8 or 16 MB/s.
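In cassandra.yaml terms, that would be (property name as of pre-4.1; 4.1
renamed it to compaction_throughput):

    # 0 disables throttling entirely; cleanup and compaction run as fast
    # as the instance allows.
    compaction_throughput_mb_per_sec: 0

    # A safer cap for normal use:
    # compaction_throughput_mb_per_sec: 16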

Aaron


On Thu, Feb 16, 2023 at 9:43 AM Marc Hoppins <marc.hopp...@eset.com> wrote:
compaction_throughput_mb_per_sec is 0 in cassandra.yaml. Is setting it in
nodetool going to provide any increase?

From: Durity, Sean R via user <user@cassandra.apache.org>
Sent: Thursday, February 16, 2023 4:20 PM
To: user@cassandra.apache.org
Subject: RE: Cleanup

Clean-up is constrained/throttled by the compaction throughput setting. If
your system can handle it, you can increase that throughput (nodetool
setcompactionthroughput) for the clean-up in order to reduce the total time.

It is a node-local operation, not a cluster-wide one. I often run clean-up on
all nodes in a DC at the same time. Think of it as compaction and plan your
cluster performance/workload/timelines accordingly.
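For example, something along these lines (hostnames illustrative, assuming
ssh access from an admin host):

    # Kick off cleanup on every node in one DC in parallel.
    for host in cass-dc1-01 cass-dc1-02 cass-dc1-03; do
      ssh "$host" nodetool cleanup my_keyspace &
    done
    wait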

Sean R. Durity

From: manish khandelwal <manishkhandelwa...@gmail.com>
Sent: Thursday, February 16, 2023 5:05 AM
To: user@cassandra.apache.org
Subject: [EXTERNAL] Re: Cleanup


There is no advantage to running cleanup if no new nodes have been introduced,
so running it regularly would not reduce cleanup time for when you do add
nodes.

Cleanup is local to a node, so network bandwidth has no effect on cleanup
time.

Don't ignore cleanup, though, as it can leave your disks occupied by data that
is no longer needed.

You should plan to run cleanup in a lean period (low traffic). You can also
use the keyspace and table-name suboptions to stage it so that I/O pressure
stays low.
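For instance (keyspace/table names illustrative):

    # Clean one table at a time to limit I/O pressure.
    nodetool cleanup -j 1 my_keyspace my_table

    # Or a whole keyspace, one compaction job at a time:
    nodetool cleanup -j 1 my_keyspace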


Regards
Manish

On Thu, Feb 16, 2023 at 3:12 PM Marc Hoppins <marc.hopp...@eset.com> wrote:
Hulloa all,

I read a recommendation that, after adding a new node, you should run cleanup
on the existing nodes to remove data for token ranges they no longer own.

I timed this way back when we only had ~20G of data per node and it took 
approx. 5 mins per node.  After adding a node on Tuesday, I figured I’d run 
cleanup.

Per node, it is taking 6+ hours now as we have 2-2.5T per node.

Should we be running cleanup regularly regardless of whether or not new nodes 
have been added?  Would it reduce cleanup times for when we do add new nodes?
If we double the network bandwidth can we effectively reduce this lengthy 
cleanup?
Maybe just ignore cleanup entirely?
I appreciate that cleanup will increase the load but running cleanup on one 
node at a time seems impractical.  How many simultaneous nodes (per rack) 
should we limit cleanup to?

More experienced suggestions would be most appreciated.

Marc




--

Thanks,

Dipan Shah

Data Engineer




3 Washington Circle NW, Suite 301

Washington, D.C. 20037



Check out our blog <https://blog.anant.us/>!



