I think the key thing to remember is that compaction is performed on
*similar*-sized SSTables, so it makes sense that over time this has a
cascading effect. I think by default it starts out by compacting 4 freshly
flushed SSTables, and then the cycle begins.
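Roughly the idea, as I understand it (a toy sketch in Python, not Cassandra's
actual code; the ~0.5x-1.5x size window and the threshold of 4 are the
defaults as far as I know):

    # Toy sketch of size-tiered minor compaction selection (not Cassandra's code).
    # Assumption: an SSTable joins a bucket if it is within ~0.5x-1.5x of the
    # bucket's average size, and a bucket is only compacted once it holds at
    # least min_threshold (default 4) files.
    def compaction_candidates(sizes_mb, min_threshold=4):
        buckets = []  # each bucket is a list of sstable sizes
        for size in sorted(sizes_mb):
            for bucket in buckets:
                avg = sum(bucket) / len(bucket)
                if 0.5 * avg <= size <= 1.5 * avg:
                    bucket.append(size)
                    break
            else:
                buckets.append([size])
        # only buckets that reached the threshold get compacted
        return [b for b in buckets if len(b) >= min_threshold]

    # Four freshly flushed ~64 MB sstables form one bucket and get compacted;
    # the resulting ~256 MB file then waits for similar-sized peers, and so on.
    print(compaction_candidates([64, 64, 64, 64]))

The cascade follows because each round produces a bigger file that then has
to wait for similarly big peers before it is touched again.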
On Apr 4, 2011 3:42pm, shimi <shim...@gmail.com> wrote:
The bigger the file, the longer it will take for it to be part of a
compaction again. Compacting a bucket of large files takes longer than
compacting a bucket of small files.
Shimi
On Mon, Apr 4, 2011 at 3:58 PM, aaron morton <aa...@thelastpickle.com>
wrote:
mmm, interesting. My theory was....
t0 - major compaction runs, there is now one sstable
t1 - x new sstables have been created
t2 - minor compaction runs and determines there are two buckets, one with
the x new SSTables and one with the single big file. The bucket of many
files is compacted into one; the bucket with the single file is ignored.
I can see that it takes longer for the big file to be involved in a
compaction again, and that when it finally is, it will take more time. But
minor compactions of the new SSTables should still happen at the same
rate, especially if they are created at the same rate as before.
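To illustrate with toy numbers (assuming the size-tiered rule that a file
joins a bucket only when it is within ~0.5x-1.5x of the bucket average, and
that a bucket needs >= 4 files before it is minor compacted):

    # Quick check of the theory above with made-up sizes (in MB).
    new_flushes = [64, 64, 64, 64]   # the x new sstables created since t1
    big_file = 10_240                # the single sstable left by the major compaction

    avg_small = sum(new_flushes) / len(new_flushes)
    shares_bucket = 0.5 * avg_small <= big_file <= 1.5 * avg_small
    print("big file lands in the small-file bucket:", shares_bucket)          # False
    print("small-file bucket reaches the threshold:", len(new_flushes) >= 4)  # True

So the big file sits alone in its own bucket and is ignored, while the new
files keep being compacted at the usual rate.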
Am I missing something, or am I just reading the docs wrong?
Cheers
Aaron
On 4 Apr 2011, at 22:20, Jonathan Colby wrote:
hi Aaron -
The DataStax documentation brought to light the fact that over time, major
compactions will be performed on bigger and bigger SSTables. They actually
recommend against performing too many major compactions, which is why I am
wary of triggering them ...
http://www.datastax.com/docs/0.7/operations/scheduled_tasks
Performing Major Compaction
A major compaction process merges all SSTables for all column
families in a keyspace – not just similar sized ones, as in minor
compaction. Note that this may create extremely large SSTables that
result in long intervals before the next minor compaction (and a
resulting increase in CPU usage for each minor compaction).
Though a major compaction ultimately frees disk space used by
accumulated SSTables, during runtime it can temporarily double disk
space usage. It is best to run major compactions, if at all, at times of
low demand on the cluster.
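A rough back-of-the-envelope for the "temporarily double disk space" point
(made-up numbers; how much is actually reclaimed depends on overwrites and
tombstones):

    # Illustrative peak-disk estimate for a major compaction (not a real formula).
    # The old SSTables stay on disk until the new merged SSTable is written,
    # so peak usage is roughly old data + new data.
    live_sstables_gb = [12.0, 6.0, 3.0, 1.5]   # made-up per-sstable sizes
    purgeable_fraction = 0.10                  # assumed overwrites/tombstones dropped

    old_total = sum(live_sstables_gb)
    new_total = old_total * (1 - purgeable_fraction)
    print(f"before: {old_total:.1f} GB")
    print(f"peak during compaction: ~{old_total + new_total:.1f} GB")
    print(f"after old files are deleted: ~{new_total:.1f} GB")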
On Apr 4, 2011, at 1:57 PM, aaron morton wrote:
Cleanup reads each SSTable on disk and writes a new file that contains
the same data, with the exception of rows that are no longer in a token
range the node is a replica for. It's not compacting the files into fewer
files or purging tombstones, but it is re-writing all the data for the CF.
Part of the process will trigger GC if needed to free up disk space from
SSTables that are no longer needed.
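As a minimal sketch of that distinction (hypothetical row/range structures,
not Cassandra's API): each SSTable is rewritten one-for-one, and rows whose
token no longer falls in a range this node replicates are simply not copied
over.

    # Minimal sketch of the cleanup idea (hypothetical structures, not Cassandra's API).
    def in_local_ranges(token, local_ranges):
        # local_ranges: list of (start, end] token ranges this node is a replica for
        return any(start < token <= end for start, end in local_ranges)

    def cleanup_sstable(rows, local_ranges):
        # rows: iterable of (token, row_data); returns the rows kept in the new file
        return [(t, row) for t, row in rows if in_local_ranges(t, local_ranges)]

    # One old sstable in, one new sstable out -- no merging, no tombstone purge.
    old_sstable = [(5, "row-a"), (42, "row-b"), (90, "row-c")]
    print(cleanup_sstable(old_sstable, local_ranges=[(0, 50)]))  # keeps tokens 5 and 42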
AFAIK having fewer, bigger files will not cause longer minor compactions.
Compaction thresholds are applied per bucket of files that share a similar
size; there are normally more small files and fewer large files.
Aaron
On 2 Apr 2011, at 01:45, Jonathan Colby wrote:
I discovered that a garbage collection cleans up the unused old SSTables.
But I still wonder whether cleanup really does a full compaction. If so,
that would be undesirable.
On Apr 1, 2011, at 4:08 PM, Jonathan Colby wrote:
I ran node cleanup on a node in my cluster and discovered the disk usage
went from 3.3 GB to 5.4 GB. Why is this?
I thought cleanup just removed hinted handoff information. I read that
*during* cleanup extra disk space will be used, similar to a compaction,
but I was expecting the disk usage to go back down when it finished.
I hope cleanup doesn't trigger a major compaction. I'd rather not run
major compactions, because that means future minor compactions will take
longer and use more CPU and disk.