The bigger the file, the longer it takes before it is part of a
compaction again. Compacting a bucket of large files also takes longer
than compacting a bucket of small files.
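
Roughly, the way I understand the size-tiered bucketing (an illustrative
sketch only -- the real logic lives in Cassandra's CompactionManager, and
the 0.5x/1.5x grouping factors and the threshold of 4 below are my
assumptions, not the exact code):

# illustrative sketch of minor-compaction bucketing, not Cassandra's code
MIN_THRESHOLD = 4  # sstables needed in a bucket to trigger a minor compaction

def bucket_by_size(sizes_mb):
    buckets = []
    for size in sorted(sizes_mb):
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            if 0.5 * avg <= size <= 1.5 * avg:
                bucket.append(size)   # similar size -> same bucket
                break
        else:
            buckets.append([size])    # nothing similar -> new bucket
    return buckets

# t0: major compaction left one 10 GB file; since then, four 64 MB flushes
for bucket in bucket_by_size([10240, 64, 64, 64, 64]):
    print(bucket, "-> compact" if len(bucket) >= MIN_THRESHOLD else "-> ignored")
# [64, 64, 64, 64] -> compact   (minors continue at the same rate)
# [10240] -> ignored            (waits until similar-sized files exist)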

Shimi

On Mon, Apr 4, 2011 at 3:58 PM, aaron morton <aa...@thelastpickle.com> wrote:

> mmm, interesting. My theory was....
>
> t0 - major compaction runs, there is now one sstable
> t1 - x new sstables have been created
> t2 - minor compaction runs and determines there are two buckets, one with
> the x new sstables and one with the single big file. The bucket of many
> files is compacted into one, the bucket of one file is ignored.
>
> I can see that it would take longer for the big file to be involved in
> compaction again, and that when it finally was, the compaction would take
> more time. But minor compactions of new SSTables would still happen at the
> same rate, especially if they are created at the same rate as before.
>
> Am I missing something or am I just reading the docs wrong?
>
> Cheers
> Aaron
>
>
> On 4 Apr 2011, at 22:20, Jonathan Colby wrote:
>
> hi Aaron -
>
> The Datastax documentation brought to light the fact that over time, major
> compactions will be performed on bigger and bigger SSTables. They actually
> recommend against performing too many major compactions, which is why I am
> wary of triggering them ...
>
> http://www.datastax.com/docs/0.7/operations/scheduled_tasks
> Performing Major Compaction
> <http://www.datastax.com/docs/0.7/operations/scheduled_tasks#performing-major-compaction>
>
> A major compaction process merges all SSTables for all column families in a
> keyspace – not just similar sized ones, as in minor compaction. Note that
> this may create extremely large SSTables that result in long intervals
> before the next minor compaction (and a resulting increase in CPU usage for
> each minor compaction).
>
> Though a major compaction ultimately frees disk space used by accumulated
> SSTables, during runtime it can temporarily double disk space usage. It is
> best to run major compactions, if at all, at times of low demand on the
> cluster.
>
>
> On Apr 4, 2011, at 1:57 PM, aaron morton wrote:
>
> cleanup reads each SSTable on disk and writes a new file that contains the
> same data with the exception of rows that are no longer in a token range the
> node is a replica for. It's not compacting the files into fewer files or
> purging tombstones. But it is re-writing all the data for the CF.
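>
> A toy sketch of that rewrite, as I understand it (my illustration only;
> the token function, ranges, and names below are made up, not Cassandra's
> actual API):
>
> def token(key):
>     # stand-in partitioner; illustrative, not the real hash
>     return hash(key) % 100
>
> def cleanup(rows, my_ranges):
>     kept = []
>     for key, columns in rows:
>         # keep rows whose token still falls in a range this node replicates;
>         # rows are copied verbatim -- nothing is merged, no tombstones purged
>         if any(lo <= token(key) < hi for lo, hi in my_ranges):
>             kept.append((key, columns))
>     return kept  # written out as the new sstable for the CF
>
> rows = [("key%d" % i, {"col": "val"}) for i in range(10)]
> print(cleanup(rows, my_ranges=[(0, 50)]))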
>
> Part of the process will trigger a GC, if needed, to free up disk space
> from SSTables that are no longer in use.
>
> AFAIK having fewer, bigger files will not cause longer minor compactions.
> Compaction thresholds are applied per bucket of files that share a similar
> size; there are normally more small files and fewer large files.
>
> Aaron
>
> On 2 Apr 2011, at 01:45, Jonathan Colby wrote:
>
> I discovered that a garbage collection cleans up the unused old SSTables.
> But I still wonder whether cleanup really does a full compaction. If so,
> that would be undesirable.
>
>
>
> On Apr 1, 2011, at 4:08 PM, Jonathan Colby wrote:
>
>
> I ran nodetool cleanup on a node in my cluster and discovered the disk
> usage went from 3.3 GB to 5.4 GB. Why is this?
>
>
> I thought cleanup just removed hinted handoff information. I read that
> *during* cleanup extra disk space is used, similar to a compaction, but I
> was expecting the disk usage to go back down when it finished.
>
>
> I hope cleanup doesn't trigger a major compaction. I'd rather not run
> major compactions because they mean future minor compactions will take
> longer and use more CPU and disk.
