The bigger the file, the longer it will take for it to be part of a compaction again. Compacting a bucket of large files takes longer than compacting a bucket of small files.
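Here is a rough sketch of the bucketing idea (this is not Cassandra's actual code; the 50%-of-average grouping rule and the min_threshold of 4 are simplifications of what 0.7's minor compaction does):

    # Rough sketch of size-tiered bucketing, for illustration only.
    def bucket_sstables(sizes_mb, min_threshold=4):
        """Group SSTable sizes into buckets of similar-sized files."""
        buckets = []  # each bucket is a list of sizes
        for size in sorted(sizes_mb):
            for bucket in buckets:
                avg = sum(bucket) / len(bucket)
                # join a bucket if within 50%-150% of its average size
                if 0.5 * avg <= size <= 1.5 * avg:
                    bucket.append(size)
                    break
            else:
                buckets.append([size])
        # only buckets with at least min_threshold files get compacted
        return [(b, len(b) >= min_threshold) for b in buckets]

    # after a major compaction: one 10 GB file plus a stream of ~50 MB flushes
    for bucket, eligible in bucket_sstables([10240, 50, 55, 48, 52]):
        print(bucket, "compact" if eligible else "ignore")
    # the four small files form one eligible bucket; the 10 GB file sits
    # alone and is ignored until enough similar-sized giants accumulate

So the small files keep getting compacted at the same rate, but each round produces a somewhat larger file, and it takes longer and longer before anything grows into the big file's bucket.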
Shimi

On Mon, Apr 4, 2011 at 3:58 PM, aaron morton <aa...@thelastpickle.com> wrote:

> mmm, interesting. My theory was....
>
> t0 - major compaction runs, there is now one sstable
> t1 - x new sstables have been created
> t2 - minor compaction runs and determines there are two buckets, one with
> the x new sstables and one with the single big file. The bucket of many
> files is compacted into one, the bucket of one file is ignored.
>
> I can see that it takes longer for the big file to be involved in
> compaction again, and that when it finally is, it will take more time. But
> minor compactions of new SSTables would still happen at the same rate,
> especially if they are created at the same rate as previously.
>
> Am I missing something, or am I just reading the docs wrong?
>
> Cheers
> Aaron
>
>
> On 4 Apr 2011, at 22:20, Jonathan Colby wrote:
>
> hi Aaron -
>
> The Datastax documentation brought to light the fact that over time, major
> compactions will be performed on bigger and bigger SSTables. They
> actually recommend against performing too many major compactions, which is
> why I am wary of triggering too many major compactions ...
>
> http://www.datastax.com/docs/0.7/operations/scheduled_tasks
>
> Performing Major Compaction
> <http://www.datastax.com/docs/0.7/operations/scheduled_tasks#performing-major-compaction>
>
> A major compaction process merges all SSTables for all column families in a
> keyspace – not just similar-sized ones, as in minor compaction. Note that
> this may create extremely large SSTables that result in long intervals
> before the next minor compaction (and a resulting increase in CPU usage for
> each minor compaction).
>
> Though a major compaction ultimately frees disk space used by accumulated
> SSTables, during runtime it can temporarily double disk space usage. It is
> best to run major compactions, if at all, at times of low demand on the
> cluster.
>
>
> On Apr 4, 2011, at 1:57 PM, aaron morton wrote:
>
> cleanup reads each SSTable on disk and writes a new file that contains the
> same data, with the exception of rows that are no longer in a token range
> the node is a replica for. It's not compacting the files into fewer files
> or purging tombstones. But it is re-writing all the data for the CF.
>
> Part of the process will trigger GC if needed to free up disk space from
> SSTables no longer needed.
>
> AFAIK having fewer, bigger files will not cause longer minor compactions.
> Compaction thresholds are applied per bucket of files that share a similar
> size; there are normally more smaller files and fewer larger files.
>
> Aaron
>
> On 2 Apr 2011, at 01:45, Jonathan Colby wrote:
>
> I discovered that a garbage collection cleans up the unused old SSTables.
> But I still wonder whether cleanup really does a full compaction. That
> would be undesirable if so.
>
>
> On Apr 1, 2011, at 4:08 PM, Jonathan Colby wrote:
>
> I ran node cleanup on a node in my cluster and discovered the disk usage
> went from 3.3 GB to 5.4 GB. Why is this?
>
> I thought cleanup just removed hinted handoff information. I read that
> *during* cleanup extra disk space will be used, similar to a compaction.
> But I was expecting the disk usage to go back down when it finished.
>
> I hope cleanup doesn't trigger a major compaction. I'd rather not run
> major compactions because it means future minor compactions will take
> longer and use more CPU and disk.
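To make Aaron's description of cleanup concrete, here is a minimal sketch of the per-SSTable rewrite he describes (the Row type, the token ranges, and the function names are made up for illustration; Cassandra's real internals differ):

    # Sketch of what cleanup does per SSTable, per Aaron's description
    # above (illustrative only; Row, owned ranges and the file layout
    # are hypothetical, not Cassandra's real data structures).
    from typing import Iterable, List, Tuple

    Row = Tuple[int, bytes]   # (token, row data) - hypothetical
    Range = Tuple[int, int]   # (start, end] token range - hypothetical

    def in_owned_range(token: int, owned: List[Range]) -> bool:
        return any(start < token <= end for start, end in owned)

    def cleanup_sstable(rows: Iterable[Row], owned: List[Range]) -> List[Row]:
        """Rewrite an SSTable, keeping only rows this node still replicates.

        Every surviving row is rewritten into a brand-new file, so disk
        usage roughly doubles while the old and new files coexist; the
        old file is only deleted later (after GC in 0.7), which is why
        usage can appear to grow right after cleanup finishes.
        """
        return [(token, data) for token, data in rows
                if in_owned_range(token, owned)]

    # a node that owns tokens (0, 100]: rows at tokens 150 and 900 are dropped
    print(cleanup_sstable([(42, b"a"), (150, b"b"), (99, b"c"), (900, b"d")],
                          owned=[(0, 100)]))
    # -> [(42, b'a'), (99, b'c')]

This matches what Jonathan saw: cleanup is a full rewrite of the CF's data, not a major compaction, and the space comes back only once the old files are actually removed.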