When a compaction needs to write a file, Cassandra will try to find a place to put the new file based on an estimate of its size. If it cannot find enough space it will trigger a GC, which will delete any previously compacted and therefore unneeded SSTables. The same thing happens when a new SSTable needs to be written to disk.
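Something along these lines is the idea. This is only a rough sketch of the flow, and the class and method names are made up for illustration, not the actual 0.7 code paths:

    import java.io.File;
    import java.io.IOException;

    public class CompactionSpaceCheck
    {
        // Pick a data directory with enough room for the estimated output.
        // If none has room, force a GC so the references guarding
        // already-compacted SSTables get a chance to delete them, then retry.
        public static File directoryForCompaction(File[] dataDirs, long estimatedSize)
            throws IOException, InterruptedException
        {
            File dir = firstWithFreeSpace(dataDirs, estimatedSize);
            if (dir == null)
            {
                System.gc();          // lets the deferred deletions run
                Thread.sleep(10000);  // give them a moment to hit the disk
                dir = firstWithFreeSpace(dataDirs, estimatedSize);
            }
            if (dir == null)
                throw new IOException("insufficient disk space for compaction");
            return dir;
        }

        private static File firstWithFreeSpace(File[] dataDirs, long needed)
        {
            for (File dir : dataDirs)
                if (dir.getUsableSpace() > needed)
                    return dir;
            return null;
        }
    }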
Minor compaction groups the SSTables on disk into buckets of similar sizes (http://wiki.apache.org/cassandra/MemtableSSTable); each bucket is processed in its own compaction task (there is a rough sketch of the bucketing at the bottom of this mail). Under 0.7 compaction is single threaded, and when each compaction task starts it will try to find space on disk and, if necessary, trigger a GC to free space.

SSTables are immutable on disk, so compaction cannot delete data from them while they are also being used to serve read requests; to do so would require locking around (regions of) the file. Also, as far as I understand it, we cannot immediately delete the old files because other operations (including repair) may still be using them. The data in the pre-compacted files is just as correct as the data in the compacted file; the new file is just more compact. So the easiest thing to do is let the JVM sort out whether anything else is still using them. Perhaps it could be improved by actively tracking which files are in use so they could be deleted sooner, but right now, so long as unused space is freed when it is needed, it's working as designed AFAIK.

That's my understanding, hope it helps explain why it works that way.

Aaron

On 30 Mar 2011, at 13:32, Sheng Chen wrote:

> Yes.
> I think at least we can remove the tombstones for each sstable first, and
> then do the merge.
>
> 2011/3/29 Karl Hiramoto <k...@hiramoto.org>
> Would it be possible to improve the current compaction disk space issue by
> compacting only a few SSTables at a time then immediately deleting the old
> ones? Looking at the logs it seems like deletions of old SSTables are taking
> longer than necessary.
>
> --
> Karl
>
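PS: for anyone curious, this is roughly what I mean by "buckets of similar sizes". It is a simplified sketch with made-up names and an illustrative 0.5x - 1.5x grouping rule, not the actual compaction code:

    import java.util.ArrayList;
    import java.util.List;

    public class SizeTieredBuckets
    {
        // Group SSTable sizes (in bytes) into buckets; a file joins a bucket
        // when its size is within roughly 0.5x - 1.5x of that bucket's average.
        public static List<List<Long>> bucket(List<Long> sstableSizes)
        {
            List<List<Long>> buckets = new ArrayList<List<Long>>();
            for (long size : sstableSizes)
            {
                boolean placed = false;
                for (List<Long> b : buckets)
                {
                    long avg = average(b);
                    if (size > avg / 2 && size < (avg * 3) / 2)
                    {
                        b.add(size);
                        placed = true;
                        break;
                    }
                }
                if (!placed)
                {
                    List<Long> b = new ArrayList<Long>();
                    b.add(size);
                    buckets.add(b);
                }
            }
            return buckets; // each bucket would become its own compaction task
        }

        private static long average(List<Long> bucket)
        {
            long total = 0;
            for (long s : bucket)
                total += s;
            return total / bucket.size();
        }
    }

Each bucket is then handled as its own compaction task, which is why you see several smaller compactions rather than one big one.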