When a compaction needs to write a file, Cassandra will try to find a place to 
put the new file based on an estimate of its size. If it cannot find enough 
space, it will trigger a GC, which will delete any previously compacted (and 
therefore unneeded) SSTables. The same thing happens whenever a new SSTable 
needs to be written to disk. 
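
Very roughly the idea is something like the sketch below (not the actual 
Cassandra code; the method names and directory handling are made up for 
illustration):

import java.io.File;

public class CompactionSpaceCheck
{
    // Sketch only: estimate the output size, look for a data directory with
    // enough free space, and if none is found ask the JVM to GC so that
    // unreferenced, already-compacted SSTables get finalized and deleted,
    // then look again.
    static File findDirectoryFor(long estimatedSize, File[] dataDirectories)
    {
        File candidate = directoryWithSpace(estimatedSize, dataDirectories);
        if (candidate == null)
        {
            System.gc();              // give the deferred deletions a chance to run
            sleepQuietly(100);
            candidate = directoryWithSpace(estimatedSize, dataDirectories);
        }
        return candidate;             // may still be null if there really is no room
    }

    static File directoryWithSpace(long estimatedSize, File[] dirs)
    {
        File best = null;
        for (File dir : dirs)
            if (dir.getUsableSpace() >= estimatedSize
                    && (best == null || dir.getUsableSpace() > best.getUsableSpace()))
                best = dir;
        return best;
    }

    static void sleepQuietly(long millis)
    {
        try { Thread.sleep(millis); }
        catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}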

Minor compaction groups the SSTables on disk into buckets of similar sizes 
(http://wiki.apache.org/cassandra/MemtableSSTable); each bucket is processed in 
its own compaction task. Under 0.7 compaction is single threaded, and when each 
compaction task starts it will try to find space on disk, triggering a GC to 
free space if necessary. 
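
A simplified sketch of the bucketing idea is below; the 0.5x / 1.5x bounds are 
only illustrative (see the wiki page above for the real strategy):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SizeBuckets
{
    // Group SSTable sizes into buckets of "similar" size; each bucket would
    // then be handed to its own compaction task.
    static List<List<Long>> bucket(List<Long> sstableSizes)
    {
        List<Long> sorted = new ArrayList<Long>(sstableSizes);
        Collections.sort(sorted);

        List<List<Long>> buckets = new ArrayList<List<Long>>();
        List<Long> current = new ArrayList<Long>();
        long sum = 0;

        for (long size : sorted)
        {
            double average = current.isEmpty() ? size : (double) sum / current.size();
            if (!current.isEmpty() && (size < average * 0.5 || size > average * 1.5))
            {
                buckets.add(current);         // too far from this bucket's average
                current = new ArrayList<Long>();
                sum = 0;
            }
            current.add(size);
            sum += size;
        }
        if (!current.isEmpty())
            buckets.add(current);
        return buckets;
    }
}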
 
SSTables are immutable on disk; compaction cannot delete data from them because 
they are also being used to serve read requests at the same time. Doing so would 
require locking around (regions of) the file.  

Also, as far as I understand it, we cannot immediately delete the files because 
other operations (including repair) may still be using them. The data in the 
pre-compacted files is just as correct as the data in the compacted file; the 
compacted file is just more compact. So the easiest thing to do is let the JVM 
work out whether anything else is still using them. 
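
By "let the JVM work it out" I mean something along the lines of the sketch 
below (not the actual classes in org.apache.cassandra, just the pattern): the 
file on disk is only unlinked once the GC has proved that nothing still holds 
a reference to the in-memory reader for that SSTable.

import java.io.File;
import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class DeferredDeleter
{
    private final ReferenceQueue<Object> queue = new ReferenceQueue<Object>();
    // Keep the references strongly reachable until they are enqueued.
    private final Set<DeletingRef> pending =
            Collections.newSetFromMap(new ConcurrentHashMap<DeletingRef, Boolean>());

    private class DeletingRef extends PhantomReference<Object>
    {
        final File dataFile;

        DeletingRef(Object sstableReader, File dataFile)
        {
            super(sstableReader, queue);
            this.dataFile = dataFile;
        }
    }

    // Called once compaction has written the replacement SSTable.
    public void scheduleDeletion(Object sstableReader, File dataFile)
    {
        pending.add(new DeletingRef(sstableReader, dataFile));
    }

    // Run periodically, or after a forced GC when disk space is needed.
    // Anything on the queue is unreachable: no read, repair or stream can
    // still be using that reader, so the file can safely go.
    public void deleteUnreferenced()
    {
        Reference<? extends Object> ref;
        while ((ref = queue.poll()) != null)
        {
            DeletingRef dr = (DeletingRef) ref;
            pending.remove(dr);
            dr.dataFile.delete();
        }
    }
}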

Perhaps it could be improved by actively tracking which files are in use so 
they can be deleted sooner. But right now, so long as unused space is freed 
when needed, it's working as designed AFAIK. 
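
Explicit reference counting would be one way to do that, very roughly (again 
just a sketch, the real thing would have to get the races and error handling 
right):

import java.io.File;
import java.util.concurrent.atomic.AtomicInteger;

public class RefCountedSSTable
{
    private final File dataFile;
    private final AtomicInteger references = new AtomicInteger(1);  // 1 = the "live" reference
    private volatile boolean obsolete = false;

    public RefCountedSSTable(File dataFile)
    {
        this.dataFile = dataFile;
    }

    // Reads, repair and streaming take a reference before touching the file.
    public boolean acquire()
    {
        while (true)
        {
            int refs = references.get();
            if (refs == 0)
                return false;                // already gone, caller must use the new SSTable
            if (references.compareAndSet(refs, refs + 1))
                return true;
        }
    }

    public void release()
    {
        if (references.decrementAndGet() == 0 && obsolete)
            dataFile.delete();               // last user is done: delete immediately, no GC needed
    }

    // Compaction calls this once the replacement SSTable is on disk.
    public void markObsolete()
    {
        obsolete = true;
        release();                           // drop the "live" reference
    }
}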

That's my understanding; hope it helps explain why it works that way. 
Aaron

On 30 Mar 2011, at 13:32, Sheng Chen wrote:

> Yes.
> I think at least we can remove the tombstones for each sstable first, and 
> then do the merge.
> 
> 2011/3/29 Karl Hiramoto <k...@hiramoto.org>
> Would it be possible to improve the current compaction disk space issue by 
> compacting only a few SSTables at a time and then immediately deleting the old 
> ones? Looking at the logs it seems like deletions of old SSTables are taking 
> longer than necessary.
> 
> --
> Karl
> 
