Hi, I found that our cluster keeps compacting a single file forever (Cassandra 0.7.5). We are wondering whether the compaction logic is wrong, and I'd like to get comments from you guys.
Situation:
- After running a repair on a column family, our cluster's disk usage is quite high and Cassandra cannot compact all SSTables at once. It appears to end up compacting a single file at the end, over and over (see the attached log below).
- Our data has no deletes, so compacting a single file frees no disk space, and we are approaching a full disk. I believe the repair created a lot of duplicate data on disk, so compaction is needed, but most of the nodes get stuck compacting a single file. The only thing we can do is restart the nodes.

My question is why the compaction doesn't stop. I looked at the logic in CompactionManager.java:
-----------------
String compactionFileLocation = table.getDataFileLocation(cfs.getExpectedCompactedFileSize(sstables));
// If the compaction file path is null that means we have no space left for this compaction.
// try again w/o the largest one.
List<SSTableReader> smallerSSTables = new ArrayList<SSTableReader>(sstables);
while (compactionFileLocation == null && smallerSSTables.size() > 1)
{
    logger.warn("insufficient space to compact all requested files " + StringUtils.join(smallerSSTables, ", "));
    smallerSSTables.remove(cfs.getMaxSizeFile(smallerSSTables));
    compactionFileLocation = table.getDataFileLocation(cfs.getExpectedCompactedFileSize(smallerSSTables));
}
if (compactionFileLocation == null)
{
    logger.error("insufficient space to compact even the two smallest files, aborting");
    return 0;
}
-----------------
The while condition is "smallerSSTables.size() > 1". Shouldn't it be "smallerSSTables.size() > 2"?

In my understanding, compacting a single file frees disk space only when the SSTable contains a lot of tombstones and only if those tombstones are actually removed during the compaction. If Cassandra knows the SSTable has tombstones that can be removed, it is worth compacting it on its own; otherwise it might free space if you are lucky, and in the worst case it leads to an infinite loop like ours. What do you think of such a code change? (A rough sketch of what I mean is at the very end of this mail, after the log.)

Best regards,
Shotaro

* Cassandra compaction log
-------------------------
WARN [CompactionExecutor:1] 2011-04-20 01:03:14,446 CompactionManager.java (line 405) insufficient space to compact all requested files SSTableReader(path='foobar-f-3020-Data.db'), SSTableReader(path='foobar-f-3034-Data.db')
INFO [CompactionExecutor:1] 2011-04-20 03:47:29,833 CompactionManager.java (line 482) Compacted to foobar-tmp-f-3035-Data.db. 260,646,760,319 to 260,646,760,319 (~100% of original) bytes for 6,893,896 keys. Time: 9,855,385ms.
WARN [CompactionExecutor:1] 2011-04-20 03:48:11,308 CompactionManager.java (line 405) insufficient space to compact all requested files SSTableReader(path='foobar-f-3020-Data.db'), SSTableReader(path='foobar-f-3035-Data.db')
INFO [CompactionExecutor:1] 2011-04-20 06:31:41,193 CompactionManager.java (line 482) Compacted to foobar-tmp-f-3036-Data.db. 260,646,760,319 to 260,646,760,319 (~100% of original) bytes for 6,893,896 keys. Time: 9,809,882ms.
WARN [CompactionExecutor:1] 2011-04-20 06:32:22,476 CompactionManager.java (line 405) insufficient space to compact all requested files SSTableReader(path='foobar-f-3020-Data.db'), SSTableReader(path='foobar-f-3036-Data.db')
INFO [CompactionExecutor:1] 2011-04-20 09:20:29,903 CompactionManager.java (line 482) Compacted to foobar-tmp-f-3037-Data.db. 260,646,760,319 to 260,646,760,319 (~100% of original) bytes for 6,893,896 keys. Time: 10,087,424ms.
-------------------------

You can see that the compacted size is always the same: it keeps compacting the same single SSTable over and over.
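
To make the proposal concrete, here is a rough, untested sketch of the change I have in mind against the 0.7.5 snippet quoted above. It only touches the loop condition and adds comments; all identifiers are the ones already in the code, and the idea of checking for droppable tombstones is left as a comment because I don't think such a check exists today:
-----------------
String compactionFileLocation = table.getDataFileLocation(cfs.getExpectedCompactedFileSize(sstables));
List<SSTableReader> smallerSSTables = new ArrayList<SSTableReader>(sstables);
// Drop the largest candidate while there is not enough space, but stop once only two files
// are left instead of one. Without deletes, rewriting a single SSTable can never shrink it,
// so falling back to a one-file compaction just burns I/O and repeats forever.
while (compactionFileLocation == null && smallerSSTables.size() > 2)   // was: > 1
{
    logger.warn("insufficient space to compact all requested files " + StringUtils.join(smallerSSTables, ", "));
    smallerSSTables.remove(cfs.getMaxSizeFile(smallerSSTables));
    compactionFileLocation = table.getDataFileLocation(cfs.getExpectedCompactedFileSize(smallerSSTables));
}
if (compactionFileLocation == null)
{
    // This now matches the message: we really did give up at the two smallest files.
    logger.error("insufficient space to compact even the two smallest files, aborting");
    return 0;
}
// (If Cassandra could tell that an SSTable contains tombstones old enough to be purged,
// a single-file compaction could still be allowed in that one case; that would be a
// separate change and would need a check that does not exist in 0.7.5 as far as I know.)
-----------------
With this change the existing "even the two smallest files" error message also matches what the loop actually does.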