(adding dev@) > (2) Can we implement multi-thread compaction?
I think this is the only way to scale. Or at least to implement concurrent compaction (whether it is by division into threads or not) of multiple size classes. As long as the worst-case compactions are significantly slower than best-base compactions, then presumably you will have the problem of accumulation of lots of sstables during long compactions. Since having few sstables is part of the design goal (or so I have assumed, or else you will seek to much on disk when doing e.g. a range query), triggering situations where this is not the case is a performance problem for readers. I've been thinking about this for a bit and maybe there could be one tweakable configuration setting which sets the desired machine concurrency, that the user tweaks in order to make compaction fast enough in relation to incoming writes. Regardless of the database size, this is necessary whenever cassandra is able to take writes faster than a CPU-bound compaction thread is able to process them. The other thing would be to have an intelligent compaction scheduler that does something along the lines of scheduling a compaction thread for every "level" of compaction (i.e., one for log_m(n) = 1, one for log_m(n) = 2, etc). To avoid inefficiency and huge spikes in CPU usage, these compaction threads could stop every now and then (something reasonable; say ever 100 mb compacted or something) and yield to other compaction threads. This way: (a) a limited amount of threads will be actively runnable at any given moment, allowing the user to limit the effect of background compaction can have on CPU usage (b) but on the other hand, it also means that more than one CPU can be used; whatever is appropriate for the cluster (c) it should be reasonably easy to implement because each compaction is just a regular thread doing what it does now already (d) the synchronization overhead between compaction threads should be completely irrelevant as long as one selects a high enough synchronization threshold (100 mb was just a suggestion; might be 1 gig). (e) log_m(n) will never be large enough for it to be a scaling problem that you have one thread per "level" Thoughts? -- / Peter Schuller