[multiple active CFs, often triggering flush at the same time]

> Can anyone confirm whether or not this behaviour is expected, and
> suggest anything that I could do about it? This is on 0.6.6, by the way.
> Patched with time-to-live code, if that makes a difference.
I looked at the code (trunk though, not 0.6.6) and was a bit surprised. There is a single shared (static) executor for each of the sorting and writing stages of memtable flushing (so far so good). What I did not expect is that each of these executors has a work queue whose size is equal to its concurrency. For the writer, the concurrency is the memtable_flush_writers option (not available in 0.6.6); for the sorter, it is the number of CPU cores on the system. That makes sense for the concurrency aspect, but if my understanding is correct and I am not missing something, it means that with multiple column families you do indeed have to expect this problem, and the more column families, the greater the probability. What I expected to find was that each CF would be guaranteed at least one memtable slot in the queue before writes to that CF would block.

Assuming the same holds in your case on 0.6.6 (a quick look at the 0.6 branch suggests it does), I would assume that one of the following is true:

(1) You have more CFs being actively written to than your machine has CPU cores, so that you are waiting on flushSorter; or

(2) your overall write speed is higher than a single sstable writer can sustain.

If you are willing to patch Cassandra, do the appropriate testing, and are fine with the implications for heap size, you should be able to work around this by increasing the size of the work queues for flushSorter and flushWriter in ColumnFamilyStore.java (see the untested sketch at the end of this mail). Note that I did not test this, so proceed with caution if you do. It will definitely mean using more heap if you submit writes to the cluster faster than they can be processed, so if you are relying on this backpressure to avoid problems when doing non-rate-limited writes to the cluster, the workaround will probably make things worse.

I'll file a bug about this to (1) elicit feedback if I'm wrong, and (2) get it fixed.

--
/ Peter Schuller
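P.S. To make the queueing behaviour concrete, here is a minimal, self-contained Java sketch. It is not the actual Cassandra source; the pool size, queue depth and column-family count are made-up numbers, and task rejection stands in for the blocking submit that Cassandra's flush executors actually use (which is what makes writes stall). The point is only that with a work queue as deep as the concurrency, a burst of flushes from several CFs saturates the pool, and that enlarging the queue depth is exactly the heap-for-backpressure trade-off described above.

    import java.util.concurrent.*;

    // Sketch only: a shared "flush" pool whose work queue is as deep as its
    // concurrency, roughly the shape of the flushSorter/flushWriter pools.
    public class BoundedFlushQueueSketch
    {
        public static void main(String[] args) throws Exception
        {
            int concurrency = 2;            // e.g. memtable_flush_writers, or CPU core count
            int queueDepth  = concurrency;  // what the current code appears to do;
                                            // the workaround is to make this larger

            ThreadPoolExecutor flushWriter = new ThreadPoolExecutor(
                    concurrency, concurrency,
                    60, TimeUnit.SECONDS,
                    new LinkedBlockingQueue<Runnable>(queueDepth));

            // Simulate several column families all deciding to flush at once.
            // Once concurrency + queueDepth flushes are outstanding, further
            // submissions are rejected here (in Cassandra, the submitter
            // blocks instead, stalling writes to that CF).
            for (int cf = 0; cf < 6; cf++)
            {
                final int id = cf;
                try
                {
                    flushWriter.execute(new Runnable()
                    {
                        public void run()
                        {
                            try { Thread.sleep(1000); } catch (InterruptedException e) { }
                            System.out.println("flushed memtable for cf " + id);
                        }
                    });
                    System.out.println("queued flush for cf " + id);
                }
                catch (RejectedExecutionException e)
                {
                    System.out.println("flush for cf " + id + " could not be queued (pool saturated)");
                }
            }
            flushWriter.shutdown();
            flushWriter.awaitTermination(1, TimeUnit.MINUTES);
        }
    }

Run as-is, the last couple of flushes fail to queue; bump queueDepth up and they all queue, at the cost of holding the corresponding memtables in heap until the writers catch up.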