[multiple active cf:s, often triggering flush at the same time]

> Can anyone confirm whether or not this behaviour is expected, and
> suggest anything that I could do about it? This is on 0.6.6, by the way.
> Patched with time-to-live code, if that makes a difference.

I looked at the code (trunk though, not 0.6.6) and was a bit
surprised. There seem to be single shared (static) executors for the
sorting and writing stages of memtable flushing (so far so good). What
I didn't expect is that each of them has a work queue whose size is
equal to its concurrency.

For the writer, the concurrency is the memtable_flush_writers option
(not available in 0.6.6); for the sorter, it is the number of CPU
cores on the system. Those values make sense as concurrency limits,
but they also end up being the queue sizes.
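
To make the shape concrete, here is a minimal standalone sketch using
plain java.util.concurrent (the class name, the boundedPool() helper
and the flushWriters value are mine for illustration; this is not the
actual Cassandra code):

    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    public class FlushExecutorSketch
    {
        // Illustrative only: a fixed-size pool whose work queue capacity
        // equals its thread count, which is the shape described above.
        static ThreadPoolExecutor boundedPool(int concurrency)
        {
            return new ThreadPoolExecutor(
                    concurrency, concurrency,             // fixed worker count
                    Long.MAX_VALUE, TimeUnit.NANOSECONDS, // workers never time out
                    new LinkedBlockingQueue<Runnable>(concurrency)); // queue == concurrency
        }

        public static void main(String[] args)
        {
            int cores = Runtime.getRuntime().availableProcessors();
            int flushWriters = 1; // stand-in for the memtable_flush_writers option

            ThreadPoolExecutor flushSorter = boundedPool(cores);
            ThreadPoolExecutor flushWriter = boundedPool(flushWriters);

            // Once (threads + queue slots) flush tasks are in flight, no further
            // memtable can be queued, no matter which column family it belongs
            // to; that is the point at which writes end up blocking.
            System.out.println("sorter queue capacity: "
                    + flushSorter.getQueue().remainingCapacity());
            System.out.println("writer queue capacity: "
                    + flushWriter.getQueue().remainingCapacity());

            flushSorter.shutdown();
            flushWriter.shutdown();
        }
    }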

If my understanding is correct and I am not missing something else,
this means that with multiple column families you should indeed
expect to hit this problem, and the more column families you actively
write to, the greater the probability.

What I expected to find was that each cf would be guaranteed at least
one memtable slot in the queue before writes for that cf would block.

Assuming the same holds true in your case on 0.6.6 (it looks to be so
on the 0.6 branch by quick examination), I would have to assume that
either one of the following is true:

(1) You have more cf:s actively written to than the number of CPU
cores on your machine so that you're waiting on flushSorter.
  or
(2) Your write speed is overall higher than what can be sustained by
an sstable writer.

If you are willing to patch Cassandra and do the appropriate testing,
and are fine with the implications for heap size, you should be able
to work around this by adjusting the size of the work queues for the
flushSorter and flushWriter in ColumnFamilyStore.java.

Note that I did not test this, so proceed with caution if you do.
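
If it helps, here is the shape of the change I mean, again as a
standalone java.util.concurrent sketch rather than a drop-in patch
(the activeColumnFamilies figure and all names here are mine; the
actual construction in ColumnFamilyStore.java differs):

    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    public class LargerFlushQueueSketch
    {
        public static void main(String[] args)
        {
            int cores = Runtime.getRuntime().availableProcessors();
            // Assumption for illustration: enough queue slots for every cf you
            // actively write to, so one busy cf cannot starve the others.
            int activeColumnFamilies = 10;
            int queueCapacity = Math.max(cores, activeColumnFamilies);

            ThreadPoolExecutor flushSorter = new ThreadPoolExecutor(
                    cores, cores,
                    Long.MAX_VALUE, TimeUnit.NANOSECONDS,
                    // was (conceptually): new LinkedBlockingQueue<Runnable>(cores)
                    new LinkedBlockingQueue<Runnable>(queueCapacity));

            // Every queued memtable is retained on the heap until it has been
            // sorted and written out, which is where the heap-size caveat
            // below comes from.
            System.out.println("sorter queue capacity: "
                    + flushSorter.getQueue().remainingCapacity());
            flushSorter.shutdown();
        }
    }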

It will definitely mean that you consume more heap space if you
submit writes to the cluster faster than they can be processed. So in
particular, if you are relying on this backpressure to keep
non-rate-limited writes from causing problems, enlarging the queues
will probably make things worse.

I'll file a bug about this to (1) elicit feedback in case I'm wrong,
and (2) get it fixed.

-- 
/ Peter Schuller
