I appear to have the problem illustrated by https://issues.apache.org/jira/browse/CASSANDRA-1955. Even at low data rates, I'm seeing mutation messages dropped because writers are blocked during a storm of memtable flushes. The OpsCenter memtables also seem to contribute to this:
INFO [OptionalTasks:1] 2013-08-23 01:53:58,522 ColumnFamilyStore.java (line 630) Enqueuing flush of Memtable-runratecountforiczone@1281182121(14976/120803 serialized/live bytes, 360 ops)
INFO [OptionalTasks:1] 2013-08-23 01:53:58,523 ColumnFamilyStore.java (line 630) Enqueuing flush of Memtable-runratecountforchannel@705923070(278200/1048576 serialized/live bytes, 6832 ops)
INFO [OptionalTasks:1] 2013-08-23 01:53:58,525 ColumnFamilyStore.java (line 630) Enqueuing flush of Memtable-solr_resources@1615459594(66362/66362 serialized/live bytes, 4 ops)
INFO [OptionalTasks:1] 2013-08-23 01:53:58,525 ColumnFamilyStore.java (line 630) Enqueuing flush of Memtable-scheduleddaychannelie@393647337(33203968/36700160 serialized/live bytes, 865620 ops)
INFO [OptionalTasks:1] 2013-08-23 01:53:58,530 ColumnFamilyStore.java (line 630) Enqueuing flush of Memtable-failediecountfornetwork@1781160199(8680/124903 serialized/live bytes, 273 ops)
INFO [OptionalTasks:1] 2013-08-23 01:53:58,530 ColumnFamilyStore.java (line 630) Enqueuing flush of Memtable-rollups7200@37425413(6504/236666 serialized/live bytes, 271 ops)
INFO [OptionalTasks:1] 2013-08-23 01:53:58,531 ColumnFamilyStore.java (line 630) Enqueuing flush of Memtable-rollups60@1943691367(638176/1048576 serialized/live bytes, 39894 ops)
INFO [OptionalTasks:1] 2013-08-23 01:53:58,531 ColumnFamilyStore.java (line 630) Enqueuing flush of Memtable-events@99567005(1133/1133 serialized/live bytes, 39 ops)
INFO [OptionalTasks:1] 2013-08-23 01:53:58,532 ColumnFamilyStore.java (line 630) Enqueuing flush of Memtable-rollups300@532892022(184296/1048576 serialized/live bytes, 7679 ops)
INFO [OptionalTasks:1] 2013-08-23 01:53:58,532 ColumnFamilyStore.java (line 630) Enqueuing flush of Memtable-ie@1309405764(457390051/152043520 serialized/live bytes, 16956160 ops)
INFO [OptionalTasks:1] 2013-08-23 01:53:58,823 ColumnFamilyStore.java (line 630) Enqueuing flush of Memtable-videoexpectedformat@1530999508(684/24557 serialized/live bytes, 12453 ops)
INFO [OptionalTasks:1] 2013-08-23 01:53:58,929 ColumnFamilyStore.java (line 630) Enqueuing flush of Memtable-failediecountforzone@411870848(9200/95294 serialized/live bytes, 284 ops)
INFO [OptionalTasks:1] 2013-08-23 01:53:59,012 ColumnFamilyStore.java (line 630) Enqueuing flush of Memtable-rollups86400@744253892(456/456 serialized/live bytes, 19 ops)
INFO [OptionalTasks:1] 2013-08-23 01:53:59,364 ColumnFamilyStore.java (line 630) Enqueuing flush of Memtable-peers@2024878954(2006/40629 serialized/live bytes, 452 ops)

I had tpstats running across all the nodes in my cluster every 5 seconds or so, and observed the following for the FlushWriter pool (columns are active / pending / completed / blocked / all-time blocked):

2013-08-23T01:53:47 192.168.131.227 FlushWriter 0 0 33 0 0
2013-08-23T01:53:55 192.168.131.227 FlushWriter 0 0 33 0 0
2013-08-23T01:54:00 192.168.131.227 FlushWriter 2 10 37 1 5
2013-08-23T01:54:07 192.168.131.227 FlushWriter 1 1 53 0 11
2013-08-23T01:54:12 192.168.131.227 FlushWriter 1 1 53 0 11

Now, I can increase memtable_flush_queue_size, but based on the above it seems that in order to solve the problem I would need to set it to count(CF). What's the downside of that approach? It feels like a backwards solution to the real problem...
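For reference, here is a sketch of the two cassandra.yaml knobs I understand to be involved. The values below are purely illustrative (not what I'm currently running), and the comments reflect my reading of the stock yaml:

# cassandra.yaml (excerpt) -- illustrative values only
# Number of memtable flush writer threads; each one can hold a full
# memtable in memory while it is blocked on disk I/O.
memtable_flush_writers: 1
# Number of "full" memtables allowed to queue up waiting for a flush
# writer before writes to the affected column families start blocking.
# The default is 4; the workaround I'm questioning is raising this to
# roughly count(CF).
memtable_flush_queue_size: 20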