I run a single-node Cassandra instance with lots of overwrites on a hot CF, and disk utilization seems to grow pretty fast. We've noticed that when we restart Cassandra, disk utilization drops dramatically (by something close to 50%). Most of the growth seems to be in the commitlog directory, whose segments are replayed when Cassandra starts and then removed.
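To put numbers on this rather than eyeballing it, something like the rough sketch below could be run before and after a restart. It only sums file sizes under each directory. The paths and the helper names (dir_size_bytes, report) are placeholders for illustration, not anything Cassandra ships; substitute the commitlog and data directories from your own config.

```python
import os

def dir_size_bytes(path):
    """Return the total size in bytes of all files under path (recursive)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                # The file disappeared between listing and stat,
                # e.g. a commit log segment was removed meanwhile.
                pass
    return total

def report(dirs):
    """Map each label to the size in bytes of its directory."""
    return {label: dir_size_bytes(path) for label, path in dirs.items()}

if __name__ == "__main__":
    # Hypothetical locations; use the directories from your own config.
    dirs = {
        "commitlog": "/var/lib/cassandra/commitlog",
        "data": "/var/lib/cassandra/data",
    }
    for label, size in report(dirs).items():
        print("%-10s %10.1f MB" % (label, size / (1024.0 * 1024.0)))
```

Running it from cron every 5 minutes, alongside the write bursts, would show whether the commitlog directory keeps growing or levels off.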
As I understand it, a write goes to the commit log and the memtable, and the memtable is later flushed to an SSTable. What I'm curious about is when the commit logs get cleaned up: is it during a compaction, or as soon as everything in a commit log has been written to an SSTable? Is there an easy way to keep commit log size down without killing performance?

I've read this: http://wiki.apache.org/cassandra/MemtableThresholds

Since larger memtables help to absorb overwrites, I'd like to increase MemtableThroughputInMB and maybe play with MemtableOperationsInMillions as well, but I'm wondering if this will lead to even more dramatic disk utilization in the commitlog directory. It seems like larger memtables would naturally mean more disk used by the commit logs, since data stays unflushed for longer.

Our write load is very predictable and always the same: tons of writes for time series statistics every 5 minutes. While I'm fine with the commit logs growing temporarily, I'm wondering if we should be forcing compactions, forcing GC, or doing some other form of cleanup to keep them from getting too big.

Mainly I just need to know how much disk utilization I can expect from a given number of writes, and whether there is some "fudge factor" I should account for with commit logs.

Any advice appreciated.

Thanks,
-Derek