I run a single-node Cassandra instance with lots of overwrites on a hot CF, 
and disk utilization seems to grow pretty fast.  We've noticed that when we 
restart Cassandra, disk utilization drops dramatically (by something close to 
50%).  Most of this growth seems to be in the commitlog directory, whose logs 
are replayed when Cassandra starts and then removed.
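
To put numbers on that, here is a rough sketch of how I could compare the two 
directories before and after a restart.  The paths are assumptions (the stock 
defaults as far as I know); ours are whatever storage-conf.xml points at.

```python
import os

def dir_size_bytes(path):
    """Total size of regular files under path, in bytes."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            try:
                total += os.path.getsize(full)
            except OSError:
                # File vanished between listing and stat
                # (for example, a commit log segment was deleted).
                pass
    return total

def report(paths):
    """Return {label: size in MB} for each labeled directory."""
    mb = 1024.0 * 1024.0
    return {label: dir_size_bytes(p) / mb for label, p in paths.items()}

if __name__ == "__main__":
    # Assumed locations; adjust to match storage-conf.xml.
    paths = {
        "commitlog": "/var/lib/cassandra/commitlog",
        "data": "/var/lib/cassandra/data",
    }
    for label, size_mb in sorted(report(paths).items()):
        print("%-10s %10.1f MB" % (label, size_mb))
```

Running it once before and once after a restart should show how much of the 
roughly 50% drop comes from the commitlog directory alone.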

So I understand that writes go to the commit log, then to a memtable, and then 
to an SSTable.  I'm curious when the commit logs get cleaned up: is it during 
a compaction, or once everything in the commit log has been written to an 
SSTable?  Is there an easy way to keep commit log size down without killing 
performance?

I've read this:

http://wiki.apache.org/cassandra/MemtableThresholds

Since larger memtables help to absorb overwrites, I'd like to increase 
MemtableThroughputInMB and maybe play with MemtableOperationsInMillions as 
well, but I'm wondering if this will lead to even more dramatic disk 
utilization in the commitlog directory.  It seems like larger memtables would 
naturally mean more disk used by the commit logs.

Our write load is very predictable and always the same: tons of writes for 
time series statistics every 5 minutes.
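
Because the load is that regular, one thing I could do is sample the directory 
size once per 5-minute cycle and work out an average growth rate.  A minimal 
sketch of that arithmetic follows; the function names are mine and the sample 
figures are made up for illustration, not measurements from our node.

```python
def growth_per_interval(samples_mb):
    """Average change in size between consecutive samples (MB per interval)."""
    if len(samples_mb) < 2:
        raise ValueError("need at least two samples")
    deltas = [b - a for a, b in zip(samples_mb, samples_mb[1:])]
    return sum(deltas) / float(len(deltas))

def projected_mb(current_mb, rate_mb, intervals):
    """Linear projection: current size plus rate times number of intervals."""
    return current_mb + rate_mb * intervals

if __name__ == "__main__":
    # Hypothetical sizes in MB, one sample per 5-minute write cycle.
    samples = [100.0, 130.0, 160.0, 190.0]
    rate = growth_per_interval(samples)
    # 12 intervals of 5 minutes = one hour ahead.
    print("rate per cycle: %.1f MB" % rate)
    print("projected in 1h: %.1f MB" % projected_mb(samples[-1], rate, 12))
```

This assumes growth is linear between cleanups, which is exactly the part I'm 
unsure about, since the logs apparently do get removed at some point.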

While I'm fine with commit logs temporarily growing in size, I'm wondering if 
we should be forcing compactions, forcing GC, or doing some other form of 
cleanup to keep them from getting too big.  Mainly I just need to know how 
much disk utilization I can expect from a given number of writes, and whether 
there is some "fudge factor" I should account for with commit logs.  Any 
advice appreciated.

Thanks,
-Derek
