First off, what version of Cassandra are you using?

> We've noticed that when we restart cassandra disk utilization decreases dramatically

Presumably you mean 'utilization' as in disk space used. On a restart specifically, this type of behavior is likely due to Cassandra deleting compacted SSTables. Compacted, and therefore unused, SSTables are deleted A) on a full GC, B) when requested due to insufficient space, and C) on a restart.

> I'm curious when the CommitLogs get cleaned up, is it during a compaction or is it when everything in the commit log is written to SSTable

I don't know the specifics of what actually triggers a commit log segment to be deleted, but segments are eligible for deletion once all the memtable data they cover has been flushed (i.e., your second point). The default settings flush CFs at least every 60 minutes, so you should plan for commit logs sticking around for about 60 minutes.

Provided you are using a recent Cassandra version (late 0.7 or 0.8.x), I doubt the commit log is your problem. My experience using Cassandra as a time series data store (with a full 30 days of data + various aggregations) has been that the commit log is a trivial fraction of the actual data. That said, it's highly dependent on how you use your data and when it expires or gets deleted (with considerations for gc_grace).

As one final point, as of 0.8, I would not recommend playing with per-CF flush settings. There are global thresholds which work far better and account for things like Java overhead.

On Mon, Aug 29, 2011 at 9:04 PM, Derek Andree <dand...@lacunasystems.com> wrote:

> I run a single node cassandra instance, and we have lots of overwrites on a
> hot CF and disk utilization seems to grow pretty fast. We've noticed that
> when we restart cassandra disk utilization decreases dramatically (dramatic
> being something close to 50%). Most of this growth seems to be in the
> commitlog directory which are replayed when cassandra starts, then removed.
>
> So I understand that writes go to commit log and then to memtable, then to
> SSTable.
> I'm curious when the CommitLogs get cleaned up, is it during a
> compaction or is it when everything in the commit log is written to SSTable?
> Is there an easy way to keep commit log size down without killing
> performance?
>
> I've read this:
>
> http://wiki.apache.org/cassandra/MemtableThresholds
>
> Since larger memtables help to absorb overwrites, I'd like to increase
> MemTableThroughputInMB and maybe play with MemtableOperationsInMillions as
> well, but I'm wondering if this will lead to even more dramatic disk
> utilization in the commitlog directory. It seems like larger memtables
> would naturally mean more disk utilization by the commit logs.
>
> Our write load is very predictable and always the same, tons of writes for
> time series statistics every 5 minutes.
>
> While I'm fine with temporary commit logs growing in size, I'm wondering if
> we should be forcing compactions, forcing GC, or doing some form of cleanup
> to keep them from getting too big. Mainly I just need to know how much disk
> utilization I can expect from a given number of writes, and I'm wondering if
> there is some "fudge factor" I should account for with commit logs. Any
> advice appreciated.
>
> Thanks,
> -Derek
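
On the "fudge factor" question, here is a rough back-of-the-envelope sketch in Python, not anything Cassandra ships. It assumes commit log data stays on disk until the next flush (the roughly 60 minute window mentioned in the reply) and that segments are removed whole. The function name, the per-write size, the 128 MB segment size, and the one extra segment allowed for the segment currently being written are all assumptions for illustration; check your own configuration and measure real usage before relying on the numbers.

```python
import math


def estimate_commitlog_mb(writes_per_batch, bytes_per_write,
                          batch_interval_min=5, flush_window_min=60,
                          segment_mb=128):
    """Rough upper bound (in MB) on commit log disk usage.

    Hypothetical helper: every parameter is an assumption supplied by
    the caller, not a value read from Cassandra.
    """
    # Number of write batches that can pile up before a flush makes
    # the older segments eligible for deletion.
    batches = math.ceil(flush_window_min / batch_interval_min)

    # Raw bytes written in that window, ignoring per-entry overhead.
    raw_mb = batches * writes_per_batch * bytes_per_write / (1024 * 1024)

    # Segments are assumed to be deleted whole, so round up, and allow
    # one extra segment for the one currently being written.
    segments = math.ceil(raw_mb / segment_mb) + 1
    return segments * segment_mb


if __name__ == "__main__":
    # Example: 100,000 writes of about 200 bytes every 5 minutes.
    print(estimate_commitlog_mb(100_000, 200), "MB")
```

If the measured commitlog directory is far larger than an estimate like this, that points at flushes not happening as often as expected rather than at write volume.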