> 86GB in commitlog and 42GB in data

 

Whoa, that seems really wrong, particularly given that your data spans 13 months.
Have you changed any of the default cassandra.yaml settings? What is the
maximum memtable_flush_after across all your CFs? Are there any warnings/errors
in the Cassandra log?
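
To double-check those two figures, here is a minimal Python sketch that sums the on-disk size of each directory. The helper names are my own, and the paths are only the usual defaults (an assumption); substitute the commitlog_directory and data_file_directories values from your cassandra.yaml.

```python
import os


def dir_size_bytes(path):
    """Sum the sizes of all regular files under path (0 if path is missing)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            # Files can vanish mid-walk (segments removed, SSTables compacted).
            try:
                total += os.path.getsize(full)
            except OSError:
                pass
    return total


def commitlog_to_data_ratio(commitlog_dir, data_dir):
    """Return (commitlog_bytes, data_bytes, ratio); ratio is None if data is empty."""
    commitlog = dir_size_bytes(commitlog_dir)
    data = dir_size_bytes(data_dir)
    return commitlog, data, (commitlog / data if data else None)


if __name__ == "__main__":
    # Assumed default locations, not taken from this thread.
    c, d, r = commitlog_to_data_ratio(
        "/var/lib/cassandra/commitlog", "/var/lib/cassandra/data"
    )
    print(f"commitlog={c / 2**30:.1f} GiB data={d / 2**30:.1f} GiB ratio={r}")
```

A ratio well above 1 on a node that has been running for a while would match what you are describing.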

 

> Out of curiosity, why do global flush thresholds work better than per-CF
> settings?  My first thought is that I would want finer grained controls as
> my CFs can be extremely different in write/read patterns.

 

By 'work better' I mean they maximize memtable sizes (i.e. minimize flushing)
without causing memory problems. The main reason to play with per-CF
settings is to make CFs flush more often than required, which is generally not
what you want to do (unless flushes are *currently* being triggered by the
per-CF settings).
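
To make that concrete, here is a toy Python model of the two approaches. It is my own simplification, not Cassandra's code: sizes are plain byte counters with no Java overhead, every name in it is hypothetical, and the global policy is assumed to flush the largest memtable whenever the total exceeds one shared budget.

```python
def simulate(rates, ticks, budget, per_cf_limits=None):
    """Toy flush model; returns (flush_count, peak_total_after_flushing).

    rates: bytes written to each CF's memtable per tick.
    per_cf_limits: if given, each CF flushes once its own memtable reaches
        its limit, and budget is ignored.
    Otherwise the largest memtable is flushed while the total exceeds budget.
    """
    sizes = [0] * len(rates)
    flushes = 0
    peak = 0
    for _ in range(ticks):
        for i, rate in enumerate(rates):
            sizes[i] += rate
        if per_cf_limits is not None:
            # Per-CF policy: every CF is checked against its own fixed limit.
            for i, limit in enumerate(per_cf_limits):
                if sizes[i] >= limit:
                    sizes[i] = 0
                    flushes += 1
        else:
            # Global policy: free memory by flushing the biggest memtable.
            while sum(sizes) > budget:
                biggest = max(range(len(sizes)), key=sizes.__getitem__)
                sizes[biggest] = 0
                flushes += 1
        peak = max(peak, sum(sizes))
    return flushes, peak


if __name__ == "__main__":
    rates = [90, 5, 5]  # one busy CF and two quiet ones (made-up numbers)
    # Splitting a 900-byte budget evenly is the "safe" per-CF configuration.
    print("per-CF:", simulate(rates, 100, 900, per_cf_limits=[300, 300, 300]))
    print("global:", simulate(rates, 100, 900))
```

With these made-up rates the shared budget lets the busy CF grow well past its 300-byte share while the quiet CFs are nearly empty, so it flushes less often overall and the total still stays within the budget. Fixed per-CF limits cannot do that without risking memory when every CF fills up at once.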

 

Dan

 

From: Derek Andree [mailto:dand...@lacunasystems.com] 
Sent: August-29-11 23:20
To: user@cassandra.apache.org
Subject: Re: Disk usage for CommitLog

 

Thanks Dan, good info.

> First off, what version of Cassandra are you using?

Sorry my bad, 0.8.4

> Provided you are using a recent Cassandra version (late 0.7 or 0.8.x) I
> doubt the commit log is your problem. My experience using Cassandra as a
> time series data store (with a full 30 days of data + various aggregations)
> has been that the commit log is a trivial fraction of the actual data. That
> said, it's highly dependent on how you use your data and when it expires/gets
> deleted (with considerations for gc_grace).

We keep 5 minute data on a few thousand "objects" for 13 months.  We also do
"rollup" aggregation for generating longer time period graphs and reports,
very RRD-like.  With a few months of data, I see 86GB in commitlog and 42GB
in data. But then again, this is while I'm still loading data as fast as I can
for a test case, so that may have something to do with it :)

> As one final point, as of 0.8, I would not recommend playing with per-CF
> flush settings. There are global thresholds which work far better and
> account for things like Java overhead.

Out of curiosity, why do global flush thresholds work better than per-CF
settings?  My first thought is that I would want finer grained controls as
my CFs can be extremely different in write/read patterns.

Thanks,
-Derek 
