On Tue, 2012-05-01 at 11:00 -0700, Aaron Turner wrote:
> Tens or a few hundred MB per row seems reasonable. You could do
> thousands of MB if you wanted to, but that can make things harder to
> manage.
thanks (Both Aarons)

> Depending on the size of your data, you may find that the overhead of
> each column becomes significant; far more than the per-row overhead.
> Since all of my data is just 64bit integers, I ended up taking a day's
> worth of values (288/day @ 5min intervals) and storing it as a single
> column as a vector.

By "vector" do you mean a raw binary array of long ints (rough sketch at
the bottom of this mail)? That sounds very nice for reducing overhead -
but I'd like it to work with counters (I was going to rely on them for
streaming "real-time" updates). Is that why you've got the two CFs
described below (an archived summary plus a live version that can have
counters), or do you have no contention over writes/increments for
individual values?

> Hence I have two CF's:
>
> StatsDaily -- each row == 1 day, each column = 1 stat @ 5min intervals
> StatsDailyVector -- each row == 1 year, each column = 288 stats @ 1
> day intervals
>
> Every night a job kicks off and converts each row's worth of
> StatsDaily into a column in StatsDailyVector. By doing it 1:1 this
> way, I also reduce the number of tombstones I need to write in
> StatsDaily since I only need one tombstone for the row delete, rather
> than 288 for each column deleted.
>
> I don't use compression.
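
For concreteness, here's roughly what I'm picturing (a minimal sketch,
assuming "vector" just means the 288 longs packed big-endian into one
byte array and stored as a single column value -- the class and method
names are mine, and the actual Cassandra read/write calls are left out):

    import java.nio.ByteBuffer;
    import java.util.Arrays;

    public class StatsVector {
        // 288 five-minute slots per day
        static final int SLOTS = 288;

        // Pack a day's worth of 64bit stats into one column value.
        static byte[] pack(long[] stats) {
            ByteBuffer buf = ByteBuffer.allocate(stats.length * 8); // big-endian by default
            for (long v : stats) {
                buf.putLong(v);
            }
            return buf.array();
        }

        // Unpack a stored column value back into individual stats.
        static long[] unpack(byte[] value) {
            ByteBuffer buf = ByteBuffer.wrap(value);
            long[] stats = new long[value.length / 8];
            for (int i = 0; i < stats.length; i++) {
                stats[i] = buf.getLong();
            }
            return stats;
        }

        public static void main(String[] args) {
            long[] day = new long[SLOTS];
            Arrays.fill(day, 42L);
            byte[] column = pack(day);                       // 288 * 8 = 2304 bytes
            System.out.println(column.length);               // 2304
            System.out.println(unpack(column)[0]);           // 42
        }
    }

If that's about right, then I assume the nightly job just reads the 288
StatsDaily columns for a row, packs them like this into one column in
StatsDailyVector, and issues the single row delete.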