Hey, I'm sorry, I think I didn't make myself clear enough. I'm using Cassandra only to store the _results_ (the calculated time series), not the source data. Also, using "Beginner Assumptions" as the subject probably wasn't the best choice, since I'm more interested in the inner workings of Cassandra than in how to use it. ;)

> And the per hour counts are stored as json?

No, they are stored as fixed-size byte arrays (96 bytes = 24 x 4-byte integers).

> cassandra.get("/page/1", Slice("20100612"..."20100613"))

I know how to do it in Cassandra; I was just comparing it to others. What I wanted to know is whether

    cassandra.get("/page/1", :start => "20100612", :count => 90)

is actually just as fast as

    cassandra.get("/page/1", Slice("20100612", "20100613", ...))

with 90 keys.

>
>> Assumption #3:
> I doubt you data will grow at a fixed rate per row. (Unless you have
> always the same hit pattern for your pages) But you should be able to
> able to calculated the maximal required storage requirement. That said
> - I am wondering... where are you aggregating the counts per hour?

The data is currently just stored in logfiles, which are parsed once an hour in a map/reduce-like fashion (not stored in Cassandra). Even if there are no values to be saved, there will still be a column for this row with [0, 0, 0, ...]. I also do not need to increment any of those counters live. Hit patterns don't matter, since 1 million views per hour consume exactly the same space as 0 views (96 bytes fixed). I may at some point remove the zero values to save space, but right now there is always one column per day per row.

>
> So you want to increment those counters per hit? I don't think there
> is an atomic increment semantic in cassandra yet. (Some one else to
> confirm?)

No, see above. Each view generates one entry in a logfile which is append-only (much like the Cassandra commitlog). Incrementing those counters live is very unlikely to happen, since they are just one part of the whole log map/reduce thing. The offline processing part is not moving into Cassandra anytime soon; I just wanna put the results somewhere. SQL is fine for that (atm), but I was interested in some NoSQL and this seemed like a good use case (very structured data, only accessed by keys or key ranges, and the key is always known, aka no dynamic queries).

Cheers,
/thomas
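
P.S. In case the value layout is unclear, here is a minimal sketch of how one such 96-byte column value (one day, 24 hourly counts) could be packed and unpacked in Ruby. It assumes unsigned 32-bit big-endian integers; the byte order is an assumption for illustration, and pack_hourly_counts / unpack_hourly_counts are hypothetical helper names, not part of any Cassandra client.

```ruby
# Hypothetical helpers for illustration; not part of any Cassandra client.
HOURS_PER_DAY = 24

# Pack 24 hourly counts into one fixed-size 96-byte string.
# "N" = 32-bit unsigned integer, big-endian (assumed byte order).
def pack_hourly_counts(counts)
  unless counts.size == HOURS_PER_DAY
    raise ArgumentError, "expected #{HOURS_PER_DAY} counts"
  end
  counts.pack("N#{HOURS_PER_DAY}")
end

# Reverse of pack_hourly_counts: 96 bytes back into 24 integers.
def unpack_hourly_counts(bytes)
  unless bytes.bytesize == HOURS_PER_DAY * 4
    raise ArgumentError, "expected #{HOURS_PER_DAY * 4} bytes"
  end
  bytes.unpack("N#{HOURS_PER_DAY}")
end

# A day without any views and a very busy day take the same space.
empty_day = pack_hourly_counts([0] * HOURS_PER_DAY)
busy_day  = pack_hourly_counts([1_000_000] * HOURS_PER_DAY)
puts empty_day.bytesize  # 96
puts busy_day.bytesize   # 96
```

The packed string would then be written as the value of the column named after the day (e.g. "20100612"), which is why the storage per row grows by a fixed amount per day regardless of traffic.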