On Tue, Jul 10, 2012 at 12:14 PM, Roland Hänel <rol...@haenel.me> wrote:
> Hi,
>
> I have an application that consists of multiple (possibly thousands of)
> measurement series, and each measurement series generates a small amount of
> data output (only about 500 bytes) every 10 seconds. This time series of
> data should be stored in Cassandra in such a way that read access is
> possible for a given time range.
>
> What I do today is
>   - assign a timeuuid to each data output
>   - write in two CFs:
>       - first CF has key = measurement series ID, column name =
>         timeuuid_of_output
>       - second CF has key = timeuuid_of_output, column value = data
>         output (~ 500 bytes)
>
> When someone requests a time range of data, I read the first CF, get a
> series of timeuuids, and then do a row multiget on the second CF.
>
> This works great, but tends to be slow for big series of data (let's say
> for 10 days, nearly 100,000 records will be requested from the second CF).
> This load of 100,000 reads will be distributed through the cluster (because
> the second CF scales very nicely with a RandomPartitioner), but more or
> less one ends up with 100,000 individual read requests, at least that's
> what I suspect.
>
> Can anyone say if there is a better data model for this type of query?
> Would it be a reasonable improvement to put all data into a single CF with
>
>   - single CF, key = measurement series ID, column name =
>     timeuuid_of_output, column value = data output
>
> When I request a series of 100,000 columns from this row (now it's a
> single row), can the performance really be better? Is there any chance that
> Cassandra will be able to read this data "en bloc" from the hard drive?
>

This is definitely the approach I would take. Reading a single row is
nearly sequential, so you'll get very good performance.

I recommend you check these out:
- http://rubyscale.com/blog/2011/03/06/basic-time-series-with-cassandra/
- http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra

-- 
Tyler Hobbs
DataStax <http://datastax.com/>
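
For illustration only, here is a minimal sketch of the single-CF layout discussed above (key = measurement series ID, column name = time of output, column value = data output). It is a toy in-memory model, not a Cassandra client: the `WideRowStore` class, its method names, and the use of integer seconds in place of timeuuids are all assumptions made for the example. It shows why a time-range read becomes one contiguous slice of a sorted row instead of one lookup per record.

```python
import bisect
from collections import defaultdict


class WideRowStore:
    """Toy stand-in for one column family (hypothetical, not a Cassandra API)."""

    def __init__(self):
        # row key (series ID) -> list of (timestamp, value), kept sorted by timestamp
        self._rows = defaultdict(list)

    def insert(self, series_id, timestamp, value):
        # Columns within a row stay ordered by column name (here: the timestamp).
        bisect.insort(self._rows[series_id], (timestamp, value))

    def slice(self, series_id, start, end):
        # One range read over a single row: all columns with start <= t < end.
        row = self._rows.get(series_id, [])
        lo = bisect.bisect_left(row, (start,))
        hi = bisect.bisect_left(row, (end,))
        return row[lo:hi]


INTERVAL_S = 10                # one output every 10 seconds
TEN_DAYS_S = 10 * 24 * 3600    # 864,000 seconds
PAYLOAD = b"x" * 500           # ~500 bytes per output (placeholder data)

store = WideRowStore()
for t in range(0, TEN_DAYS_S, INTERVAL_S):
    store.insert("series-1", t, PAYLOAD)

# A 10-day request is a single slice of one row, not one read per record.
columns = store.slice("series-1", 0, TEN_DAYS_S)
print(len(columns))  # 86400, i.e. the "nearly 100,000" records mentioned above
```

In the two-CF layout the same request would need one index read plus 86,400 separate row lookups; in this layout it is a single ordered range over one row, which is the access pattern the reply describes as nearly sequential.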