Thanks Aaron, very useful. I'll give some of your suggestions a go...

On 16 May 2011 19:13, aaron morton <aa...@thelastpickle.com> wrote:
> I'd stick with the RandomPartitioner until you have a really good reason to change :)
>
> I'd also go with your alternative design, with some possible tweaks.
>
> Consider partitioning the rows by year or some other sensible value. If you will generally be getting the most recent data, this can reduce the need for Cassandra to read SSTables that contain the row key but do not contain any required columns.
>
> Depending on how the data is collected, consider storing all the data collected for a certain date in a single column using something like JSON. This would allow you to have a single column for each observation, and it makes it easier to use a SliceRange to get, say, all the observations from 01/05/2011.
>
> If you often want to read certain keys for a single day (or a few days), consider pivoting the data so the key is the date and the columns are the current row keys.
>
> Hope that helps.
>
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 15 May 2011, at 19:56, Charles Blaxland wrote:
>
> > Hi All,
> >
> > New to Cassandra, so apologies if I don't fully grok stuff just yet.
> >
> > I have data keyed by a key as well as a date. I want to run a query to get multiple keys across multiple contiguous date ranges simultaneously. I'm currently storing the date along with the row key like this:
> >
> > key1|2011-05-15 { c1 : , c2 : , c3 : ... }
> > key1|2011-05-16 { c1 : , c2 : , c3 : ... }
> > key2|2011-05-15 { c1 : , c2 : , c3 : ... }
> > key2|2011-05-16 { c1 : , c2 : , c3 : ... }
> > ...
> >
> > I generate all the key/date combinations that I'm interested in and use multiget_slice to retrieve them, pulling in all the columns for each key (I need all the data, but the number of columns is small: fewer than 100). The total number of row keys retrieved will only be 100 or so.
> >
> > Now it strikes me I could also store this using composite columns, like this:
> >
> > key1 { 2011-05-15|c1 : , 2011-05-16|c1 : , 2011-05-15|c2 : , 2011-05-16|c2 : , 2011-05-15|c3 : , 2011-05-16|c3 : , ... }
> > key2 { 2011-05-15|c1 : , 2011-05-16|c1 : , 2011-05-15|c2 : , 2011-05-16|c2 : , 2011-05-15|c3 : , 2011-05-16|c3 : , ... }
> > ...
> >
> > Then use multiget_slice again (but with fewer keys), and use a slice range to retrieve only the dates I'm interested in.
> >
> > Another alternative, I guess, would be to use OPP with the first storage approach and get_range_slices, but as I understand it this would not be great for performance due to keys being clustered together on a single node?
> >
> > So my question is, which approach is best? One downside to the latter, I guess, is that the number of columns grows without bound (although with 2 billion to play with this isn't going to be a problem any time soon). Also, multiget_slice supports only one slice predicate, so I guess I'd have to use multiple queries to get multiple date ranges.
> >
> > Anyway, any thoughts/tips appreciated.
> >
> > Thanks,
> > Charles
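
For anyone following the thread, here is a rough sketch of what the two layouts and Aaron's JSON tweak might look like from a pycassa (Thrift) client. The keyspace name, column family name, and the pipe-delimited key/column formats below are illustrative assumptions, not anything agreed in the thread, and they presume a UTF8Type (string) comparator so column names sort lexically.

```python
# Sketch only: assumes pycassa 1.x, a keyspace 'MyKeyspace' and a column
# family 'Observations' with a UTF8Type comparator (all hypothetical names).
import json
import pycassa

pool = pycassa.ConnectionPool('MyKeyspace', server_list=['localhost:9160'])
obs = pycassa.ColumnFamily(pool, 'Observations')

# --- Layout 1: date folded into the row key ("key1|2011-05-15") ---
# Generate every key/date combination up front, then multiget them all,
# pulling the full (fewer than ~100) set of columns per row.
keys = ['%s|%s' % (k, d)
        for k in ('key1', 'key2')
        for d in ('2011-05-15', '2011-05-16')]
rows = obs.multiget(keys, column_count=100)

# --- Layout 2: composite-style column names ("2011-05-15|c1") ---
# One row per key; because the comparator orders column names as strings,
# a single slice covers a contiguous date range for every requested key.
rows = obs.multiget(
    ['key1', 'key2'],
    column_start='2011-05-15',    # first date wanted
    column_finish='2011-05-17',   # day *after* the last date wanted
    column_count=10000)

# --- Aaron's tweak: pack one day's observations into a single JSON column ---
day_blob = json.dumps({'c1': 1.0, 'c2': 2.0, 'c3': 3.0})
obs.insert('key1', {'2011-05-15': day_blob})
```

Note that with plain string concatenation the date portion has to be zero-padded ISO format (2011-05-06, not 2011-5-6) so that lexical order matches chronological order; a real CompositeType comparator would give you typed ordering instead.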