Thanks Aaron, very useful. I'll give some of your suggestions a go...

On 16 May 2011 19:13, aaron morton <aa...@thelastpickle.com> wrote:

> I'd stick with the RandomPartitioner until you have a really good reason to
> change :)
>
> I'd also go with your alternative design with some possible tweaks.
>
> Consider partitioning the rows by year or some other sensible value. If
> you will generally be getting the most recent data, this can reduce the
> need for Cassandra to read SSTables that contain the row key but do not
> contain any of the required columns.
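>
> For example (just a sketch, with made-up names), the bucketing can be as
> simple as appending the year to the key:
>
>     # hypothetical helper: one row per key per year, e.g. "key1|2011"
>     # ('day' is assumed to be a datetime.date)
>     def row_key(key, day):
>         return "%s|%d" % (key, day.year)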
>
> Depending on how the data is collected, consider storing all the data
> collected for a certain date in a single column using something like JSON.
> This would allow you to have a single column for each observation, which
> makes it easier to use a SliceRange to get, say, all the observations from
> 01/05/2011.
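>
> Something along these lines (pycassa, untested, made-up keyspace and
> column family names; assumes a UTF8Type comparator so the date strings
> sort correctly):
>
>     import json
>     import pycassa
>
>     pool = pycassa.ConnectionPool('MyKeyspace')
>     obs = pycassa.ColumnFamily(pool, 'Observations')
>
>     # write: one JSON blob per day, row bucketed by year as above
>     obs.insert('key1|2011', {'2011-05-15': json.dumps({'c1': 1, 'c2': 2})})
>
>     # read: a single slice pulls back every day in the range
>     row = obs.get('key1|2011',
>                   column_start='2011-05-01',
>                   column_finish='2011-05-31')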
>
> If you often want to read certain keys for a single day (or a few days),
> consider pivoting the data so that the key is the date and the columns are
> the current row keys.
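>
> Again just a sketch (invented CF name, reusing the pool from the previous
> sketch):
>
>     by_day = pycassa.ColumnFamily(pool, 'ObservationsByDay')
>
>     # one row per day, one column per original key
>     by_day.insert('2011-05-15', {'key1': json.dumps({'c1': 1, 'c2': 2})})
>
>     # a single multiget then covers a few days for every key at once
>     rows = by_day.multiget(['2011-05-15', '2011-05-16'])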
>
> Hope that helps.
>
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 15 May 2011, at 19:56, Charles Blaxland wrote:
>
> > Hi All,
> >
> > New to Cassandra, so apologies if I don't fully grok stuff just yet.
> >
> > I have data keyed by a key as well as a date. I want to run a query to
> get multiple keys across multiple contiguous date ranges simultaneously. I'm
> currently storing the date along with the row key like this:
> >
> > key1|2011-05-15 {  c1 : , c2 :,  c3 : ... }
> > key1|2011-05-16 {  c1 : , c2 :,  c3 : ... }
> > key2|2011-05-15 {  c1 : , c2 :,  c3 : ... }
> > key2|2011-05-16 {  c1 : , c2 :,  c3 : ... }
> > ...
> >
> > I generate all the key/date combinations that I'm interested in and use
> multiget_slice to retrieve them, pulling in all the columns for each key (I
> need all the data, but the number of columns is small: fewer than 100). The
> total number of row keys retrieved will only be 100 or so.
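> >
> > In pycassa it looks roughly like this (untested sketch, made-up keyspace
> > and column family names):
> >
> >     import pycassa
> >
> >     pool = pycassa.ConnectionPool('MyKeyspace')
> >     obs = pycassa.ColumnFamily(pool, 'Observations')
> >
> >     # build every key|date combination of interest, then fetch in one call
> >     keys = ['%s|%s' % (k, d)
> >             for k in ('key1', 'key2')
> >             for d in ('2011-05-15', '2011-05-16')]
> >     rows = obs.multiget(keys)   # all columns for each row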
> >
> > Now it strikes me I could also store this using composite columns, like
> this:
> >
> > key1 { 2011-05-15|c1 : , 2011-05-16|c1 : , 2011-05-15|c2 : , 2011-05-16|c2
> : , 2011-05-15|c3 : , 2011-05-16|c3 : , ... }
> > key2 { 2011-05-15|c1 : , 2011-05-16|c1 : , 2011-05-15|c2 : , 2011-05-16|c2
> : , 2011-05-15|c3 : , 2011-05-16|c3 : , ... }
> > ...
> >
> > Then use multiget_slice again (but with fewer keys), and use a slice range
> to retrieve only the dates I'm interested in.
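> >
> > Roughly (same sketch assumptions as above, plus a string comparator so the
> > "date|column" names sort by date first):
> >
> >     # one slice per contiguous date range, applied to both keys at once;
> >     # '~' sorts after '|', so the finish bound includes all of 2011-05-16
> >     rows = obs.multiget(['key1', 'key2'],
> >                         column_start='2011-05-15',
> >                         column_finish='2011-05-16~')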
> >
> > Another alternative I guess would be to use OPP with the first storage
> approach and get_range_slices, but as I understand it this would not be
> great for performance due to keys being clustered together on a single node?
> >
> > So my question is, which approach is best? One downside to the latter I
> guess is that the number of columns grows without bound (although with 2
> billion to play with this isn't gonna be a problem any time soon). Also
> multiget_slice supports only one slice predicate, so I guess I'd have to
> use multiple queries to get multiple date ranges.
> >
> > Anyway, any thoughts/tips appreciated.
> >
> > Thanks,
> > Charles
> >
>
>
