Re: data aggregation in Cassandra

aaron morton Sun, 27 Mar 2011 14:27:43 -0700

You can do range based queries inside of one row. For example, one row holds all 
of the observations for one day, and each observation is stored as a column 
whose name (or at least the start of it) is the time of the observation. You can 
have up to 2 billion columns in one row, and the column names are sorted according 
to the comparator you specify.
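
As a rough illustration of the idea (an in-memory model only, not Cassandra 
client code), the sketch below treats one row as a list of columns kept sorted 
by name and answers a time range with a slice. The DayRow class, its method 
names, the "HH:MM:SS" column naming and the inclusive bounds are all 
assumptions made for the example.

```python
import bisect


class DayRow:
    """Hypothetical model of one wide row: columns kept sorted by name."""

    def __init__(self):
        self._names = []   # column names, always in sorted order
        self._values = []  # column values, parallel to _names

    def insert(self, name, value):
        # Find where the column belongs so the row stays sorted.
        i = bisect.bisect_left(self._names, name)
        if i < len(self._names) and self._names[i] == name:
            self._values[i] = value  # same column name: overwrite
        else:
            self._names.insert(i, name)
            self._values.insert(i, value)

    def slice(self, start, finish):
        # Columns with start <= name <= finish (inclusive in this model).
        lo = bisect.bisect_left(self._names, start)
        hi = bisect.bisect_right(self._names, finish)
        return list(zip(self._names[lo:hi], self._values[lo:hi]))


# One row per day; fixed-width "HH:MM:SS" names sort chronologically.
row = DayRow()
row.insert("09:15:00", 3.2)
row.insert("08:00:00", 1.0)
row.insert("12:30:00", 7.5)

morning = row.slice("08:30:00", "12:30:00")
# morning == [("09:15:00", 3.2), ("12:30:00", 7.5)]
```

Because the names are already ordered, the range read touches only the 
columns between the two bounds rather than scanning the whole row.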


If you were to use the OPP with, say, a time stamp for the key, it's going to be 
difficult to balance the ring. The new writes will happen in the highest range 
of the ring, so they would be concentrated in the last few nodes in your ring. 
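
To make the hot spot concrete, here is a simplified sketch (not Cassandra's 
actual partitioner code). It assumes a four node ring, made-up node tokens, 
and a token-to-node rule of "first node token greater than or equal to the 
key's token, wrapping around". The RP-style token is modelled as the MD5 of 
the key read as an integer, and the OPP-style token as the raw key.

```python
import bisect
import hashlib
from collections import Counter


def rp_token(key):
    # RP-style token (simplified): MD5 of the key as a 128-bit integer.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


def opp_token(key):
    # OPP-style token: the raw key, so key order is preserved on the ring.
    return key


def first_replica(token, node_tokens):
    # Index of the first node whose token is >= the key's token,
    # wrapping to node 0 past the end of the ring.
    i = bisect.bisect_left(node_tokens, token)
    return i % len(node_tokens)


NODES = 4
# Hypothetical node tokens: evenly spaced hashes for RP, month prefixes for OPP.
rp_nodes = [i * 2**128 // NODES for i in range(NODES)]
opp_nodes = ["2011-01", "2011-02", "2011-03", "2011-04"]

# One hour of time stamp keys, one key per second.
keys = ["2011-03-27T14:%02d:%02d" % (m, s) for m in range(60) for s in range(60)]

opp_load = Counter(first_replica(opp_token(k), opp_nodes) for k in keys)
rp_load = Counter(first_replica(rp_token(k), rp_nodes) for k in keys)

print("OPP writes per node:", dict(opp_load))  # every key lands on one node
print("RP writes per node: ", dict(rp_load))   # keys spread over the nodes
```

With time stamp keys every new write sorts after the previous one, so under 
the OPP-style mapping they all fall into the same token range.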

A lot depends on your work load. But I would recommend starting with the RP and 
partitioning the data into rows based on something like a day. 
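
A minimal sketch of what that could look like on the client side, assuming 
"YYYYMMDD" row keys and "HH:MM:SS" column names (both made up for the 
example, as are the function names). A time range query becomes one column 
slice per day row, and with RP those row keys hash to different places on 
the ring.

```python
from datetime import datetime, timedelta


def row_key(ts):
    # One row per day; this key is what RP would hash.
    return ts.strftime("%Y%m%d")


def plan_range_query(start, end):
    """Return (row_key, first_column, last_column) for each day in [start, end]."""
    plan = []
    day = start.replace(hour=0, minute=0, second=0, microsecond=0)
    while day <= end:
        day_end = day + timedelta(days=1) - timedelta(microseconds=1)
        lo = max(start, day)    # clip the slice to the requested start
        hi = min(end, day_end)  # clip the slice to the requested end
        plan.append((row_key(day), lo.strftime("%H:%M:%S"), hi.strftime("%H:%M:%S")))
        day += timedelta(days=1)
    return plan


plan = plan_range_query(datetime(2011, 3, 26, 22, 0, 0),
                        datetime(2011, 3, 28, 6, 30, 0))
for key, first, last in plan:
    print(key, first, last)
# 20110326 22:00:00 23:59:59
# 20110327 00:00:00 23:59:59
# 20110328 00:00:00 06:30:00
```

Each tuple maps to one in-row slice, so a query over several weeks is a few 
dozen row reads rather than a range scan across keys.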
  
Hope that helps.
Aaron
On 27 Mar 2011, at 15:49, Saurabh Sehgal wrote:

> Thanks for the reply. The reason I want to go with OPP is to do range based 
> queries on time. All queries against the data are going to be time based. 
> With an RP partitioning scheme, will it be efficient to do range based 
> queries ? 
> 
> On Mar 26, 2011 9:12 PM, "aaron morton" <aa...@thelastpickle.com> wrote:
> > If you are using OPP you will need to understand how to balance the data 
> > around the ring. Start with RP until you have an idea why it's not working 
> > for you. The RP will transform the key with a hash function, which is then 
> > compared to the node tokens to locate the first replica for the data. The 
> > OPP uses the raw key. see 
> > http://wiki.apache.org/cassandra/Operations#Ring_management and 
> > http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/
> > 
> > 
> > Reading 20 to 30 million records will take a while. Perhaps look at 
> > http://www.slideshare.net/kevinweil/rainbird-realtime-analytics-at-twitter-strata-2011
> >  and http://www.datastax.com/products/brisk for background. 
> > 
> > Consider how you can denormalise to support your queries, e.g. in a CF use 
> > keys such as "attr1/value", with the column name as the time stamp and the 
> > column value as the stuff you need (you could pack all the data you need 
> > into a structure like JSON).
> > 
> > CFs have a (potentially) large memory overhead. Use fewer and store mixed 
> > but related content in them. 
> > 
> > Hope that helps. 
> > Aaron
> > 
> > 
> > On 26 Mar 2011, at 05:38, Saurabh Sehgal wrote:
> > 
> >> Thanks for all the responses. 
> >> 
> >> My leading questions then are ->
> >> 
> >> - Should I go with the OrderPreservingPartitioner based on timestamps so I 
> >> can do time range queries - is this recommended ? any special cases 
> >> regarding load balancing I need to keep in mind ? I have read buzz over 
> >> blogs/forums on how RandomPartitioner yields better load balancing, and it 
> >> is discouraged to use OrderPreservingPartitioner. Can someone 
> >> expand/comment on this ?
> >> 
> >> - Also, let's say I query all partitioned data between timestampuuid1 and 
> >> timestampuuid2 (over several weeks). This would potentially, in my 
> >> case, return anywhere from 20 to 30 million records. How would I go about 
> >> aggregating this data "by hand"? Will this perform?
> >> 
> >> Since I am only interested in aggregating over a finite set of 10-20 
> >> attributes, does it make more sense to have a column family per finite 
> >> attribute ? In this case, I do not need to do any aggregation, since all 
> >> the data for that attribute resides in one column family. Is there an 
> >> upper bound to the number of column families Cassandra currently supports ?
> >> 
> >> 
> >> 
> >> On Fri, Mar 25, 2011 at 7:31 AM, buddhasystem <potek...@bnl.gov> wrote:
> >> Hello Saurabh,
> >> 
> >> I have a similar situation, with a more complex data model, and I do an
> >> equivalent of map-reduce "by hand". The redeeming value is that you have
> >> complete freedom in how you hash, and you design the way you store indexes
> >> and similar structures. If there is a pattern in data store, you use it to
> >> your advantage. In the end, you get good performance.
> >> 
> >> --
> >> View this message in context: 
> >> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/data-aggregation-in-Cassandra-tp6206994p6207879.html
> >> Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
> >> Nabble.com.
> >
