If you are using the OPP you will need to understand how to balance the data 
around the ring; start with the RP until you have a clear idea of why it is not 
working for you. The RP transforms the key with a hash function (MD5), and the 
resulting token is compared to the node tokens to locate the first replica for 
the data. The OPP uses the raw key as the token. See 
http://wiki.apache.org/cassandra/Operations#Ring_management and 
http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/
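
To make the difference concrete, here is a rough sketch (plain Python; the four 
node tokens and example keys are made up, and the token maths only approximates 
what Cassandra does internally):

```python
import hashlib
from bisect import bisect_left

# Hypothetical 4-node ring under the RandomPartitioner: node tokens are
# evenly spaced points on the 0 .. 2**127 ring (values are made up).
RING = 2 ** 127
NODE_TOKENS = sorted((RING // 4) * i for i in range(4))

def rp_token(key):
    # RP derives the token from an MD5 hash of the raw key, so keys spread
    # evenly around the ring regardless of what the keys look like.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % RING

def first_replica(token, node_tokens):
    # First replica = the node owning the first token >= the key's token,
    # wrapping back to the start of the ring if we run off the end.
    i = bisect_left(node_tokens, token)
    return i % len(node_tokens)

for key in ("user1", "user2", "2011-03-26T05:38"):
    print(key, "-> node", first_replica(rp_token(key), NODE_TOKENS))

# Under the OPP the raw key itself acts as the "token", so time-ordered keys
# cluster on one part of the ring and you must choose (and move) node tokens
# by hand to keep the load balanced.
```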

 
Reading 20 to 30 million records will take a while. Perhaps look at 
http://www.slideshare.net/kevinweil/rainbird-realtime-analytics-at-twitter-strata-2011 
and http://www.datastax.com/products/brisk for background. 

Consider how you can denormalise to support your queries. For example, in a CF 
use row keys such as "attr1/value", the column name as the timestamp, and the 
column value as the data you need (you could pack everything you need into a 
structure such as JSON).
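
A minimal sketch of that layout (plain Python with an in-memory dict standing 
in for the CF; the attribute names and the record_event helper are made up, and 
a real write would go through your client library, e.g. Hector or pycassa):

```python
import json
import time
import uuid

# Toy in-memory stand-in for a column family: {row_key: {column_name: value}}.
cf = {}

def record_event(attribute, value, event):
    """Write one event under a row keyed "attribute/value".

    The column name is a type 1 (time-based) UUID so columns sort by time;
    the column value is the whole event packed as JSON.
    """
    row_key = "%s/%s" % (attribute, value)          # e.g. "colour/red"
    cf.setdefault(row_key, {})[uuid.uuid1()] = json.dumps(event)

record_event("colour", "red", {"user": "u42", "ts": time.time()})
record_event("colour", "red", {"user": "u7",  "ts": time.time()})

# A time-range query then becomes a column slice on a single row
# ("colour/red") between two TimeUUIDs, rather than an aggregation over
# many rows at read time.
```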

CFs have a (potentially) large memory overhead. Use fewer of them and store 
mixed but related content in each. 
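
For instance (a sketch only; the "attribute/value" key convention is just one 
way to do it, not something Cassandra mandates), the per-attribute CFs from the 
original question can collapse into a single CF by pushing the attribute name 
into the row key:

```python
# Rather than one CF per attribute (10-20 CFs, each with its own memtable
# and cache overhead), reuse a single CF and encode the attribute in the
# row key, so "attr1/foo" and "attr17/bar" live side by side in one CF.
def row_key(attribute, value):
    return "%s/%s" % (attribute, value)

print(row_key("attr1", "foo"))    # -> "attr1/foo"
print(row_key("attr17", "bar"))   # -> "attr17/bar"
```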
  
Hope that helps. 
Aaron


On 26 Mar 2011, at 05:38, Saurabh Sehgal wrote:

> Thanks for all the responses. 
> 
> My leading questions then are ->
> 
> - Should I go with the OrderPreservingPartitioner based on timestamps so I 
> can do time range queries? Is this recommended? Are there any special cases 
> regarding load balancing I need to keep in mind? I have read buzz on 
> blogs/forums about how the RandomPartitioner yields better load balancing and 
> that using the OrderPreservingPartitioner is discouraged. Can someone 
> expand/comment on this?
> 
> - Also, let's say I query all partitioned data between timestampuuid1 and 
> timestampuuid2 (over several weeks). This could potentially, in my case, 
> return anywhere from 20 to 30 million records. How would I go about 
> aggregating this data "by hand"? Will this perform?
> 
> Since I am only interested in aggregating over a finite set of 10-20 
> attributes, does it make more sense to have a column family per attribute? 
> In that case, I would not need to do any aggregation, since all the data for 
> an attribute resides in one column family. Is there an upper bound to the 
> number of column families Cassandra currently supports?
> 
> 
> 
> On Fri, Mar 25, 2011 at 7:31 AM, buddhasystem <potek...@bnl.gov> wrote:
> Hello Saurabh,
> 
> I have a similar situation, with a more complex data model, and I do an
> equivalent of map-reduce "by hand". The redeeming value is that you have
> complete freedom in how you hash, and you design the way you store indexes
> and similar structures. If there is a pattern in the data store, you can use
> it to your advantage. In the end, you get good performance.
> 
> --
> View this message in context: 
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/data-aggregation-in-Cassandra-tp6206994p6207879.html
> Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
> Nabble.com.
