Thanks Jonathan. Yes, I did notice the RF issue, and thought that, for example, to get a total salary you'd need to divide the aggregate by RF, something like that.
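A minimal sketch of that RF correction, assuming a naive cluster-wide scan that visits every replica, uniform replication across the keyspace, and no failed writes (all names here are illustrative, not a real Cassandra API):

```python
# If every row lives on RF replicas and a naive scan over all nodes visits
# every replica, each salary is counted RF times; dividing the over-counted
# aggregate by RF recovers the true total.
RF = 3  # assumed replication factor, uniform across the keyspace

def total_salary(replica_scan, rf=RF):
    """Correct an over-counted aggregate by dividing by the replication factor."""
    return sum(replica_scan) / rf

# Two users with salaries 100 and 200; each value appears once per replica:
replica_scan = [100, 200] * RF
print(total_salary(replica_scan))  # -> 300.0
```

This only holds when every row really is replicated exactly RF times; hinted or partially failed writes would skew the result.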
I'll take a look at 1608.
Yang

On Sun, Jun 19, 2011 at 12:12 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
> I'm skeptical that this is the right place to do M/R jobs (multiple
> replicas mean you'll do the work multiple times if you have the same
> code on all nodes... and different code on different nodes could get
> messy fast.)
>
> But the work-in-progress patches on CASSANDRA-1608 include a
> compaction pub/sub component. So you could create a subclass of your
> desired compaction strategy that adds a "notify reduce job" hook and
> try it out that way.
>
> On Sun, Jun 19, 2011 at 1:59 AM, Yang <teddyyyy...@gmail.com> wrote:
> > I realize that the SSTable flush/compaction process is essentially
> > equivalent to the reduce stage of Map-Reduce, since entries with the
> > same key are grouped together.
> >
> > We have felt the need to run MR-style jobs on data already stored in
> > Cassandra, so it would be very useful to provide a hook into the
> > compaction process so that the reduce job can be done there. For
> > example, jobs as simple as dumping out all the keys in the system,
> > or, for a CF with userId as the key and salary as a column,
> > calculating the total salary.
> >
> > This is different from what BRISK does, since BRISK only uses a CF as
> > physical block storage and does not utilize the data already stored
> > in Cassandra, which has been grouped by key.
> >
> > It is possible to come up with some framework that scrapes SSTables
> > to carry out the MR jobs, but the compaction hook seems an easier and
> > faster way to get this done, given the existing systems.
> >
> > Yang
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
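To make the "notify reduce job" idea concrete, here is a hedged sketch of what such a hook might look like. None of these class or method names are real Cassandra APIs; the actual extension point would come from the CASSANDRA-1608 pub/sub component. The point is just that compaction already merges all fragments of a key, which is exactly the grouping a reduce stage needs:

```python
# Hypothetical sketch: a compaction pass that, after merging each key's
# fragments across SSTables, hands the merged (key, columns) pair to a
# user-supplied reducer callback. SSTables are modeled as plain dicts.

class CompactionWithReduceHook:
    def __init__(self, reducer):
        self.reducer = reducer  # called once per key with its merged columns

    def compact(self, sstables):
        """Merge rows by key (the grouping compaction does anyway), then
        notify the reduce job for each fully merged row."""
        merged = {}
        for table in sstables:
            for key, columns in table.items():
                # later SSTables win on column conflicts, mimicking
                # newest-write-wins reconciliation
                merged.setdefault(key, {}).update(columns)
        for key, columns in merged.items():
            self.reducer(key, columns)  # the "notify reduce job" hook
        return merged  # normal compaction output

# Usage: per-user salary extraction, one reduce call per userId
totals = {}

def record_salary(key, cols):
    totals[key] = cols.get("salary", 0)

hook = CompactionWithReduceHook(record_salary)
hook.compact([{"alice": {"salary": 100}}, {"bob": {"salary": 200}}])
# totals -> {"alice": 100, "bob": 200}
```

As Jonathan notes, on a real cluster this hook would fire once per replica, so any cluster-wide aggregate would still need the divide-by-RF correction (or deduplication) discussed above.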