Hey all,

I know there are several tickets in the pipeline that should make it possible
to use secondary indexes to run map reduce jobs that do not have to ingest
the entire dataset, such as:

https://issues.apache.org/jira/browse/CASSANDRA-1600

I ended up creating a sharded secondary index in user space (I just
call it ordered buckets), described here:

http://www.slideshare.net/edwardcapriolo/casbase-presentation/27

Looking at the ordered buckets implementation I realized it is a perfect
candidate for "efficient map reduce" since it is easy to split.
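
To illustrate why an ordered set of buckets is easy to split, here is a
minimal sketch in Java. It assumes the buckets can be addressed by
consecutive integers 0..bucketCount-1; that numbering, the BucketSplitter
class, and the BucketRange type are hypothetical names for illustration and
are not taken from casbase or its OrderedBucketInputFormat. Each range would
correspond to one input split, so a map task only reads the buckets in its
own range.

```java
import java.util.ArrayList;
import java.util.List;

public class BucketSplitter {

    /** Half-open range [start, end) of bucket indexes for one map task. */
    public static final class BucketRange {
        public final int start;
        public final int end;

        BucketRange(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Divide bucketCount ordered buckets into at most numSplits contiguous
     * ranges whose sizes differ by no more than one bucket.
     */
    public static List<BucketRange> split(int bucketCount, int numSplits) {
        if (bucketCount < 0 || numSplits <= 0) {
            throw new IllegalArgumentException("bucketCount >= 0 and numSplits > 0 required");
        }
        List<BucketRange> ranges = new ArrayList<>();
        int base = bucketCount / numSplits;   // minimum buckets per split
        int extra = bucketCount % numSplits;  // first 'extra' splits get one more
        int start = 0;
        for (int i = 0; i < numSplits && start < bucketCount; i++) {
            int size = base + (i < extra ? 1 : 0);
            ranges.add(new BucketRange(start, start + size));
            start += size;
        }
        return ranges;
    }
}
```

For example, split(10, 3) yields the ranges [0,4), [4,7) and [7,10). A job
that only cares about a subset of the data would pass in just the buckets
covering that subset rather than the whole column family.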

A unit test of that implementation is here:

https://github.com/edwardcapriolo/casbase/blob/master/src/test/java/com/jointhegrid/casbase/hadoop/OrderedBucketInputFormatTest.java

With this you can currently do efficient map reduce on Cassandra data while
waiting for other integrated solutions to come along.
