Hey all, I know there are several tickets in the pipe that should make it possible to use secondary indexes to run map reduce jobs that do not have to ingest the entire dataset, such as:
https://issues.apache.org/jira/browse/CASSANDRA-1600

I ended up creating a sharded secondary index in user space (I just call it ordered buckets), described here:
http://www.slideshare.net/edwardcapriolo/casbase-presentation/27

Looking at the ordered buckets implementation, I realized it is a perfect candidate for "efficient map reduce" since it is easy to split. A unit test of that implementation is here:
https://github.com/edwardcapriolo/casbase/blob/master/src/test/java/com/jointhegrid/casbase/hadoop/OrderedBucketInputFormatTest.java

With this you can currently do efficient map reduce on Cassandra data while waiting for other integrated solutions to come along.
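
To illustrate why an ordered-bucket layout is easy to split, here is a minimal sketch. It is not the casbase code: the class `BucketSplitter`, its `select` and `split` methods, and the date-style bucket names are all hypothetical. The assumption is that each bucket is addressed by a name that sorts in order, so a job can pick only the buckets in the range it cares about and hand contiguous groups of them to separate map tasks.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the casbase API: shows how a list of ordered
// bucket names could be narrowed to a range and divided among map tasks.
public class BucketSplitter {

    // Keep only the buckets whose name falls in [from, to] (inclusive).
    // Assumes bucket names compare in the same order as the data they index.
    public static List<String> select(List<String> orderedBuckets, String from, String to) {
        List<String> selected = new ArrayList<>();
        for (String bucket : orderedBuckets) {
            if (bucket.compareTo(from) >= 0 && bucket.compareTo(to) <= 0) {
                selected.add(bucket);
            }
        }
        return selected;
    }

    // Divide the buckets into at most numSplits contiguous groups whose
    // sizes differ by no more than one. Each group would feed one map task.
    public static List<List<String>> split(List<String> buckets, int numSplits) {
        if (numSplits < 1) {
            throw new IllegalArgumentException("numSplits must be >= 1");
        }
        List<List<String>> splits = new ArrayList<>();
        int n = buckets.size();
        int base = n / numSplits;   // minimum buckets per split
        int extra = n % numSplits;  // the first 'extra' splits get one more
        int start = 0;
        for (int i = 0; i < numSplits && start < n; i++) {
            int size = base + (i < extra ? 1 : 0);
            splits.add(new ArrayList<>(buckets.subList(start, start + size)));
            start += size;
        }
        return splits;
    }

    public static void main(String[] args) {
        // Made-up bucket names, one per day.
        List<String> all = List.of(
            "2011-01-01", "2011-01-02", "2011-01-03",
            "2011-01-04", "2011-01-05", "2011-01-06");

        // Only the buckets in the requested range are read at all.
        List<String> wanted = select(all, "2011-01-02", "2011-01-06");

        // Each printed line is the work for one hypothetical map task.
        for (List<String> split : split(wanted, 2)) {
            System.out.println(split);
        }
    }
}
```

In a real Hadoop job, each group would presumably be wrapped in an InputSplit by a custom InputFormat; the linked unit test is the place to see how casbase actually does it.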