Let me start out by saying that I think I'm going to have to write a patch
to get what I want, but I'm fine with that.  I just wanted to check here
first to make sure that I'm not missing something obvious.

I'd like to be able to run a MapReduce job that takes a value in an indexed
column as a parameter and uses that value to select the data the job operates
on.  Right now, it looks like this isn't possible because
org.apache.cassandra.hadoop.ColumnFamilyRecordReader will only fetch data
with get_range_slices, not get_indexed_slices.
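
To make that concrete (using the country example I describe below), the query
I'd want the record reader to be able to issue looks roughly like the sketch
here.  It's only an illustration against the Thrift interface that exposes
get_indexed_slices; the column family, column name, and value are placeholders:

import java.nio.ByteBuffer;
import java.util.List;

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.ColumnParent;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.cassandra.thrift.IndexClause;
import org.apache.cassandra.thrift.IndexExpression;
import org.apache.cassandra.thrift.IndexOperator;
import org.apache.cassandra.thrift.KeySlice;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;
import org.apache.cassandra.utils.ByteBufferUtil;

public class IndexedSliceSketch
{
    // client is assumed to be an already-connected Cassandra.Client with
    // the keyspace set via set_keyspace()
    static List<KeySlice> fetchByCountry(Cassandra.Client client) throws Exception
    {
        // country = 'UK' on an indexed column
        IndexExpression expr = new IndexExpression(ByteBufferUtil.bytes("country"),
                                                   IndexOperator.EQ,
                                                   ByteBufferUtil.bytes("UK"));
        IndexClause clause = new IndexClause();
        clause.addToExpressions(expr);
        clause.setStart_key(ByteBufferUtil.EMPTY_BYTE_BUFFER);
        clause.setCount(1000); // batch size, analogous to the range batch size

        // grab every column of each matching row
        SlicePredicate predicate = new SlicePredicate();
        predicate.setSlice_range(new SliceRange(ByteBufferUtil.EMPTY_BYTE_BUFFER,
                                                ByteBufferUtil.EMPTY_BYTE_BUFFER,
                                                false, Integer.MAX_VALUE));

        ColumnParent parent = new ColumnParent("mycolumnfamily");
        return client.get_indexed_slices(parent, clause, predicate,
                                         ConsistencyLevel.ONE);
    }
}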

An example might be useful.  Let's say I want to run a MapReduce job over
all the data for a particular country.  Right now I can do this in MapReduce
by simply having the mapper discard all the data that is not from the country
I want to process (a rough sketch of that approach is below).  I suspect the
job would be faster if I could shrink its input up front by selecting only
the data I want via secondary indexes in Cassandra.
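
For reference, the discard approach I'm using today looks roughly like the
mapper below.  This is only a sketch; the column name, value, and output
types are just for illustration:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.SortedMap;

import org.apache.cassandra.db.IColumn;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CountryFilterMapper
        extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, Text>
{
    @Override
    protected void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns,
                       Context context) throws IOException, InterruptedException
    {
        IColumn country = columns.get(ByteBufferUtil.bytes("country"));
        if (country == null || !"UK".equals(ByteBufferUtil.string(country.value())))
            return; // every non-UK row is still read from Cassandra, then thrown away

        // ... real work happens only for the rows we actually care about ...
        context.write(new Text(ByteBufferUtil.string(key)), new Text("processed"));
    }
}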

So, first question: Am I wrong?  Is there some clever way to enable the
behavior I'm looking for (without modifying the Cassandra codebase)?

Second question: If I'm not wrong, should I open a JIRA issue for this and
start coding up this feature?

Finally, the real reason I want to get this working is so that I can
enhance the CassandraStorage pig loadfunc so that it can take query
parameters in the URL string that is used to specify the keyspace and
column family.  So for example, you might load data into Pig with this
syntax:

rows = LOAD 'cassandra://mykeyspace/mycolumnfamily?country=UK' using
CassandraStorage();

I'd like to get some feedback on that syntax.
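
For what it's worth, I imagine the loadfunc pulling those parameters out of
the location string with something like the helper below (purely hypothetical,
just to show the shape of it) and then turning them into index expressions
for the input format:

import java.net.URI;
import java.util.HashMap;
import java.util.Map;

public class LocationParser
{
    // e.g. cassandra://mykeyspace/mycolumnfamily?country=UK  ->  {country=UK}
    public static Map<String, String> queryParams(String location) throws Exception
    {
        URI uri = new URI(location);
        Map<String, String> params = new HashMap<String, String>();
        String query = uri.getQuery();
        if (query != null)
        {
            for (String pair : query.split("&"))
            {
                String[] kv = pair.split("=", 2);
                params.put(kv[0], kv.length > 1 ? kv[1] : "");
            }
        }
        return params;
    }
}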

Thanks,
Matt Kennedy
