Let me start out by saying that I think I'm going to have to write a patch to get what I want, but I'm fine with that. I just wanted to check here first to make sure that I'm not missing something obvious.
I'd like to be able to run a MapReduce job that takes a value in an indexed column as a parameter, and uses that value to select the data the job operates on. Right now it looks like this isn't possible, because org.apache.cassandra.hadoop.ColumnFamilyRecordReader will only fetch data with get_range_slices, not get_indexed_slices.

An example might be useful. Say I want to run a MapReduce job over all the data for a particular country. Today I can do this in MapReduce by simply discarding all the data that is not from the country I want to process. I suspect it would be faster if I could shrink the size of the MapReduce job by selecting only the data I want, using secondary indexes in Cassandra.

So, first question: am I wrong? Is there some clever way to enable the behavior I'm looking for (without modifying the Cassandra codebase)?

Second question: if I'm not wrong, should I open a JIRA issue for this and start coding up the feature?

Finally, the real reason I want to get this working is so that I can enhance the CassandraStorage Pig loadfunc so that it can take query parameters in the URL string used to specify the keyspace and column family. So, for example, you might load data into Pig with this syntax:

rows = LOAD 'cassandra://mykeyspace/mycolumnfamily?country=UK' using CassandraStorage();

I'd like to get some feedback on that syntax.

Thanks,
Matt Kennedy
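P.S. To make the proposal concrete, here is a minimal sketch of how that URL might be split into keyspace, column family, and index-filter parameters using plain java.net.URI. The class and field names are hypothetical, purely for illustration; the real patch would live inside CassandraStorage and feed the filters into an IndexClause:

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: parses a location string like
//   cassandra://mykeyspace/mycolumnfamily?country=UK
// into keyspace, column family, and a map of filter parameters
// that could later be turned into index expressions.
public class CassandraLocation {
    public final String keyspace;
    public final String columnFamily;
    public final Map<String, String> filters = new HashMap<String, String>();

    public CassandraLocation(String location) {
        URI uri = URI.create(location);
        keyspace = uri.getHost();                       // "mykeyspace"
        columnFamily = uri.getPath().substring(1);      // drop leading '/'
        String query = uri.getQuery();                  // "country=UK" or null
        if (query != null) {
            for (String pair : query.split("&")) {
                String[] kv = pair.split("=", 2);
                filters.put(kv[0], kv.length > 1 ? kv[1] : "");
            }
        }
    }
}
```

With this scheme, a location with no query string behaves exactly like today's loader, and each key=value pair would map to one equality expression against a secondary index.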