Right, so I'm interpreting silence as confirmation on all points. I opened:

https://issues.apache.org/jira/browse/CASSANDRA-2245
https://issues.apache.org/jira/browse/CASSANDRA-2246

to work on these.

On Wed, Feb 23, 2011 at 5:31 PM, Matt Kennedy <stinkym...@gmail.com> wrote:
> Let me start out by saying that I think I'm going to have to write a patch
> to get what I want, but I'm fine with that. I just wanted to check here
> first to make sure that I'm not missing something obvious.
>
> I'd like to be able to run a MapReduce job that takes a value in an indexed
> column as a parameter, and use that to select the data that the MapReduce
> job operates on. Right now, it looks like this isn't possible because
> org.apache.cassandra.hadoop.ColumnFamilyRecordReader will only fetch data
> with get_range_slices, not get_indexed_slices.
>
> An example might be useful. Let's say I want to run a MapReduce job over
> all the data for a particular country. Right now I can do this in
> MapReduce by simply discarding all the data that is not from the country I
> want to process. I suspect it will be faster if I can reduce the size of the
> MapReduce job by only selecting the data I want by using secondary indexes
> in Cassandra.
>
> So, first question: Am I wrong? Is there some clever way to enable the
> behavior I'm looking for (without modifying the Cassandra codebase)?
>
> Second question: If I'm not wrong, should I open a JIRA issue for this and
> start coding up this feature?
>
> Finally, the real reason that I want to get this working is so that I can
> enhance the CassandraStorage Pig loadfunc so that it can take query
> parameters in the URL string that is used to specify the keyspace and
> column family. So for example, you might load data into Pig with this
> syntax:
>
> rows = LOAD 'cassandra://mykeyspace/mycolumnfamily?country=UK' using
> CassandraStorage();
>
> I'd like to get some feedback on that syntax.
>
> Thanks,
> Matt Kennedy
>
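
To make the proposed URL syntax concrete, here is a rough sketch of how a location string like the one above could be split into a keyspace, a column family, and a list of equality filters. It is only an illustration in Python, not the CassandraStorage code: the function name parse_cassandra_location is hypothetical, and I am assuming that every query parameter means "indexed column equals value", which is the kind of expression one would then hand to get_indexed_slices instead of scanning with get_range_slices.

```python
from urllib.parse import urlsplit, parse_qsl


def parse_cassandra_location(location):
    """Split a cassandra:// location into (keyspace, column_family, filters).

    Hypothetical helper for illustration only. Each query parameter is
    assumed to be an equality test on an indexed column.
    """
    parts = urlsplit(location)
    if parts.scheme != "cassandra":
        raise ValueError("expected a cassandra:// location, got: %r" % location)

    keyspace = parts.netloc
    column_family = parts.path.strip("/")
    if not keyspace or not column_family or "/" in column_family:
        raise ValueError("expected cassandra://<keyspace>/<columnfamily>")

    # parse_qsl keeps parameter order and decodes percent-escapes,
    # e.g. "country=UK" becomes [("country", "UK")].
    filters = parse_qsl(parts.query, keep_blank_values=True)
    return keyspace, column_family, filters


if __name__ == "__main__":
    ks, cf, filters = parse_cassandra_location(
        "cassandra://mykeyspace/mycolumnfamily?country=UK"
    )
    print(ks, cf, filters)
```

A location without a query string yields an empty filter list, so in this sketch the existing full-scan behaviour would stay the default and the index path would only be taken when at least one parameter is present.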