We think we're running into a situation where we've deleted all the columns on several thousand rows, but those rows still show up in the results of our Pig scripts. We believe that's a product of range ghosts, since ColumnFamilyRecordReader uses getRangeSlices. That's likely to trip up other people as well, and I think we have something that might address it.
What if we added a Hadoop-job-specific option that tells the CFRR to filter out returned rows that don't contain any columns? It's true that core Cassandra used to do this and the behavior was removed because of the performance penalty, but with Hadoop-style loads latency isn't as big a concern, and this would be a job-specific option anyway. Also, CFRR already accepts a SlicePredicate; besides suppressing range ghosts, the option could skip rows that have no data for that predicate, which causes the same kind of problem and would be a nice feature as well. True, the person writing the MapReduce job or the Pig script (or whatever) could deal with it at that level, but this is core enough, and could be optional, so that people wouldn't have to scatter checks for keys without any columns all over their jobs.

Would such an option be okay to add to the Hadoop config and to the CFRR? A rough sketch of what I have in mind is below.

Jeremy
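To make the question concrete, here's roughly the shape I'm picturing. The property name cassandra.input.skip.empty.rows, the EmptyRowFilterSketch class, and the RowInspector hook are all made-up names for illustration, not existing API; the real change would live inside ColumnFamilyRecordReader plus a ConfigHelper-style setter.

    // Rough sketch only: the property name, class, and RowInspector hook are
    // hypothetical illustrations, not existing Cassandra or Hadoop API.
    import java.util.Iterator;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;

    public class EmptyRowFilterSketch
    {
        // Hypothetical job-level switch; a real patch would probably expose it via ConfigHelper.
        public static final String SKIP_EMPTY_ROWS = "cassandra.input.skip.empty.rows";

        public static void setSkipEmptyRows(Configuration conf, boolean skip)
        {
            conf.setBoolean(SKIP_EMPTY_ROWS, skip);
        }

        public static boolean getSkipEmptyRows(Configuration conf)
        {
            return conf.getBoolean(SKIP_EMPTY_ROWS, false); // default false keeps today's behavior
        }

        // Abstraction so this sketch doesn't depend on the thrift KeySlice type;
        // inside CFRR it would just be the size of the row's column list.
        public interface RowInspector<R>
        {
            int columnCount(R row);
        }

        // Drop rows that came back with no columns for the job's SlicePredicate.
        // In CFRR this would run over each batch returned by the range query
        // before the rows are handed to the map tasks.
        public static <R> void filterEmptyRows(List<R> rows, RowInspector<R> inspector, Configuration conf)
        {
            if (!getSkipEmptyRows(conf))
                return;

            for (Iterator<R> it = rows.iterator(); it.hasNext();)
            {
                if (inspector.columnCount(it.next()) == 0)
                    it.remove(); // range ghost, or no data matching the predicate
            }
        }
    }

Defaulting the flag to false would keep the current behavior, so nobody's existing jobs change unless they opt in.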