We think we're running into a situation where we've deleted all the columns on several thousand rows, but those rows still show up in the results of our Pig scripts. We believe that's a product of range ghosts, since ColumnFamilyRecordReader uses getRangeSlices. That's likely to trip up other people as well, and I think we have something that might address it.
What if we added a Hadoop-job-specific option that tells the CFRR to filter out returned rows that don't contain any columns? It's true that core Cassandra used to do this and the behavior was removed because of the performance penalty, but with Hadoop-style loads latency isn't as big a concern, and this would be a job-specific option anyway. Also, CFRR already accepts a SlicePredicate; besides suppressing range ghosts, the option could skip rows that have no data for that predicate, which causes the same kind of problem and would be a nice feature as well. True, the person writing the MapReduce job or the Pig script (or whatever) could deal with it at that level, but this is core enough, and could be optional, so that people wouldn't have to scatter checks for keys without any columns all over their jobs.

Would such an option be okay to add to the Hadoop config and to the CFRR? A rough sketch of what I have in mind is below.

Jeremy
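To make the question concrete, here's roughly the shape I'm picturing. The property name cassandra.input.skip.empty.rows, the EmptyRowFilterSketch class, and the RowInspector hook are all made-up names for illustration, not existing API; the real change would live inside ColumnFamilyRecordReader plus a ConfigHelper-style setter.

    // Rough sketch only: the property name, class, and RowInspector hook are
    // hypothetical illustrations, not existing Cassandra or Hadoop API.
    import java.util.Iterator;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;

    public class EmptyRowFilterSketch
    {
        // Hypothetical job-level switch; a real patch would probably expose it via ConfigHelper.
        public static final String SKIP_EMPTY_ROWS = "cassandra.input.skip.empty.rows";

        public static void setSkipEmptyRows(Configuration conf, boolean skip)
        {
            conf.setBoolean(SKIP_EMPTY_ROWS, skip);
        }

        public static boolean getSkipEmptyRows(Configuration conf)
        {
            return conf.getBoolean(SKIP_EMPTY_ROWS, false); // default false keeps today's behavior
        }

        // Abstraction so this sketch doesn't depend on the thrift KeySlice type;
        // inside CFRR it would just be the size of the row's column list.
        public interface RowInspector<R>
        {
            int columnCount(R row);
        }

        // Drop rows that came back with no columns for the job's SlicePredicate.
        // In CFRR this would run over each batch returned by the range query
        // before the rows are handed to the map tasks.
        public static <R> void filterEmptyRows(List<R> rows, RowInspector<R> inspector, Configuration conf)
        {
            if (!getSkipEmptyRows(conf))
                return;

            for (Iterator<R> it = rows.iterator(); it.hasNext();)
            {
                if (inspector.columnCount(it.next()) == 0)
                    it.remove(); // range ghost, or no data matching the predicate
            }
        }
    }

Defaulting the flag to false would keep the current behavior, so nobody's existing jobs change unless they opt in.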