Updating (as opposed to just setting) Cassandra data via Hadoop

Mark Schnitzius Tue, 04 May 2010 20:53:59 -0700

I have a situation where I need to accumulate values in Cassandra on an
ongoing basis.  Atomic increments are apparently still in the works (see
https://issues.apache.org/jira/browse/CASSANDRA-721), so for the time being
I'll be using Hadoop, feeding both the existing values and the new values
into an M/R process where they can be combined and written back out to
Cassandra.
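
To make the combine step concrete, here is a minimal plain-Java sketch of
what I have in mind.  It uses no Hadoop or Cassandra classes, and every name
in it (AccumulateSketch, accumulate, reduce) is hypothetical.  It assumes the
values are simple counters that can be summed per key; grouping by key stands
in for the shuffle, and reduce() stands in for the reducer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: none of these names come from Hadoop or Cassandra.
final class AccumulateSketch {

    // Stand-in for the reducer: sum every value seen for one key,
    // regardless of whether it came from the existing data or the new data.
    static long reduce(Iterable<Long> values) {
        long total = 0L;
        for (long v : values) {
            total += v;
        }
        return total;
    }

    // Stand-in for shuffle + reduce: group both inputs by key, then sum.
    static Map<String, Long> accumulate(Map<String, Long> existing,
                                        List<Map.Entry<String, Long>> incoming) {
        Map<String, List<Long>> grouped = new TreeMap<>();
        for (Map.Entry<String, Long> e : existing.entrySet()) {
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        for (Map.Entry<String, Long> e : incoming) {
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        Map<String, Long> merged = new TreeMap<>();
        for (Map.Entry<String, List<Long>> g : grouped.entrySet()) {
            merged.put(g.getKey(), reduce(g.getValue()));
        }
        return merged;
    }

    public static void main(String[] args) {
        // Existing totals, as they might be read back from the store.
        Map<String, Long> existing = new TreeMap<>();
        existing.put("clicks", 10L);
        existing.put("views", 7L);

        // Newly arrived increments, possibly several per key.
        List<Map.Entry<String, Long>> incoming = new ArrayList<>();
        incoming.add(Map.entry("clicks", 2L));
        incoming.add(Map.entry("signups", 1L));

        System.out.println(accumulate(existing, incoming));
    }
}
```

One consequence of this design is that the job is not idempotent: running it
twice over the same batch of new data would add those increments twice, so
each batch has to be consumed exactly once.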


The approach I'm taking is to use Hadoop's CombineFileInputFormat to blend
the existing data (using Cassandra's ColumnFamilyInputFormat) with the newly
incoming data (using something like Hadoop's SequenceFileInputFormat).

Has anyone here tried this, and if so, did you run into any issues?  I'm
worried because CombineFileInputFormat places restrictions on combining
splits from different pools, so I don't know how this will play out with
data coming from both Cassandra and HDFS.  The other option, I suppose, is
to use a separate M/R process to replicate the Cassandra data onto HDFS
first, but I'd rather avoid the extra step and the duplicated storage.

Also, if you've tackled a similar situation in the past using a different
approach, I'd be keen to hear about it...


Thanks
Mark
