It sounds reasonable to me, with the caveat that I have only limited Hadoop knowledge.
Please write up a blog post when you get it working. :)

On Wed, May 5, 2010 at 10:44 PM, Mark Schnitzius
<mark.schnitz...@cxense.com> wrote:
> Apologies, Hadoop recently deprecated a whole bunch of classes and I
> misunderstood how the new ones work.
> What I'll be doing is creating an InputFormat class that
> uses ColumnFamilyInputFormat to get splits from the existing Cassandra
> data, and merges them with splits from a SequenceFileInputFormat.
> Is this a reasonable approach, or is there a better, more standard way
> to update Cassandra data with new Hadoop data? It may just boil down to
> a design decision, but it would seem to me that this is a problem that
> would have been encountered many times before...
>
> Thanks
> Mark
>
>
> On Thu, May 6, 2010 at 12:23 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
>>
>> I'm a little confused. CombineFileInputFormat is designed to combine
>> multiple small input splits into one larger one. It's not for merging
>> data (that needs to be part of the reduce phase). Maybe I'm
>> misunderstanding what you're saying.
>>
>> On Tue, May 4, 2010 at 10:53 PM, Mark Schnitzius
>> <mark.schnitz...@cxense.com> wrote:
>> > I have a situation where I need to accumulate values in Cassandra on
>> > an ongoing basis. Atomic increments are still in the works apparently
>> > (see https://issues.apache.org/jira/browse/CASSANDRA-721), so for the
>> > time being I'll be using Hadoop, and attempting to feed both the
>> > existing values and the new values into a M/R process where they can
>> > be combined and written back out to Cassandra.
>> > The approach I'm taking is to use Hadoop's CombineFileInputFormat to
>> > blend the existing data (using Cassandra's ColumnFamilyInputFormat)
>> > with the newly incoming data (using something like Hadoop's
>> > SequenceFileInputFormat).
>> > I was just wondering, has anyone here tried this, and were there issues?
>> > I'm worried because CombineFileInputFormat has restrictions around
>> > splits being from different pools, so I don't know how this will play
>> > out with data from both Cassandra and HDFS. The other option, I
>> > suppose, is to use a separate M/R process to replicate the data onto
>> > HDFS first, but I'd rather avoid the extra step and duplication of
>> > storage.
>> > Also, if you've tackled a similar situation in the past using a
>> > different approach, I'd be keen to hear about it...
>> >
>> > Thanks
>> > Mark
>>
>>
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of Riptano, the source for professional Cassandra support
>> http://riptano.com

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com
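[Editor's note] The custom InputFormat Mark describes, one that asks ColumnFamilyInputFormat and SequenceFileInputFormat for their splits and returns the union, might look roughly like the sketch below. The `Split`, `SplitSource`, and `MergingSplitSource` types here are simplified hypothetical stand-ins, not Hadoop's actual API, so the pattern can be shown self-contained; a real implementation would delegate to the two Hadoop classes and would also need each split to remember which delegate produced it, so the correct RecordReader can be constructed later.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for Hadoop's InputSplit: here each split just
// reports which backing store it came from.
interface Split {
    String source();
}

// Simplified stand-in for the getSplits() side of Hadoop's InputFormat.
interface SplitSource {
    List<Split> getSplits();
}

// The merging pattern: ask every delegate for its splits and return the
// concatenation. In the real job the delegates would be Cassandra's
// ColumnFamilyInputFormat and Hadoop's SequenceFileInputFormat.
class MergingSplitSource implements SplitSource {
    private final List<SplitSource> delegates;

    MergingSplitSource(List<SplitSource> delegates) {
        this.delegates = delegates;
    }

    @Override
    public List<Split> getSplits() {
        List<Split> all = new ArrayList<>();
        for (SplitSource delegate : delegates) {
            all.addAll(delegate.getSplits());
        }
        return all;
    }
}

public class Demo {
    public static void main(String[] args) {
        // One pretend Cassandra split, two pretend HDFS splits.
        SplitSource cassandra =
            () -> List.of((Split) () -> "cassandra");
        SplitSource hdfs =
            () -> List.of((Split) () -> "hdfs", (Split) () -> "hdfs");

        MergingSplitSource merged =
            new MergingSplitSource(List.of(cassandra, hdfs));

        System.out.println(merged.getSplits().size()); // prints 3
    }
}
```

Note this only unifies the *input* side; as Jonathan points out, actually combining old and new values for the same key still has to happen in the reduce phase.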