It sounds reasonable to me, with the caveat that I have only limited
Hadoop knowledge.

Please write up a blog post when you get it working. :)
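
For illustration only, here is a minimal sketch of the merge step discussed in the quoted thread: records from both inputs are passed through unchanged and the accumulation happens in the reduce phase. It is a plain-Java simulation, not Hadoop or Cassandra code. The Record type, the class name, and the sample values are all hypothetical, and a real job would get its records from the two input formats instead of in-memory lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of a reduce-side merge; no Hadoop or Cassandra classes are used.
final class AccumulateSketch {

    // Hypothetical record type: a row key plus a numeric value to accumulate.
    static final class Record {
        final String key;
        final long value;

        Record(String key, long value) {
            this.key = key;
            this.value = value;
        }
    }

    // Stand-in for the map phase: records from both sources pass through
    // unchanged, so the two inputs only need to agree on key and value types.
    static List<Record> union(List<Record> existing, List<Record> incoming) {
        List<Record> all = new ArrayList<>(existing);
        all.addAll(incoming);
        return all;
    }

    // Stand-in for the reduce phase: group by key and sum the values.
    static Map<String, Long> reduce(List<Record> records) {
        Map<String, Long> totals = new TreeMap<>();
        for (Record r : records) {
            totals.merge(r.key, r.value, Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Assumed sample data: "existing" plays the role of rows already stored,
        // "incoming" the role of newly arrived values.
        List<Record> existing = Arrays.asList(new Record("a", 10L), new Record("b", 7L));
        List<Record> incoming = Arrays.asList(new Record("a", 5L), new Record("c", 3L));

        // Prints {a=15, b=7, c=3}
        System.out.println(reduce(union(existing, incoming)));
    }
}
```

The point of the sketch is only that the combining logic lives in the reduce step; how the two inputs are presented to the job is a separate question.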

On Wed, May 5, 2010 at 10:44 PM, Mark Schnitzius wrote:
> Apologies, Hadoop recently deprecated a whole bunch of classes and I
> misunderstood how the new ones work.
> What I'll be doing is creating an InputFormat class that
> uses ColumnFamilyInputFormat to get splits from the existing Cassandra data,
> and merges them with splits from a SequenceFileInputFormat.
> Is this a reasonable approach, or is there a better, more standard way to
> update Cassandra data with new Hadoop data?  It may just boil down to a
> design decision, but it would seem to me that this would be a problem that
> would've been encountered many times before...
> Thanks
> Mark
> On Thu, May 6, 2010 at 12:23 AM, Jonathan Ellis wrote:
>> I'm a little confused.  CombineFileInputFormat is designed to combine
>> multiple small input splits into one larger one.  It's not for merging
>> data (that needs to be part of the reduce phase).  Maybe I'm
>> misunderstanding what you're saying.
>> On Tue, May 4, 2010 at 10:53 PM, Mark Schnitzius wrote:
>> > I have a situation where I need to accumulate values in Cassandra on an
>> > ongoing basis.  Atomic increments are still in the works apparently
>> > (see so for the time being I'll be using Hadoop, and attempting to feed
>> > in both the existing values and the new values to an M/R process where
>> > they can be combined together and written back out to Cassandra.
>> > The approach I'm taking is to use Hadoop's CombineFileInputFormat to
>> > blend the existing data (using Cassandra's ColumnFamilyInputFormat)
>> > with the newly incoming data (using something like Hadoop's
>> > SequenceFileInputFormat).
>> > I was just wondering, has anyone here tried this, and were there issues?
>> > I'm worried because the CombineFileInputFormat has restrictions around
>> > splits being from different pools so I don't know how this will play out
>> > with data from both Cassandra and HDFS.  The other option, I suppose, is
>> > to use a separate M/R process to replicate the data onto HDFS first, but
>> > I'd rather avoid the extra step and duplication of storage.
>> > Also, if you've tackled a similar situation in the past using a
>> > different approach, I'd be keen to hear about it...
>> >
>> > Thanks
>> > Mark
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of Riptano, the source for professional Cassandra support

Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
