Ian: I think that as get_range_slice gets faster, the approach that Mark was 
heading toward may be considerably more efficient than reading the old value in 
the OutputFormat.

Mark: Reading all of the data you want to update out of Cassandra using the 
InputFormat, merging it with (tagged) new data, and then performing 
deletes/inserts (without reads) in a Cassandra OutputFormat makes a lot of 
sense to me. It gives you much better data locality: all reads happen in a 
streaming fashion (mostly: this is where we need improvement for 
get_range_slices) at the beginning of the job.


-----Original Message-----
From: "Ian Kallen" <spidaman.l...@gmail.com>
Sent: Thursday, May 6, 2010 5:14pm
To: user@cassandra.apache.org
Subject: Re: Updating (as opposed to just setting) Cassandra data via Hadoop

I have inputs that are text logs and I wrote a Cassandra OutputFormat, the
reducers read the old values from their respective column families,
increment the counts and write back the new values. Since all of the writes
are done by the hadoop jobs and we're not running multiple jobs
concurrently, the writes aren't clobbering any values except the ones from
the prior hadoop run. If/when atomic increments are available, we'd be able
to run concurrent log processing jobs for but for now, this seems to work. I
think the biggest risk is that a reduce task fails, hadoop restarts it and
the replacement task re-increments the values. We're going to wait and see
to determine how much of a problem this is in practice.
-Ian

On Tue, May 4, 2010 at 8:53 PM, Mark Schnitzius
<mark.schnitz...@cxense.com>wrote:

> I have a situation where I need to accumulate values in Cassandra on an
> ongoing basis.  Atomic increments are still in the works apparently (see
> https://issues.apache.org/jira/browse/CASSANDRA-721) so for the time being
> I'll be using Hadoop, and attempting to feed in both the existing values and
> the new values to a M/R process where they can be combined together and
> written back out to Cassandra.
>
> The approach I'm taking is to use Hadoop's CombineFileInputFormat to blend
> the existing data (using Cassandra's ColumnFamilyInputFormat) with the newly
> incoming data (using something like Hadoop's SequenceFileInputFormat).
>
> I was just wondering, has anyone here tried this, and were there issues?
>  I'm worried because the CombineFileInputFormat has restrictions around
> splits being from different pools so I don't know how this will play out
> with data from both Cassandra and HDFS.  The other option, I suppose, is to
> use a separate M/R process to replicate the data onto HDFS first, but I'd
> rather avoid the extra step and duplication of storage.
>
> Also, if you've tackled a similar situation in the past using a different
> approach, I'd be keen to hear about it...
>
>
> Thanks
> Mark
>


Reply via email to