Hi Matteo,
* Hadoop MapReduce can talk to Cassandra and process the data just like
other input formats do from HDFS. But I would not recommend treating
Cassandra as a first-class replacement for HDFS; they are two very
different beasts. It will most likely always be a lot faster to let
MapReduce read data from HDFS. If you are going to run many jobs over
the same data from Cassandra, I would recommend first running a MapReduce
job that just copies the data to HDFS.
* The data is fetched from Cassandra using Thrift, so you don't have to
run the Hadoop nodes on the same machines as Cassandra.
* The input format will try to read from the local node if possible.
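For reference, here is a minimal sketch of a job driver in the spirit of
the 0.6 contrib/word_count example. The keyspace, column family, and
column names are made up for illustration, and the exact ConfigHelper
calls should be checked against the contrib source for your version:

```java
import java.util.Arrays;

import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CassandraJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "cassandra-mapreduce");
        job.setJarByClass(CassandraJobDriver.class);

        // Read input splits from Cassandra instead of HDFS.
        job.setInputFormatClass(ColumnFamilyInputFormat.class);

        // Hypothetical keyspace/column family names for illustration.
        ConfigHelper.setColumnFamily(job.getConfiguration(),
                                     "MyKeyspace", "MyColumnFamily");

        // Tell the input format which columns to fetch per row.
        SlicePredicate predicate = new SlicePredicate()
            .setColumn_names(Arrays.asList("text".getBytes()));
        ConfigHelper.setSlicePredicate(job.getConfiguration(), predicate);

        // Mapper/reducer classes and output settings would go here,
        // exactly as in an ordinary Hadoop job.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The mapper then receives each row's key plus the sliced columns as its
input, and everything downstream (reducer, output format) is plain Hadoop.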
/Johan
Matteo Caprari wrote:
Hi.
I've tried the mapreduce example in 0.6 contrib/wordcount and it
worked very well.
I have a shallow understanding of both worlds, so pardon my questions:
Is the integration with Hadoop just 'semantic' (i.e. the map/reduce API
is only used as a query abstraction), or is it 'structural' (i.e.
Cassandra can 'talk to Hadoop' and replace HDFS as an input source)?
In practice:
- If I want to run a distributed MapReduce job on Cassandra, does my
Cassandra cluster have to be a Hadoop cluster as well?
- Do I get data locality optimization? I reckon Cassandra can in
principle figure out where it is best to execute a
SlicePredicate/Mapper,
but to do so it would have to take over some of the responsibilities of
Hadoop's JobTracker. Does it?
Thanks.