Hi Matteo,

* Hadoop MapReduce can talk to Cassandra and process the data just as other input formats do with data from HDFS. But I would not recommend treating Cassandra as a first-class replacement for HDFS; they are two very different beasts. It will most likely always be a lot faster to let MapReduce read data from HDFS. If you are going to run many jobs over the same Cassandra data, I would recommend first running a MapReduce job that just copies the data to HDFS.

* The data is fetched from Cassandra over Thrift, so the Hadoop nodes don't have to run on the same machines as the Cassandra nodes.

* The input format will try to read from the local node if possible.
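
To illustrate the last point, here is a minimal sketch of the "prefer the local node" idea. It is not the actual Cassandra input format code; the function name `pick_endpoint` and both parameters are made up for illustration. It assumes each input split carries a list of replica hosts and that the task knows the addresses of the machine it runs on.

```python
def pick_endpoint(replica_hosts, local_addresses):
    """Choose which replica to read a split from, preferring a local one.

    replica_hosts   -- hosts that hold the data for this split (hypothetical)
    local_addresses -- addresses of the machine running the task (hypothetical)
    """
    if not replica_hosts:
        raise ValueError("split has no replica hosts")

    local = set(local_addresses)
    # Keep the replica order, but take the first replica that is on this machine.
    for host in replica_hosts:
        if host in local:
            return host

    # No local replica: fall back to a remote read from the first replica.
    return replica_hosts[0]
```

When Hadoop and Cassandra share machines, a call such as `pick_endpoint(["10.0.0.2", "10.0.0.3"], ["127.0.0.1", "10.0.0.3"])` would pick the local replica; on a separate Hadoop cluster it would fall back to a remote read over Thrift.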

/Johan

Matteo Caprari wrote:
Hi.

I've tried the mapreduce example in 0.6 contrib/wordcount and it
worked very well.

I have a shallow understanding of both worlds, so pardon my questions:

Is the integration with hadoop just 'semantic' (ie map/reduce api is
only used as query abstraction) or is
it 'structural' (ie cassandra can 'talk to hadoop' and replace HDFS as
input source)?

In practice:
- If I want to run a distributed mapreduce job on cassandra, does my
cassandra cluster have to be a hadoop cluster as well?
- do I get data locality optimization: I reckon cassandra can in
principle figure out where it is best to execute a
SlicePredicate/Mapper,
but to do so it should take over some of the responsibilities of
hadoop's jobtracker. Does it?

Thanks.
