Hi Matteo,
* Hadoop MapReduce can talk to Cassandra and process the data just like
other input formats do from HDFS. But I would not recommend treating
Cassandra as a first-class replacement for HDFS; they are two very
different beasts. It will most likely always be a lot faster to let
MapReduce read data from HDFS. If you are going to run many jobs over
the same data from Cassandra, I would recommend first running a MapReduce
job that just copies the data to HDFS.
* The data is fetched from Cassandra using Thrift, so you don't have to
run the Hadoop nodes on the same machines as Cassandra.
* The input format will try to read from the local node if possible.
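For reference, here is a minimal sketch of a job driver in the spirit of
the 0.6 contrib/word_count example. The keyspace, column family, and
column names are made up for illustration, and the exact ConfigHelper
calls should be checked against the contrib source for your version:

```java
import java.util.Arrays;

import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CassandraJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "cassandra-mapreduce");
        job.setJarByClass(CassandraJobDriver.class);

        // Read input splits from Cassandra instead of HDFS.
        job.setInputFormatClass(ColumnFamilyInputFormat.class);

        // Hypothetical keyspace/column family names for illustration.
        ConfigHelper.setColumnFamily(job.getConfiguration(),
                                     "MyKeyspace", "MyColumnFamily");

        // Tell the input format which columns to fetch per row.
        SlicePredicate predicate = new SlicePredicate()
            .setColumn_names(Arrays.asList("text".getBytes()));
        ConfigHelper.setSlicePredicate(job.getConfiguration(), predicate);

        // Mapper/reducer classes and output settings would go here,
        // exactly as in an ordinary Hadoop job.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The mapper then receives each row's key plus the sliced columns as its
input, and everything downstream (reducer, output format) is plain Hadoop.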
/Johan
Matteo Caprari wrote:
Hi.
I've tried the mapreduce example in 0.6 contrib/wordcount and it
worked very well.
I have a shallow understanding of both worlds, so pardon my questions:
Is the integration with Hadoop just 'semantic' (i.e. the map/reduce API
is only used as a query abstraction), or is it 'structural' (i.e.
Cassandra can 'talk to Hadoop' and replace HDFS as an input source)?
In practice:
- If I want to run a distributed MapReduce job on Cassandra, does my
Cassandra cluster have to be a Hadoop cluster as well?
- Do I get data locality optimization? I reckon Cassandra can in
principle figure out where it is best to execute a
SlicePredicate/Mapper,
but to do so it would have to take over some of the responsibilities of
Hadoop's JobTracker. Does it?
Thanks.