The Hadoop integration (as demonstrated by contrib/word_count) is locality aware: it begins by querying Cassandra to generate locality aware splits, and when the hostnames match up between the Hadoop and Cassandra clusters, the data can be mapped locally.
-----Original Message----- From: "Maxim Grinev" <ma...@grinev.net> Sent: Tuesday, May 18, 2010 2:42am To: user@cassandra.apache.org Subject: Re: Hadoop over Cassandra On Tue, May 18, 2010 at 2:23 AM, Jonathan Ellis <jbel...@gmail.com> wrote: > On Mon, May 17, 2010 at 4:12 PM, Vick Khera <vi...@khera.org> wrote: > > On Mon, May 17, 2010 at 3:46 PM, Jonathan Ellis <jbel...@gmail.com> > wrote: > >> Moving to the user@ list. > >> > >> http://wiki.apache.org/cassandra/HadoopSupport should be useful. > > > > That document doesn't really answer the "is data locality preserved" > > when running the map phase, but my hunch is "no". > > The answer is, "yes, as long as you have hadoop on all the cassandra > machines." (the case where it's easy to map cassandra locality to > hadoop locality :) Jonathan, could you please clarify this. I also cannot understand how it works. Even if Hadoop is deployed on all the Cassandra machines, how will Hadoop be aware of Cassandra's data placement (partitioning and replication)? Maxim