Look into the source code of the Spark connector. CassandraRDD tries to find all token ranges (even when using vnodes) for each node (endpoint) and creates RDD partitions to match this distribution of token ranges. Thus data locality is guaranteed.
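To illustrate the idea (this is not the connector's actual code), here is a minimal Python sketch of grouping token ranges by replica endpoint so that each RDD partition only reads data held locally on one node. The `token_ranges` data and endpoint addresses are made-up examples:

```python
from collections import defaultdict

# Hypothetical token ranges: (start_token, end_token, replica_endpoint).
# With vnodes, each node owns many small, non-contiguous ranges.
token_ranges = [
    (0, 100, "10.0.0.1"),
    (100, 200, "10.0.0.2"),
    (200, 300, "10.0.0.1"),
    (300, 400, "10.0.0.3"),
    (400, 500, "10.0.0.2"),
]

def partitions_by_endpoint(ranges):
    """Group token ranges by the node that holds them, so each
    partition can be scheduled on that node (data locality)."""
    groups = defaultdict(list)
    for start, end, endpoint in ranges:
        groups[endpoint].append((start, end))
    return dict(groups)

parts = partitions_by_endpoint(token_ranges)
# One group per endpoint, each containing only that node's ranges.
for endpoint, owned in sorted(parts.items()):
    print(endpoint, owned)
```

The point of the grouping is that a scan task built from such a partition has a preferred location (its endpoint), so the scheduler can run it on the node that already stores the data.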
On Tue, Sep 16, 2014 at 4:39 AM, Eric Plowe <eric.pl...@gmail.com> wrote:

> Interesting. The way I understand the Spark connector is that it's
> basically a client executing a CQL query and filling a Spark RDD. Spark
> will then handle the partitioning of data. Again, this is my understanding,
> and it may be incorrect.
>
> On Monday, September 15, 2014, Robert Coli <rc...@eventbrite.com> wrote:
>
>> On Mon, Sep 15, 2014 at 4:57 PM, Eric Plowe <eric.pl...@gmail.com> wrote:
>>
>>> Based on this Stack Overflow question, vnodes affect the number of
>>> mappers Hadoop needs to spawn, which in turn affects performance.
>>>
>>> With the Spark connector for Cassandra, would the same situation happen?
>>> Would vnodes affect performance in a similar way to Hadoop?
>>
>> I don't know what specifically Spark does here, but if it has the same
>> locality expectations as Hadoop generally, my belief would be: "yes."
>>
>> =Rob
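The quoted concern about mapper counts can be sketched with back-of-the-envelope arithmetic. The cluster size below is an assumed example; 256 was Cassandra's historical default for `num_tokens`:

```python
# Assumed cluster figures, purely illustrative.
nodes = 6
vnodes_per_node = 256          # Cassandra's historical num_tokens default

ranges_without_vnodes = nodes  # one contiguous token range per node
ranges_with_vnodes = nodes * vnodes_per_node

# If each token range becomes one Hadoop input split (one mapper),
# vnodes multiply the number of tasks that must be scheduled.
print(ranges_without_vnodes, ranges_with_vnodes)
```

Whether that split explosion hurts in practice depends on whether the framework groups adjacent ranges into fewer tasks, which is exactly what the connector's partitioning avoids by building partitions from many ranges per endpoint.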