Ran into this performance report: https://github.com/datastax/spark-cassandra-connector/issues/200
Does the Spark connector in its current state issue one CQL query per vnode, or one task per vnode? Regards.

On Tue, Sep 16, 2014 at 2:05 AM, DuyHai Doan <doanduy...@gmail.com> wrote:
> Look into the source code of the Spark connector. CassandraRDD tries to find
> all token ranges (even when using vnodes) for each node (endpoint) and
> creates RDD partitions to match this distribution of token ranges. Thus data
> locality is guaranteed.
>
> On Tue, Sep 16, 2014 at 4:39 AM, Eric Plowe <eric.pl...@gmail.com> wrote:
>
>> Interesting. The way I understand the Spark connector is that it's
>> basically a client executing a CQL query and filling a Spark RDD. Spark
>> will then handle the partitioning of data. Again, this is my understanding,
>> and it may be incorrect.
>>
>> On Monday, September 15, 2014, Robert Coli <rc...@eventbrite.com> wrote:
>>
>>> On Mon, Sep 15, 2014 at 4:57 PM, Eric Plowe <eric.pl...@gmail.com> wrote:
>>>
>>>> Based on this Stack Overflow question, vnodes affect the number of
>>>> mappers Hadoop needs to spawn, which in turn affects performance.
>>>>
>>>> With the Spark connector for Cassandra, would the same situation
>>>> happen? Would vnodes affect performance in a similar way to Hadoop?
>>>
>>> I don't know what specifically Spark does here, but if it has the same
>>> locality expectations as Hadoop generally, my belief would be: "yes."
>>>
>>> =Rob
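To illustrate DuyHai's point above: with vnodes, each node owns many small, scattered token ranges, and the connector builds its partitions by grouping ranges by the node (endpoint) that owns them, so each task reads data local to one node. Here is a minimal sketch of that grouping idea in Python; the range list and function name are hypothetical illustrations, not the connector's actual API:

```python
from collections import defaultdict

# Hypothetical token-range metadata: (start_token, end_token, owning endpoint).
# With vnodes (e.g. num_tokens=256), each node owns many such small ranges.
token_ranges = [
    (0, 100, "10.0.0.1"),
    (100, 200, "10.0.0.2"),
    (200, 300, "10.0.0.1"),
    (300, 400, "10.0.0.3"),
    (400, 500, "10.0.0.2"),
]

def partitions_by_endpoint(ranges):
    """Group token ranges by owning endpoint, yielding one partition-like
    group per node, so a task scheduled there reads only local data."""
    groups = defaultdict(list)
    for start, end, endpoint in ranges:
        groups[endpoint].append((start, end))
    return dict(groups)

for endpoint, owned in sorted(partitions_by_endpoint(token_ranges).items()):
    print(endpoint, owned)
```

The key consequence, and the contrast with the Hadoop situation in the linked issue, is that grouping many vnode ranges into one partition per endpoint avoids spawning one task per token range.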