Spark's core module uses this connector to read data from Cassandra and create RDDs or DataFrames in its workspace (in memory or on disk, depending on the Spark configuration). Transformations or queries are then applied to those RDDs or DataFrames, and the end results are written back into Cassandra through the connector.
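As a minimal sketch of that read-transform-write cycle (keyspace, table, and column names and the contact point below are hypothetical, not from the thread):

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._  // brings in cassandraTable / saveToCassandra

object CfAToCfB {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("cf-a-to-cf-b")
      .set("spark.cassandra.connection.host", "127.0.0.1")  // assumed Cassandra contact point

    val sc = new SparkContext(conf)

    // Read CF A as an RDD of CassandraRow, transform each row,
    // and save the results into CF B via the connector.
    sc.cassandraTable("my_keyspace", "cf_a")  // hypothetical keyspace/table
      .map(row => (row.getString("id"), row.getInt("value") * 2))
      .saveToCassandra("my_keyspace", "cf_b", SomeColumns("id", "value"))

    sc.stop()
  }
}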
Note: If you just want to read/write from Cassandra using Spark, you can try Kundera's Spark-Cassandra Module <https://github.com/impetus-opensource/Kundera/wiki/Spark-Cassandra-Module>. Kundera exposes the operations in a JPA way and helps with quick development.

-Karthik

On Fri, Oct 9, 2015 at 8:09 PM, Marcelo Valle (BLOOMBERG/ LONDON) <mvallemil...@bloomberg.net> wrote:

> I know the connector, but having the connector only means it will take
> *input* data from Cassandra, right? What about intermediate results?
> If it stores intermediate results on Cassandra, could you please clarify
> how data locality is handled? Will it store them in another keyspace?
> I could not find any doc about it...
>
> From: user@cassandra.apache.org
> Subject: Re: Spark and intermediate results
>
> You can run Spark against your Cassandra data directly without using a
> shared filesystem.
>
> https://github.com/datastax/spark-cassandra-connector
>
> On Fri, Oct 9, 2015 at 6:09 AM Marcelo Valle (BLOOMBERG/ LONDON) <
> mvallemil...@bloomberg.net> wrote:
>
>> Hello,
>>
>> I saw this nice link from an event:
>>
>> http://www.datastax.com/dev/blog/zen-art-spark-maintenance
>>
>> I would like to test using Spark to perform some operations on a column
>> family; my objective is reading from CF A and writing the output of my
>> M/R job to CF B.
>>
>> That said, I've read this in Spark's FAQ
>> (http://spark.apache.org/faq.html):
>>
>> "Do I need Hadoop to run Spark?
>> No, but if you run on a cluster, you will need some form of shared file
>> system (for example, NFS mounted at the same path on each node). If you
>> have this type of filesystem, you can just deploy Spark in standalone
>> mode."
>>
>> The question I ask is: if I don't want to have an HDFS installation just
>> to run Spark on Cassandra, is my only option to have this NFS mounted
>> over the network?
>> It doesn't seem smart to me to use something like NFS to store Spark
>> files, as it would probably affect performance, and at the same time I
>> wouldn't like to have an additional HDFS cluster just to run jobs on
>> Cassandra.
>> Is there a way of using Cassandra itself as this "some form of shared
>> file system"?
>>
>> -Marcelo
>>
>> << ideas don't deserve respect >>
>
> << ideas don't deserve respect >>
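On the intermediate-results question in the quoted thread: Spark does not store intermediate results in Cassandra. Shuffle and spill data go to each worker's local disk, and the connector handles only the Cassandra reads and writes, so a standalone cluster needs neither NFS nor HDFS for a job like the one described above. A configuration sketch, as it might appear in the driver or spark-shell (host names and paths are assumptions):

import org.apache.spark.{SparkConf, SparkContext}

// Standalone mode with no shared filesystem: intermediate shuffle data
// stays on each worker's local disk rather than on NFS/HDFS.
val conf = new SparkConf()
  .setMaster("spark://spark-master:7077")              // assumed standalone master URL
  .setAppName("cassandra-only-job")
  .set("spark.cassandra.connection.host", "10.0.0.1")  // assumed Cassandra contact point
  .set("spark.local.dir", "/var/spark/tmp")            // per-worker local scratch directory

val sc = new SparkContext(conf)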