Hi, I'm new to Spark; I've used Hadoop for some time and just joined the mailing list.
I'm considering using Spark in my application, reading data from Cassandra in Python and writing the mapped data back to Cassandra, or to ES, afterwards.

The first question I have is: is it possible to use https://github.com/datastax/spark-cassandra-connector with pyspark? I noticed there is an example of a Cassandra input format in the master branch, but I guess it will only work in the latest release.

The second question is about how Spark does M/R over NoSQL stores like Cassandra. If I understood correctly, the spark-cassandra-connector provides an RDD, so I can read data from Cassandra and use Spark to M/R it. However, when I do that, I still need HDFS to store intermediate results. Correct me if I am wrong, but MAP results are stored in the local filesystem, then a partitioner is used to shuffle the data to the Spark nodes, and then the data is reduced.

I would like to understand why it is done that way when using a tool like Cassandra. Cassandra has partitioners itself, so I could just write the MAP output (using batch inserts) to an intermediate column family and, once the map phase is complete, reduce the data. There would be no need for shuffling, as Cassandra does that very well. Do you agree with my understanding? I wonder if I can do that with Spark, whether this could be a good feature in the future, or whether you have good reasons to think it would not perform well.

Thanks in advance, I look forward to your answers.

Best regards,
Marcelo Valle.
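To make the question concrete, here is a toy, pure-Python sketch of the logical flow I mean (not Spark's actual implementation): map output pairs are assigned to partitions by hashing their key, the role the partitioner plays during the shuffle, and each partition is then reduced independently. All names here are illustrative.

```python
from collections import defaultdict

def hash_partition(mapped, num_partitions):
    """Assign each (key, value) pair from the map phase to a partition
    by hashing the key -- the role the partitioner plays in the shuffle."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in mapped:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

def reduce_partition(partition):
    """Sum the values per key within one partition (the reduce phase)."""
    totals = defaultdict(int)
    for key, value in partition:
        totals[key] += value
    return dict(totals)

# Toy map output: word counts emitted as (word, 1) pairs.
mapped = [("spark", 1), ("cassandra", 1), ("spark", 1), ("hdfs", 1)]

# Shuffle the pairs into 2 partitions, then reduce each partition.
# Since every pair for a given key lands in the same partition,
# the per-partition results can simply be merged.
results = [reduce_partition(p) for p in hash_partition(mapped, 2)]
merged = {k: v for part in results for k, v in part.items()}
print(merged)
```

My idea is that the `hash_partition` step could be replaced by batch inserts into an intermediate column family, letting Cassandra's own partitioner distribute the pairs instead.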
