Can you not use a Cassandra OutputFormat? It seems they have a BulkOutputFormat.
An example of using it with Hadoop is here:
http://shareitexploreit.blogspot.com/2012/03/bulkloadto-cassandra-with-hadoop.html
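The Hadoop-facing part is mostly ConfigHelper configuration; roughly something
like this (untested sketch against the Cassandra 2.0-era Thrift/Hadoop classes;
the host and port are placeholders, and keyspace "customer" / table "rawts" are
just taken from your stack trace):

import org.apache.cassandra.hadoop.ConfigHelper
import org.apache.hadoop.mapreduce.Job

val job = new Job()
val conf = job.getConfiguration
// keyspace and column family to stream into (from your stack trace)
ConfigHelper.setOutputColumnFamily(conf, "customer", "rawts")
// any live node, used to discover the ring (placeholder address/port)
ConfigHelper.setOutputInitialAddress(conf, "127.0.0.1")
ConfigHelper.setOutputRpcPort(conf, "9160")
// must match the partitioner of the target cluster
ConfigHelper.setOutputPartitioner(conf, "Murmur3Partitioner")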

Using it with Spark will be similar to the examples:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/CassandraTest.scala
and
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/CassandraCQLTest.scala
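Wiring that into Spark should then look a lot like the ColumnFamilyOutputFormat
write in CassandraTest.scala, just with BulkOutputFormat instead. Roughly
(untested sketch; mkMutation, the "value" column and msg.key / msg.value are
made-up stand-ins for however you map your Message objects to Thrift mutations):

import java.nio.ByteBuffer
import java.util.{Arrays, List => JList}
import org.apache.cassandra.hadoop.BulkOutputFormat
import org.apache.cassandra.thrift.{Column, ColumnOrSuperColumn, Mutation}
import org.apache.cassandra.utils.ByteBufferUtil
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// One Thrift Mutation per column, as in the CassandraTest example.
def mkMutation(name: String, value: String): Mutation = {
  val c = new Column()
  c.setName(ByteBufferUtil.bytes(name))
  c.setValue(ByteBufferUtil.bytes(value))
  c.setTimestamp(System.currentTimeMillis)
  val m = new Mutation()
  m.setColumn_or_supercolumn(new ColumnOrSuperColumn())
  m.column_or_supercolumn.setColumn(c)
  m
}

// (row key, mutation list) pairs derived from your RDD[Message]
// (msg.key / msg.value are hypothetical fields of your Message type)
val mutations: RDD[(ByteBuffer, JList[Mutation])] = rdd.map { msg =>
  (ByteBufferUtil.bytes(msg.key), Arrays.asList(mkMutation("value", msg.value)))
}

mutations.saveAsNewAPIHadoopFile(
  "customer",
  classOf[ByteBuffer],
  classOf[JList[Mutation]],
  classOf[BulkOutputFormat],
  job.getConfiguration)

IIRC, BulkOutputFormat writes SSTables to a local directory in each task and
then streams them into the ring, so each partition loads its own data in
parallel without you having to go through CQLSSTableWriter at all.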


On Wed, Jun 25, 2014 at 8:44 PM, Gerard Maas <gerard.m...@gmail.com> wrote:

> Hi,
>
> (My excuses for the cross-post from SO)
>
> I'm trying to create Cassandra SSTables from the results of a batch
> computation in Spark. Ideally, each partition should create the SSTable for
> the data it holds in order to parallelize the process as much as possible
> (and probably even stream it to the Cassandra ring as well)
>
> After the initial hurdles with the CQLSSTableWriter (like requiring the
> yaml file), I'm now confronted with this issue:
>
> java.lang.RuntimeException: Attempting to load already loaded column family 
> customer.rawts
>     at org.apache.cassandra.config.Schema.load(Schema.java:347)
>     at org.apache.cassandra.config.Schema.load(Schema.java:112)
>     at 
> org.apache.cassandra.io.sstable.CQLSSTableWriter$Builder.forTable(CQLSSTableWriter.java:336)
>
> I'm creating a writer on each parallel partition like this:
>
> def store(rdd:RDD[Message]) = {
>     rdd.foreachPartition( msgIterator => {
>       val writer = CQLSSTableWriter.builder()
>         .inDirectory("/tmp/cass")
>         .forTable(schema)
>         .using(insertSttmt).build()
>       msgIterator.foreach(msg => {...})
>     })}
>
> And if I'm reading the exception correctly, I can only create one writer
> per table in one JVM. Digging a bit further in the code, it looks like the
> Schema.load(...) singleton enforces that limitation.
>
> I guess writes to the writer will not be thread-safe, and even if they
> were, the contention created by all the parallel tasks trying to dump a
> few GB of data to disk at the same time would defeat the purpose of using
> SSTables for bulk upload anyway.
>
> So, are there ways to use the CQLSSTableWriter concurrently?
>
> If not, what is the next best option to load batch data at high throughput
> into Cassandra?
>
> Will the upcoming Spark-Cassandra integration help with this? (i.e., should
> I just sit back and relax, and the problem will solve itself?)
>
> Thanks,
>
> Gerard.
>
