I believe DataStax is working on better integration here, but until that is ready you can use the applySchema API. Basically, you convert the CassandraTable into an RDD of Row objects with a .map() and then call applySchema (provided by SQLContext) to get a SchemaRDD.
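As a rough sketch (the column names and types below are made up for illustration; substitute the ones from your actual table, and note that registerTempTable is the 1.1 name for registerAsTable):

import com.datastax.spark.connector._
import org.apache.spark.sql._

val sqlContext = new SQLContext(sc)

// Describe the table layout explicitly. Here we assume a hypothetical
// table with an int "id" column and a text "name" column.
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true)))

// Map each CassandraRow to a Spark SQL Row. This stays distributed,
// so there is no need to collect the data with toArray.
val rowRDD = sc.cassandraTable("<keyspace>", "<column_family>")
  .map(r => Row(r.getInt("id"), r.getString("name")))

// Apply the schema and register the result for SQL queries.
val schemaRDD = sqlContext.applySchema(rowRDD, schema)
schemaRDD.registerTempTable("objects")

This avoids both the toArray collect and the JSON round trip, since the schema is supplied directly instead of being inferred by jsonRDD.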
More details will be available in the SQL Programming Guide for 1.1 (which will hopefully be published in the next day or two). You can see the raw version here:
https://raw.githubusercontent.com/apache/spark/master/docs/sql-programming-guide.md
Look for the section "Programmatically Specifying the Schema".

On Mon, Sep 8, 2014 at 7:22 AM, gtinside <gtins...@gmail.com> wrote:
> Hi,
>
> I am reading data from Cassandra through the DataStax spark-cassandra
> connector, converting it into JSON, and then running Spark SQL on it.
> Refer to the code snippet below:
>
> step 1 >>>>> val o_rdd = sc.cassandraTable[CassandraRDDWrapper](
>     '<keyspace>', '<column_family>')
> step 2 >>>>> val tempObjectRDD = sc.parallelize(o_rdd.toArray.map(i=>i), 100)
> step 3 >>>>> val objectRDD = sqlContext.jsonRDD(tempObjectRDD)
> step 4 >>>>> objectRDD.registerAsTable("objects")
>
> At step (2) I have to explicitly do a "toArray" because jsonRDD takes an
> RDD[String]. For me, calling "toArray" on the Cassandra RDD takes forever,
> as I have a million records in Cassandra. Is there a better way of doing
> this? How can I optimize it?