Dear all,

I found a post discussing the same issue:
https://groups.google.com/a/lists.datastax.com/forum/#!searchin/spark-connector-user/join/spark-connector-user/q3GotS-n0Wk/g-LPTteCEg0J

The solution is to use "joinWithCassandraTable"; the documentation
is here:
https://github.com/datastax/spark-cassandra-connector/blob/v1.3.0-M2/doc/2_loading.md
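
For anyone who finds this later, here is a minimal sketch of that approach (assuming the keyspace and table are both named "xxx" and the join key is the table's single partition key column, as in my snippet quoted below; adjust names to the real schema):

```scala
import com.datastax.spark.connector._

// Keep only the join keys; Tuple1 maps onto a
// single-column partition key.
val keys = rdd1.map { case (k, _) => Tuple1(k) }

// Issues per-key queries against Cassandra instead of
// scanning all ~3 billion rows into Spark.
val joined = keys.joinWithCassandraTable("xxx", "xxx")
joined.take(20)
```

With only ~33,000 keys on the Spark side, this touches a tiny fraction of the table.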

Wush

2015-07-15 12:15 GMT+08:00 Wush Wu <wush...@gmail.com>:
> Dear all,
>
> I am trying to join two RDDs, named rdd1 and rdd2.
>
> rdd1 is loaded from a text file with about 33,000 records.
>
> rdd2 is loaded from a Cassandra table with about 3 billion records.
>
> I tried the following code:
>
> ```scala
> import org.apache.spark.rdd.RDD
> import org.apache.spark.sql.cassandra.CassandraSQLContext
>
> // Keyed RDD built from a text file (~33,000 records).
> val rdd1: RDD[(String, XXX)] = sc.textFile(...).map(...)
>
> // Keyed RDD built from the Cassandra table (~3 billion records).
> val cc = new CassandraSQLContext(sc)
> cc.setKeyspace("xxx")
> val rdd2: RDD[(String, String)] = cc.sql("SELECT x, y FROM xxx").map(r => ...)
>
> val result = rdd1.leftOuterJoin(rdd2)
> result.take(20)
> ```
>
> However, the logs show that Spark loaded all 3 billion records from
> Cassandra, and only about 33,000 records remained after the join.
>
> Is there a way to query Cassandra based only on the keys in rdd1?
>
> Here is some information about our system:
>
> - The Spark version is 1.3.1
> - The Cassandra version is 2.0.14
> - The join key is the primary key of the Cassandra table.
>
> Best,
> Wush

