Efficiency of leftOuterJoin a cassandra rdd

Wush Wu Tue, 14 Jul 2015 21:16:49 -0700

Dear all,

I am trying to join two RDDs, named rdd1 and rdd2.


rdd1 is loaded from a text file with about 33,000 records.

rdd2 is loaded from a table in Cassandra which has about 3 billion records.

I tried the following code:

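```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.cassandra.CassandraSQLContext

// Small side: about 33,000 records, keyed by String
val rdd1: RDD[(String, XXX)] = sc.textFile(...).map(...)

// Large side: about 3 billion rows read from Cassandra.
// cc was not defined in the original snippet; it is assumed to be created like this.
val cc = new CassandraSQLContext(sc)
cc.setKeyspace("xxx")
val rdd2: RDD[(String, String)] = cc.sql("SELECT x, y FROM xxx").map(r => ...)

// Keeps every key of rdd1, with the matching value from rdd2 if there is one
val result = rdd1.leftOuterJoin(rdd2)
result.take(20)
```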

However, the log shows that Spark loaded all 3 billion records from
Cassandra, even though only 33,000 records remain after the join.

Is there a way to query Cassandra using only the keys in rdd1?
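
To make the question concrete, here is a minimal sketch in plain Scala
collections (no Spark or Cassandra involved). The names JoinSketch, left,
table, scanJoin and lookupJoin and the sample data are made up for
illustration, and the Map only stands in for the Cassandra table. scanJoin
mimics what seems to happen now (every row of the big side is read), while
lookupJoin is the behavior I am hoping for (one lookup per key in rdd1).

```scala
object JoinSketch {
  // Result shape of a left outer join: key -> (left value, optional right value)
  type Joined = Seq[(String, (Int, Option[String]))]

  // Stand-in for rdd1: a small keyed data set (hypothetical sample data)
  val left: Seq[(String, Int)] = Seq("a" -> 1, "b" -> 2, "z" -> 26)

  // Stand-in for the Cassandra table, keyed by its primary key
  val table: Map[String, String] =
    Map("a" -> "x", "b" -> "y", "c" -> "w", "d" -> "v")

  // Current behavior: read all rows of the big side, then match by key
  def scanJoin(l: Seq[(String, Int)], rows: Iterator[(String, String)]): Joined = {
    val scanned = rows.toMap // consumes every row
    l.map { case (k, v) => (k, (v, scanned.get(k))) }
  }

  // Desired behavior: ask the big side only for the keys present on the left
  def lookupJoin(l: Seq[(String, Int)], lookup: String => Option[String]): Joined =
    l.map { case (k, v) => (k, (v, lookup(k))) }
}
```

Both functions return one record per key of the small side, so the result
is the same; the difference is only in how much of the big side is read.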

Here is some information about our system:

- The Spark version is 1.3.1
- The Cassandra version is 2.0.14
- The join key is the primary key of the Cassandra table.

Best,
Wush

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
