Dear all, I am trying to join two RDDs, named rdd1 and rdd2.
rdd1 is loaded from a text file with about 33,000 records. rdd2 is loaded from a table in Cassandra that has about 3 billion records. I tried the following code:

```scala
val rdd1: RDD[(String, XXX)] = sc.textFile(...).map(...)

import org.apache.spark.sql.cassandra.CassandraSQLContext
cc.setKeyspace("xxx")
val rdd2: RDD[(String, String)] = cc.sql("SELECT x, y FROM xxx").map(r => ...)

val result = rdd1.leftOuterJoin(rdd2)
result.take(20)
```

However, the log shows that Spark loaded all 3 billion records from Cassandra, even though only the 33,000 records from rdd1 remain at the end. Is there a way to query Cassandra based only on the keys in rdd1?

Here is some information about our system:
- Spark version: 1.3.1
- Cassandra version: 2.0.14
- The join key is the primary key of the Cassandra table.

Best,
Wush
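[Editor's note] One pattern that may avoid the full scan is `joinWithCassandraTable` from the DataStax spark-cassandra-connector, which issues per-partition-key reads for only the keys present on the Spark side. The sketch below is untested against this cluster; the keyspace name (`"xxx"`), table name (`"xxx"`), and key column (`"x"`) are placeholders taken from the query above.

```scala
import com.datastax.spark.connector._

// rdd1 holds the ~33,000 (key, value) pairs; wrap each key in a
// Tuple1 so the connector can match it against the partition key.
val keys = rdd1.keys.map(Tuple1(_))

// joinWithCassandraTable fetches only the rows whose partition key
// appears in `keys`, instead of scanning the whole table.
val joined = keys
  .joinWithCassandraTable("xxx", "xxx")
  .on(SomeColumns("x"))
```

Note that this requires the spark-cassandra-connector on the classpath (available since connector 1.2, which pairs with Spark 1.2/1.3), and it only works when the join key is the table's partition key, as stated above.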