Hi all, I was wondering why RDD.count recomputes the RDD in all cases. In most cases it could simply ask the RDD it depends on. I have several RDD implementations and was surprised to see that a call like the following never calls my RDD's count method but instead recomputes/traverses the entire dataset:

    val myRDD: MyRDD = ...
    myRDD.map({ ... }).count()

Unless I'm mistaken, a MappedRDD never needs to do more than call 'count' on the underlying RDD. In all of my cases the underlying RDD's count method knows the count without a recompute (e.g. one of them selects the count from a DB). This is MUCH less expensive than recomputing the RDD.

Thanks.
Jim

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-count-tp11298.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
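
To make the delegation idea in the message above concrete, here is a minimal sketch in plain Scala. It is a toy model and not Spark's RDD API: Dataset, KnownSizeSource, Mapped and Filtered are hypothetical names invented for illustration, and KnownSizeSource merely stands in for a source (such as a DB-backed RDD) that knows its size without a traversal. It only shows that a one-to-one map could hand count to its parent, while a filter could not.

```scala
// Toy model only: none of these types exist in Spark.
trait Dataset[A] {
  def iterator: Iterator[A] // full traversal, the expensive path

  // Default behaviour: walk every element and count it.
  def count: Long = {
    var n = 0L
    val it = iterator
    while (it.hasNext) { it.next(); n += 1 }
    n
  }

  def map[B](f: A => B): Dataset[B] = new Mapped(this, f)
  def filter(p: A => Boolean): Dataset[A] = new Filtered(this, p)
}

// Hypothetical source that knows its size up front.
final class KnownSizeSource(n: Int) extends Dataset[Int] {
  var traversals = 0 // records how often the data was actually walked
  def iterator: Iterator[Int] = { traversals += 1; Iterator.range(0, n) }
  override def count: Long = n.toLong // no traversal needed
}

// map is one-to-one, so the element count equals the parent's count.
final class Mapped[A, B](parent: Dataset[A], f: A => B) extends Dataset[B] {
  def iterator: Iterator[B] = parent.iterator.map(f)
  override def count: Long = parent.count // delegates; f is never run
}

// filter can drop elements, so it keeps the traversing default count.
final class Filtered[A](parent: Dataset[A], p: A => Boolean) extends Dataset[A] {
  def iterator: Iterator[A] = parent.iterator.filter(p)
}
```

In this sketch, new KnownSizeSource(5).map(_ * 2).count never touches the iterator, whereas the same call through filter has to walk the data. One consequence of delegating is that the mapping function is skipped entirely, so any side effects it has would not happen.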