Hello all,

I worked around this for now using a class I already had: it inherits from
RDD and is the base class that all of our custom RDDs inherit from. I did
the following:

1) Overloaded all of the transformations (that get used in our app) that
don't change the RDD's size, wrapping the results in a proxy RDD that
intercepts the count() call, returning a cached value or calling an abstract
"calculateSize" if it doesn't already know the count.
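In case it helps, here's a rough sketch of the shape of that base class; the names (SizedRDD, calculateSize) and the map overload are my approximation of what's described above, not the actual code:

```scala
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Hypothetical sketch of the base class: caches the count once known,
// and overloads size-preserving transformations so the result RDD
// inherits the cached count instead of recomputing it.
abstract class SizedRDD[T: ClassTag](parent: RDD[T])
    extends RDD[T](parent) {

  @volatile protected var knownCount: Long = -1L  // -1 means "not yet known"

  // Subclasses supply a cheap way to compute the size when it isn't cached.
  protected def calculateSize(): Long

  override def count(): Long = {
    if (knownCount < 0) knownCount = calculateSize()
    knownCount
  }

  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    firstParent[T].iterator(split, context)

  override protected def getPartitions: Array[Partition] =
    firstParent[T].partitions

  // map is one-to-one, so the mapped RDD can reuse this RDD's count.
  override def map[U: ClassTag](f: T => U): RDD[U] = {
    val self = this
    new SizedRDD[U](super.map(f)) {
      override protected def calculateSize(): Long = self.count()
    }
  }
}
```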

2) Piggybacked a count calculation on all of the actions we use (aggregate,
reduce, fold, foreach), so that, as a side effect of calling any of them,
the count is calculated and stored if it isn't already known.
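For aggregate, the piggybacking could look something like the sketch below (again my guess at the technique, assuming the SizedRDD-style base class and a knownCount field as above): thread a running element count through the user's seqOp/combOp by pairing it with the zero value, so one pass produces both the user's result and the size.

```scala
// Hypothetical: inside the base class, wrap aggregate so the count
// comes along for free on the first action that runs.
override def aggregate[U: ClassTag](zeroValue: U)(
    seqOp: (U, T) => U, combOp: (U, U) => U): U = {
  if (knownCount >= 0) {
    // Count already cached; run the action unmodified.
    super.aggregate(zeroValue)(seqOp, combOp)
  } else {
    // Pair the user's accumulator with a Long counter and run one pass.
    val (result, n) = super.aggregate((zeroValue, 0L))(
      { case ((u, c), t)          => (seqOp(u, t), c + 1L) },
      { case ((u1, c1), (u2, c2)) => (combOp(u1, u2), c1 + c2) })
    knownCount = n
    result
  }
}
```

foreach can be handled the same way with a long accumulator incremented inside the closure, then read back on the driver after the job completes.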

The one thing I couldn't do (at least yet) was get zipWithIndex to calculate
the count, because its implementation is too opaque inside the RDD.

If anyone wants to see the code I can post it.

Thanks for the responses.

Jim




--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-count-tp11298p11311.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
