Thanks, Sean! Yes, I agree that this logging would still have some cost and so would not be used in production.
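For anyone finding this thread later, here is a rough sketch of how the timing could be switched off outside of test runs. It is only an illustration under assumptions: the `StageTiming` object, the `timed` helper, and the `app.timing` system property are hypothetical names made up for this example, not part of Spark. The only Spark calls used are `RDD.cache()` and `RDD.foreachPartition`, the latter being the cheap materializing action suggested in the quoted message.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical helper, not a Spark API: times the materialization of an RDD
// only when timing is explicitly enabled.
object StageTiming {

  // Assumption: timing is toggled with a JVM system property, e.g. -Dapp.timing=true.
  // Any other configuration mechanism would work equally well.
  val enabled: Boolean = sys.props.get("app.timing").exists(_ == "true")

  def timed[T](label: String, rdd: RDD[T]): RDD[T] = {
    if (enabled) {
      val start = System.nanoTime()
      // cache() so the forced computation is not repeated by later stages;
      // foreachPartition with a no-op body runs every partition without
      // iterating over or collecting the elements on the driver.
      rdd.cache().foreachPartition(_ => ())
      val elapsedMs = (System.nanoTime() - start) / 1e6
      println(f"$label: $elapsedMs%.1f ms")
    }
    // With timing disabled the RDD is returned untouched and stays lazy.
    rdd
  }
}
```

A component would then end with something like `StageTiming.timed("parse", parsedRdd)` (where `parsedRdd` is a placeholder name). When the flag is off nothing is cached or forced, so Spark can still pipeline stages as usual. When it is on, the measured time includes the cost of caching, so the numbers are only a rough guide.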
On Sat, Feb 21, 2015 at 1:37 AM, Sean Owen <so...@cloudera.com> wrote:

> I think the cheapest possible way to force materialization is something
> like
>
> rdd.foreachPartition(i => None)
>
> I get the use case, but as you can see there is a cost: you are forced
> to materialize an RDD and cache it just to measure the computation
> time. In principle this could be taking significantly more time than
> not doing so, since otherwise several RDD stages might proceed without
> ever even having to persist intermediate results in memory.
>
> Consider looking at the Spark UI to see how much time a stage took,
> although it's measuring end to end wall clock time, which may overlap
> with other computations.
>
> (or maybe you are disabling / enabling this logging for prod / test anyway)
>
> On Sat, Feb 21, 2015 at 4:46 AM, pnpritchard
> <nicholas.pritch...@falkonry.com> wrote:
> > Is there a technique for forcing the evaluation of an RDD?
> >
> > I have used actions to do so but even the most basic "count" has a
> > non-negligible cost (even on a cached RDD, repeated calls to count take
> > time).
> >
> > My use case is for logging the execution time of the major components in my
> > application. At the end of each component I have a statement like
> > "rdd.cache().count()" and time how long it takes.
> >
> > Thanks in advance for any advice!
> > Nick