Hi,

There is a good reason why the decision about caching is left to the user: Spark does not know how a DataFrame or RDD will be used in the future.
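For example (just a minimal sketch, assuming a local SparkSession and a made-up input path "input.txt"): only you, the author of the program, know that the same DataFrame feeds two separate actions, so only you can decide that caching it is worthwhile.

    import org.apache.spark.sql.SparkSession

    object CacheExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("cache-example")
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._

        // Build a DataFrame of words from a hypothetical input file.
        val df = spark.read.textFile("input.txt")
          .flatMap(_.split("\\s+"))
          .toDF("word")

        // The user knows df is reused by the two actions below, so caching pays off.
        // Spark, when it runs the first action, cannot know a second one follows.
        df.cache()

        println(df.count())                  // first action: materializes and caches df
        df.groupBy("word").count().show()    // second action: reuses the cached data

        spark.stop()
      }
    }

Without the cache() the second action would recompute df from the input, because at the time of the first action Spark had no way of knowing another job was coming.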
Think about how your program runs: the driver program is still executing, so there is an exact point where execution currently stands. When Spark reaches an action it evaluates that Spark job, but it knows nothing about the jobs that come later. Cached data is only useful for a future job that reuses it. The user, on the other hand, has this information, since he writes all the jobs.

Attila