Hi Jeff,
I think I see what you're saying. I was thinking more of a whole Spark
job, where `spark-submit` is run once to completion and then started up
again, rather than a "job" as seen in the Spark UI. I take it there is no
implicit caching of results between `spark-submit` runs.
(In the case
Hi Eric,
If the two jobs share the same parent stages, those stages can be skipped for
the second job.
Here's one simple example:
val rdd1 = sc.parallelize(1 to 10).map(e => (e, e))
val rdd2 = rdd1.groupByKey()
rdd2.map(e => e._1).collect().foreach(println)
rdd2.map(e => (e._1, e._2.size)).collect().foreach(println)
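
For reference, here is a rough, self-contained version of that example (a sketch only, assuming a local master, with a placeholder output path). Within one application the second action reuses the shuffle output of the shared stages; nothing carries over between separate spark-submit runs unless you persist results yourself:

import org.apache.spark.{SparkConf, SparkContext}

object StageReuseExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("stage-reuse").setMaster("local[*]"))

    val rdd1 = sc.parallelize(1 to 10).map(e => (e, e))
    val rdd2 = rdd1.groupByKey()   // introduces a shuffle stage

    // Job 1: runs the parallelize/map/groupByKey stages.
    rdd2.map(_._1).collect().foreach(println)

    // Job 2: shares the parent stages with job 1, so the Spark UI
    // shows them as "skipped" and the shuffle output is reused.
    rdd2.map { case (k, v) => (k, v.size) }.collect().foreach(println)

    // Across separate spark-submit runs nothing is reused implicitly;
    // write results out yourself if the next run needs them
    // (the path below is only a placeholder):
    // rdd2.mapValues(_.size).saveAsTextFile("/tmp/group-sizes")

    sc.stop()
  }
}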