Dear Spark developers,

I am trying to understand how the Spark UI displays operations on a cached RDD.
For example, the following code caches an RDD:

>> val rdd = sc.parallelize(1 to 5, 5).zipWithIndex.cache
>> rdd.count

The Jobs tab shows me that the RDD is evaluated:

: 1 count at <console>:24 2015/10/09 16:15:43 0.4 s 1/1
: 0 zipWithIndex at <console>:21 2015/10/09 16:15:38 0.6 s 1/1

And I can observe this RDD in the Storage tab of the Spark UI:

: ZippedWithIndexRDD Memory Deserialized 1x Replicated

Then I want to perform an operation on the cached RDD. I run the following code:

>> val g = rdd.groupByKey()
>> g.count

The Jobs tab shows me a new job:

: 2 count at <console>:26

Inside this job there are two stages:

: 3 count at <console>:26 +details 2015/10/09 16:16:18 0.2 s 5/5
: 2 zipWithIndex at <console>:21

It shows that zipWithIndex is executed again. This does not seem reasonable, because the RDD is cached and zipWithIndex was already executed previously.

Could you explain why, when I perform a transformation followed by an action on a cached RDD, the Spark UI shows the last transformation in the cached RDD's lineage as being executed again?

Best regards,
Alexander