Dear Spark developers,

I am trying to understand how the Spark UI displays operations on a cached RDD.

For example, the following code caches an RDD:
>> val rdd = sc.parallelize(1 to 5, 5).zipWithIndex.cache
>> rdd.count
The Jobs tab shows me that the RDD is evaluated:
: 1    count at <console>:24           2015/10/09 16:15:43    0.4 s    1/1
: 0    zipWithIndex at <console>:21    2015/10/09 16:15:38    0.6 s    1/1
And I can observe this RDD in the Storage tab of the Spark UI:
: ZippedWithIndexRDD  Memory Deserialized 1x Replicated
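
For reference, the cached state can also be confirmed from the shell: after the first count, rdd.toDebugString should include a CachedPartitions entry for the ZippedWithIndexRDD.
>> rdd.toDebugString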

Then I want to perform an operation on the cached RDD, so I run the following code:
>> val g = rdd.groupByKey()
>> g.count
The Jobs tab shows me a new Job:
: 2 count at <console>:26
Inside this Job there are two stages:
: 3    count at <console>:26           2015/10/09 16:16:18    0.2 s    5/5
: 2    zipWithIndex at <console>:21
It shows that zipWithIndex is executed again. This does not seem reasonable, 
because the RDD is cached and zipWithIndex was already executed by the 
previous job.
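
As a rough way to tell whether the partitions are actually recomputed, rather 
than only displayed, I would expect an accumulator probe along these lines to 
work (just a sketch with throwaway names; note that accumulator updates inside 
transformations can over-count if tasks are retried):
>> val evals = sc.accumulator(0)
>> val probed = sc.parallelize(1 to 5, 5).map { x => evals += 1; x }.zipWithIndex.cache
>> probed.count
>> evals.value                // 5: each element computed once
>> probed.groupByKey().count
>> evals.value                // still 5 if the cached partitions were reused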

Could you explain why, when I perform an operation followed by an action on a 
cached RDD, the Spark UI shows the last operation in the cached RDD's lineage 
as executed again?


Best regards, Alexander
