@spark.apache.org
Subject: Re: Operations with cached RDD
The problem is not that zipWithIndex is executed again. "groupBy" triggered
hash partitioning on your keys, which caused a shuffle, and that is why
you are seeing 2 stages. You can confirm this by clicking on the
latter "zipWithIndex" stage; its input data has "(memory)" written, which
means inp
Dear Spark developers,
I am trying to understand how the Spark UI displays operations on a cached RDD.
For example, the following code caches an RDD:
>> val rdd = sc.parallelize(1 to 5, 5).zipWithIndex.cache
>> rdd.count
The Jobs tab shows me that the RDD is evaluated:
: 1 count at :24
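
The explanation in the reply can also be checked outside the UI. The following is only a sketch, assuming a spark-shell session where `sc` is already defined; the grouping key (`_._2 % 2`) is a hypothetical stand-in, since the thread does not show the actual groupBy call.

```scala
// Assumes spark-shell, where `sc` (the SparkContext) is predefined.
val rdd = sc.parallelize(1 to 5, 5).zipWithIndex.cache
rdd.count  // first action: computes the partitions and stores them in memory

// Hypothetical grouping key, chosen only for illustration.
val grouped = rdd.groupBy(_._2 % 2)

// The lineage should list a ShuffledRDD (the extra stage) on top of the
// cached zipWithIndex RDD, rather than a recomputation from scratch.
println(grouped.toDebugString)

// cache() marks the RDD as MEMORY_ONLY, so this flag is expected to be true.
println(rdd.getStorageLevel.useMemory)
```

If the lineage shows the shuffle sitting above the cached RDD, the second stage in the UI comes from groupBy and not from re-running zipWithIndex.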