subject:"Operations with cached RDD"

RE: Operations with cached RDD

2015-10-12 Thread Ulanov, Alexander

The problem is not that zipWithIndex is executed again. "groupBy" triggered hash partitioning on your keys, which caused a shuffle; that is why you are seeing 2 stages. You can confirm this by clicking on the latter "zip

Re: Operations with cached RDD

2015-10-11 Thread Nitin Goyal

The problem is not that zipWithIndex is executed again. "groupBy" triggered hash partitioning on your keys, which caused a shuffle; that is why you are seeing 2 stages. You can confirm this by clicking on the latter "zipWithIndex" stage: its input data has "(memory)" written, which means inp
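
The point about the shuffle can be checked outside the UI as well. The following is a minimal sketch, not the original poster's code: the grouping key (value % 2), the object name, the application name, and the local master setting are all assumptions made for illustration. It caches the RDD, runs a groupBy on it, and returns the lineage, which should include a ShuffledRDD sitting on top of the cached partitions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CachedRddStages {
  // Builds the cached RDD, groups it, and returns the lineage description.
  def lineage(sc: SparkContext): String = {
    val rdd = sc.parallelize(1 to 5, 5).zipWithIndex().cache()
    rdd.count() // first action: computes the partitions and stores them in the cache

    // Hypothetical follow-up operation; the grouping key is an assumption.
    // groupBy needs all records with the same key on one partition, so it shuffles.
    val grouped = rdd.groupBy { case (value, _) => value % 2 }
    grouped.count() // second action: reads cached input, then runs the shuffle

    // The storage level confirms the parent RDD is still marked as cached in memory.
    require(rdd.getStorageLevel.useMemory)
    grouped.toDebugString
  }

  def main(args: Array[String]): Unit = {
    // Local master is an assumption so the sketch runs without a cluster.
    val conf = new SparkConf().setAppName("cached-rdd-stages").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(lineage(sc))
    finally sc.stop()
  }
}
```

In the printed lineage, the change in indentation marks the shuffle boundary, which is what the UI draws as a second stage. The stage named after zipWithIndex is not recomputing anything; it only reads the cached partitions and writes the shuffle output.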

Operations with cached RDD

2015-10-09 Thread Ulanov, Alexander

Dear Spark developers,

I am trying to understand how the Spark UI displays operations on a cached RDD. For example, the following code caches an RDD:

>> val rdd = sc.parallelize(1 to 5, 5).zipWithIndex.cache
>> rdd.count

The Jobs tab shows me that the RDD is evaluated:
1 count at :24