Hi, did you ever figure this one out? I'm seeing the same behavior:
Calling cache() after a repartition() makes Spark cache the version of the RDD from BEFORE the repartition, which means a shuffle every time it is accessed. However, calling cache() before the repartition() seems to work fine: the cached version has the new partitioning applied.

In summary, these two patterns don't seem to work as expected:

    repartition()
    cache()

    repartition()
    cache()
    count()

But this works:

    cache()
    repartition()

Very strange.

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Issue-with-repartition-and-cache-tp10235p19664.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
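For concreteness, here is a minimal Scala sketch of the two orderings being compared. This assumes an existing SparkContext `sc`; the RDD contents, partition counts, and variable names are illustrative, not from the original report:

    // Assumes an existing SparkContext `sc`.
    val rdd = sc.parallelize(1 to 1000000, 4)

    // Pattern reported NOT to work as expected: cache() after repartition().
    // The cached data reportedly reflects the pre-repartition RDD,
    // so each subsequent action re-runs the shuffle.
    val repartitionedThenCached = rdd.repartition(16).cache()
    repartitionedThenCached.count()  // triggers the shuffle
    repartitionedThenCached.count()  // reportedly shuffles again

    // Pattern reported to work: cache() before repartition().
    val cachedThenRepartitioned = rdd.cache().repartition(16)
    cachedThenRepartitioned.count()

Whether the shuffle recurs can be checked in the Spark UI's stage view or via the RDD's storage info after each count().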