Hi,

Did you ever figure this one out? I'm seeing the same behavior:

Calling cache() after a repartition() makes Spark cache the version of the
RDD from BEFORE the repartition, which means a shuffle every time it is
accessed.

However, calling cache() before the repartition() seems to work fine; the
cached version has the new partitioning applied.


In summary, these two patterns don't seem to work as expected:

-------
repartition()
cache()
--------
repartition()
cache()
count()
---------



But this works:

cache()
repartition()

Very strange.



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Issue-with-repartition-and-cache-tp10235p19664.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
