Re: Issue with repartition and cache

2016-10-31 Thread ankits
Hi, did you ever figure this one out? I'm seeing the same behavior: calling cache() after a repartition() makes Spark cache the version of the RDD from BEFORE the repartition, which means a shuffle every time it is accessed. However, calling cache() before the repartition() seems to work fine; the cach…
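A minimal sketch of the two call orders being compared, assuming a local SparkContext; the variable names and partition count are illustrative, not from the thread. The thread reports the first ordering as problematic and the second as a workaround:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local context for illustration only (app name is hypothetical).
val sc = new SparkContext(
  new SparkConf().setMaster("local[*]").setAppName("cache-order-sketch"))

val rdd = sc.parallelize(1 to 1000)

// Ordering reported as problematic in the thread: cache() after
// repartition() allegedly caches the pre-repartition lineage,
// re-shuffling on each access.
val cachedAfter = rdd.repartition(8).cache()

// Ordering reported to work: cache() before repartition().
val cachedBefore = rdd.cache().repartition(8)

println(cachedAfter.count())  // 1000
println(cachedBefore.count()) // 1000

sc.stop()
```

Both orderings produce the same results; the difference the thread describes is purely about what gets materialized in the cache and how much shuffling happens on re-access.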

Re: Issue with repartition and cache

2015-01-21 Thread Sandy Ryza
Hi Dirceu, does the issue not show up if you run "map(f => f(1).asInstanceOf[Int]).sum" on the "train" RDD? It appears that f(1) is a String, not an Int. If you're looking to parse and convert it, use "toInt" instead of "asInstanceOf". -Sandy On Wed, Jan 21, 2015 at 8:43 AM, Dirceu…
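Sandy's point can be shown in plain Scala, no Spark required. The value below is a hypothetical stand-in for one field of the "train" RDD: asInstanceOf is a cast, not a conversion, so casting a String field to Int fails at runtime, while toInt actually parses the text:

```scala
// Hypothetical field value; in the thread, f(1) is a String, not an Int.
val field: Any = "42"

// asInstanceOf does not convert: casting a String to Int throws
// ClassCastException at runtime.
val castFails =
  try { field.asInstanceOf[Int]; false }
  catch { case _: ClassCastException => true }

// toInt parses the text into a number.
val parsed = field.toString.toInt

println(castFails) // true
println(parsed)    // 42
```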

Re: Issue with repartition and cache

2015-01-21 Thread Dirceu Semighini Filho
Hi Sandy, thanks for the reply. I tried to run this code without the cache and it worked. It also works if I cache before the repartition; the problem seems to be related to the combination of repartition and caching. My train is a SchemaRDD, and if I make all my columns StringType, the error doesn'…
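A minimal plain-Scala sketch (no Spark needed) of the parse-then-sum step Sandy suggests, assuming each row is a sequence of String fields, as in Dirceu's all-StringType workaround; the sample data is hypothetical:

```scala
// Hypothetical rows with every column kept as a String.
val train = Seq(
  Seq("a", "10"),
  Seq("b", "20"),
  Seq("c", "12")
)

// Parse column 1 with toInt (not asInstanceOf) before summing.
val total = train.map(f => f(1).toInt).sum

println(total) // 42
```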