Hi,
Did you ever figure this one out? I'm seeing the same behavior:
Calling cache() after a repartition() makes Spark cache the version of the
RDD from BEFORE the repartition, which means a shuffle every time it is accessed.
However, calling cache() before the repartition() seems to work fine; the
cach…
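The two orderings being compared can be sketched as follows. This is a minimal illustration, not a reproduction of the reported bug; the input path and partition count are placeholders, and `sc` is assumed to be an existing SparkContext:

```scala
// Ordering A: cache() called AFTER repartition().
// The poster reports that in this case Spark appears to cache the
// pre-repartition lineage, so each action triggers a shuffle again.
val cachedAfter = sc.textFile("data.txt").repartition(100).cache()

// Ordering B: cache() called BEFORE repartition().
// Here the base RDD is materialized once, and the repartition runs
// on top of the cached data; the poster reports this works fine.
val cachedBefore = sc.textFile("data.txt").cache().repartition(100)
```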
Hi Dirceu,
Does the issue still show up if you run "map(f =>
f(1).asInstanceOf[Int]).sum" on the "train" RDD without the cache? It appears
that f(1) is a String, not an Int. If you're looking to parse and convert
it, "toInt" should be used instead of "asInstanceOf".
-Sandy
On Wed, Jan 21, 2015 at 8:43 AM, Dirceu
Hi Sandy, thanks for the reply.
I tried to run this code without the cache and it worked.
Also, if I cache before the repartition, it works; the problem seems to be
something related to the interaction between repartition and caching.
My train is a SchemaRDD, and if I make all my columns StringType, the
error doesn'