I'm using Spark 1.1.0 with sort-based shuffle.
I found that I can work around the issue by applying repartition(N) with a
small enough N after creating the RDD, though I'm losing some
speed/parallelism by doing this. For my algorithm I need to stay with
groupByKey.
On Tue, Feb 10, 2015 at 11:41 PM,
Alternatively, is there another way to do it?
groupByKey has been called out as expensive and should be avoided (it shuffles
all of the values for every key across the cluster).
I've generally found it possible to use reduceByKey instead.
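A quick sketch of what I mean, assuming the per-key work is something like a
sum (the data here is made up):

  import org.apache.spark.SparkContext._   // pair-RDD implicits in 1.1.x
  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("reduceByKey-sketch"))
  val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

  // groupByKey ships every value for a key to a single reducer, then sums:
  val viaGroup = pairs.groupByKey().mapValues(_.sum)

  // reduceByKey combines values map-side first, so far less data is shuffled:
  val viaReduce = pairs.reduceByKey(_ + _)

  viaReduce.collect().foreach(println)      // (a,4), (b,2)

If the downstream logic truly needs all the values for a key at once this
won't apply, but for most aggregations it does.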
--- Original Message ---
From: "Arun Luthra"
Sent: February 10, 2015 1:16 PM
To: user@spark.apache.org
Hi Arun,
The limit for the YARN user on the cluster nodes should be all that
matters. What version of Spark are you using? If you can turn on
sort-based shuffle, it should solve this problem.
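If you're on 1.1.x, setting spark.shuffle.manager should turn it on (it can
also go in spark-defaults.conf or be passed with --conf):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("sort-shuffle")
    .set("spark.shuffle.manager", "sort")   // defaults to "hash" in 1.1.x

  val sc = new SparkContext(conf)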
-Sandy
On Tue, Feb 10, 2015 at 1:16 PM, Arun Luthra wrote:
> Hi,
>
> I'm running Spark on Yarn from a