Re: Open file limit settings for Spark on Yarn job

2015-02-11 Thread Arun Luthra
I'm using Spark 1.1.0 with sort-based shuffle. I found that I can work around the issue by applying repartition(N) with a small enough N after creating the RDD, though I'm losing some speed/parallelism by doing this. For my algorithm I need to stay with groupByKey.
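
A minimal sketch of the workaround described above, in Scala against the RDD API. The input path, record parsing, and the value of N are illustrative assumptions, not details from the thread; the point is that shrinking the partition count before the shuffle keeps the number of simultaneously open shuffle files under the ulimit, at the cost of parallelism in later stages.

    import org.apache.spark.{SparkConf, SparkContext}

    object RepartitionWorkaround {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("repartition-workaround"))

        // Hypothetical input: tab-separated (key, value) records.
        val pairs = sc.textFile("hdfs:///data/input")
          .map { line =>
            val fields = line.split('\t')
            (fields(0), fields(1))
          }

        // Repartition down to a small N before groupByKey so fewer shuffle
        // files are created and held open at once (ulimit -n).
        val smallN = 200  // illustrative value; tune for the cluster
        val grouped = pairs.repartition(smallN).groupByKey()

        grouped.saveAsTextFile("hdfs:///data/output")
        sc.stop()
      }
    }

Depending on where the file explosion occurs, passing the partition count directly as groupByKey(smallN) may be worth trying as well, since it avoids the extra shuffle that repartition itself introduces.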

RE: Open file limit settings for Spark on Yarn job

2015-02-10 Thread Felix C
Alternatively, is there another way to do it? groupByKey has been called out as expensive and should be avoided (it causes shuffling of data). I've generally found it possible to use reduceByKey instead.
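
A short sketch of the reduceByKey substitution suggested above, for the common case where groupByKey is only used to aggregate the grouped values afterwards. The key/value types and the summing combine function are illustrative, not from the thread.

    import org.apache.spark.rdd.RDD

    // Hypothetical per-key aggregation.
    def sumPerKey(pairs: RDD[(String, Long)]): RDD[(String, Long)] = {
      // pairs.groupByKey().mapValues(_.sum) would ship every value across
      // the network before aggregating. reduceByKey applies the combine
      // function on the map side first, so far less data is shuffled and
      // far fewer shuffle files are written.
      pairs.reduceByKey(_ + _)
    }

This substitution only works when the per-key computation can be expressed as an associative merge of values; Arun's later reply (above) notes that his algorithm needs the full grouped values, which is why he stays with groupByKey.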

Re: Open file limit settings for Spark on Yarn job

2015-02-10 Thread Sandy Ryza
Hi Arun, The limit for the YARN user on the cluster nodes should be all that matters. What version of Spark are you using? If you can turn on sort-based shuffle, it should solve this problem. -Sandy
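
A sketch of enabling the sort-based shuffle in Spark 1.1.x, where the hash-based shuffle is still the default; the application name is an illustrative assumption.

    import org.apache.spark.{SparkConf, SparkContext}

    // The hash-based shuffle opens roughly one file per reducer for every map
    // task, which is what blows past the open-file limit. The sort-based
    // shuffle writes a single sorted data file (plus an index) per map task.
    val conf = new SparkConf()
      .setAppName("my-yarn-job")  // illustrative name
      .set("spark.shuffle.manager", "sort")
    val sc = new SparkContext(conf)

The same setting can be passed on the command line with spark-submit --conf spark.shuffle.manager=sort. From Spark 1.2 onward the sort-based shuffle is the default.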