I'm using Spark 1.1.0 with sort-based shuffle. I found that I can work around the issue by applying repartition(N) with a small enough N after creating the RDD, though I'm losing some speed/parallelism by doing this. For my algorithm I need to stay with groupByKey.
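Roughly, the workaround looks like the sketch below (Scala on the RDD API; the input path, the tab-delimited parsing, and N=200 are hypothetical placeholders, not values from my actual job):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // pair-RDD implicits needed on Spark 1.1.x

object RepartitionWorkaround {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("repartition-workaround"))

    // Hypothetical input: tab-delimited (key, value) records read from HDFS.
    val records = sc.textFile("hdfs:///path/to/input")
      .map { line =>
        val Array(k, v) = line.split("\t", 2)
        (k, v)
      }

    // Repartitioning down to a small N right after creating the RDD is what
    // avoids the "Too many open files" failure in my case, at the cost of
    // some parallelism.
    val grouped = records
      .repartition(200)   // small enough N; tune for the cluster
      .groupByKey()

    grouped.saveAsTextFile("hdfs:///path/to/output")
    sc.stop()
  }
}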
On Tue, Feb 10, 2015 at 11:41 PM, Felix C <felixcheun...@hotmail.com> wrote:
> Alternatively, is there another way to do it? groupByKey has been called
> out as expensive and should be avoided (it causes shuffling of data).
>
> I've generally found it possible to use reduceByKey instead.
>
> --- Original Message ---
>
> From: "Arun Luthra" <arun.lut...@gmail.com>
> Sent: February 10, 2015 1:16 PM
> To: user@spark.apache.org
> Subject: Open file limit settings for Spark on Yarn job
>
> Hi,
>
> I'm running Spark on Yarn from an edge node, and the tasks run on the
> Data Nodes. My job fails with the "Too many open files" error once it gets
> to groupByKey(). Alternatively, I can make it fail immediately if I
> repartition the data when I create the RDD.
>
> Where do I need to make sure that ulimit -n is high enough?
>
> On the edge node it is small, 1024, but on the data nodes the "yarn" user
> has a high limit, 32k. But is the yarn user the relevant user? And is the
> 1024 limit for myself on the edge node a problem, or is that limit not
> relevant?
>
> Arun
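(For reference, the reduceByKey pattern Felix mentions would look roughly like the sketch below. The per-key count is just a hypothetical aggregation to show the shape of it; it doesn't fit my case, since my algorithm needs the full set of values per key.)

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // pair-RDD implicits needed on Spark 1.1.x

object ReduceByKeyExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("reduceByKey-example"))

    // Hypothetical (key, 1) pairs; the real data and aggregation differ.
    val pairs = sc.textFile("hdfs:///path/to/input")
      .map(line => (line.split("\t")(0), 1L))

    // reduceByKey combines values for each key on the map side first,
    // so much less data crosses the shuffle than with groupByKey.
    val counts = pairs.reduceByKey(_ + _)

    counts.saveAsTextFile("hdfs:///path/to/counts")
    sc.stop()
  }
}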