Hi Arun,

The limit for the YARN user on the cluster nodes should be all that matters. What version of Spark are you using? If you can turn on sort-based shuffle, it should solve this problem.
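For example (assuming Spark 1.1 or later, where the sort-based shuffle manager is available as an option; in 1.2+ it's already the default), you could set it in spark-defaults.conf or pass it when you submit, something like:

    spark-submit --conf spark.shuffle.manager=sort ... <your app>

(just a sketch, adjust to however you normally launch the job). Sort-based shuffle writes far fewer files per map task than the hash-based shuffle, which is usually what drives the "Too many open files" errors.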
-Sandy

On Tue, Feb 10, 2015 at 1:16 PM, Arun Luthra <arun.lut...@gmail.com> wrote:

> Hi,
>
> I'm running Spark on YARN from an edge node, and the tasks run on the Data
> Nodes. My job fails with the "Too many open files" error once it gets to
> groupByKey(). Alternatively, I can make it fail immediately if I
> repartition the data when I create the RDD.
>
> Where do I need to make sure that ulimit -n is high enough?
>
> On the edge node it is small, 1024, but on the data nodes, the "yarn" user
> has a high limit, 32k. But is the yarn user the relevant user? And is the
> 1024 limit for myself on the edge node a problem, or is that limit not
> relevant?
>
> Arun