Alternatively, is there another way to do it?
groupByKey has been called out as expensive and should be avoided (it causes
shuffling of data across the cluster).

I've generally found it possible to use reduceByKey instead.
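
For example, something along these lines (a rough sketch, assuming a spark-shell
where `sc` is available; names like `pairs` are illustrative, not from your job):

    // reduceByKey combines values map-side before the shuffle, so much less data
    // crosses the network and far fewer shuffle/spill files get opened than with
    // groupByKey followed by a per-key aggregation.
    val pairs: org.apache.spark.rdd.RDD[(String, Int)] =
      sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    // Instead of:
    val sumsViaGroup = pairs.groupByKey().mapValues(_.sum)

    // prefer:
    val sumsViaReduce = pairs.reduceByKey(_ + _)

If what you do after groupByKey() is really an aggregation per key, reduceByKey
(or aggregateByKey / combineByKey) usually expresses it with a much cheaper shuffle.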

--- Original Message ---

From: "Arun Luthra" <arun.lut...@gmail.com>
Sent: February 10, 2015 1:16 PM
To: user@spark.apache.org
Subject: Open file limit settings for Spark on Yarn job

Hi,

I'm running Spark on Yarn from an edge node, and the tasks run on the Data
Nodes. My job fails with the "Too many open files" error once it gets to
groupByKey(). Alternatively, I can make it fail immediately if I repartition
the data when I create the RDD.

Where do I need to make sure that ulimit -n is high enough?

On the edge node it is small (1024), but on the data nodes the "yarn" user
has a high limit (32k). But is the yarn user the relevant user? And is the
1024 limit for my own user on the edge node a problem, or is that limit not
relevant?
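
One way I could check which limit actually applies inside the executors is to
run ulimit -n from a task (a hypothetical sketch, assuming some existing RDD
called rdd in the job):

    import scala.sys.process._

    // Run `ulimit -n` once per partition on the executors and collect the
    // distinct values seen, to find out which user's limit the tasks inherit.
    val limitsSeen = rdd
      .mapPartitions(_ => Iterator(Seq("sh", "-c", "ulimit -n").!!.trim))
      .distinct()
      .collect()

    limitsSeen.foreach(println)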

Arun
