Hi Arun,

The limit for the YARN user on the cluster nodes should be all that
matters.  What version of Spark are you using?  If you can turn on
sort-based shuffle, it should solve this problem.
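
For context, a minimal sketch of what enabling it through SparkConf might
look like (the app name is just a placeholder, and the consolidateFiles
setting only matters if you stay on the hash shuffle):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("GroupByKeyJob")  // placeholder name
      // On Spark 1.1.x, switch from the hash-based shuffle (one file per
      // mapper/reducer pair) to the sort-based shuffle, which keeps far
      // fewer files open at once. On 1.2+ "sort" is already the default.
      .set("spark.shuffle.manager", "sort")
      // If you have to stay on the hash shuffle, consolidating shuffle
      // files also cuts down the number of open file handles.
      .set("spark.shuffle.consolidateFiles", "true")

    val sc = new SparkContext(conf)

The same settings can be passed on the command line with --conf to
spark-submit instead of hard-coding them.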

-Sandy

On Tue, Feb 10, 2015 at 1:16 PM, Arun Luthra <arun.lut...@gmail.com> wrote:

> Hi,
>
> I'm running Spark on YARN from an edge node, and the tasks run on the Data
> Nodes. My job fails with a "Too many open files" error once it gets to
> groupByKey(). Alternatively, I can make it fail immediately by repartitioning
> the data when I create the RDD.
>
> Where do I need to make sure that ulimit -n is high enough?
>
> On the edge node it is small, 1024, but on the data nodes the "yarn" user
> has a high limit, 32k. But is the yarn user the relevant user? And is the
> 1024 limit for my own user on the edge node a problem, or is that limit not
> relevant?
>
> Arun
>