I'm using Spark 1.1.0 with sort-based shuffle.

I found that I can work around the issue by applying repartition(N) with a
small enough N after creating the RDD, though I'm losing some
speed/parallelism by doing this. For my algorithm I need to stay with
groupByKey.
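
Roughly, the shape of the workaround is sketched below (a minimal sketch only:
the input path, the (key, value) parsing, and N = 200 are placeholders, not
details from my actual job):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.SparkContext._   // pair-RDD implicits needed in Spark 1.x

  val sc = new SparkContext(new SparkConf().setAppName("groupByKey-workaround"))

  val records = sc.textFile("hdfs:///path/to/input")      // placeholder input
    .map { line =>
      val fields = line.split('\t')
      (fields(0), fields(1))                               // placeholder (key, value) parsing
    }

  // Keeping N small caps the number of shuffle partitions, which caps the number
  // of shuffle files open at once -- at the cost of the parallelism mentioned above.
  val grouped = records
    .repartition(200)                                      // placeholder N; tune downward
    .groupByKey()

  grouped.saveAsTextFile("hdfs:///path/to/output")         // placeholder output
  sc.stop()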

On Tue, Feb 10, 2015 at 11:41 PM, Felix C <felixcheun...@hotmail.com> wrote:

>  Alternatively, is there another way to do it?
> groupByKey has been called out as expensive and should be avoided (it causes
> shuffling of data).
>
> I've generally found it possible to use reduceByKey instead.
>
> --- Original Message ---
>
> From: "Arun Luthra" <arun.lut...@gmail.com>
> Sent: February 10, 2015 1:16 PM
> To: user@spark.apache.org
> Subject: Open file limit settings for Spark on Yarn job
>
>  Hi,
>
>  I'm running Spark on Yarn from an edge node, and the tasks run on the
> Data Nodes. My job fails with the "Too many open files" error once it gets
> to groupByKey(). Alternatively, I can make it fail immediately by
> repartitioning the data when I create the RDD.
>
>  Where do I need to make sure that ulimit -n is high enough?
>
>  On the edge node it is small (1024), but on the data nodes the "yarn"
> user has a high limit (32k). Is the yarn user the relevant user here? And is
> the 1024 limit for my own user on the edge node a problem, or is that limit
> not relevant?
>
>  Arun
>
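
For completeness, the reduceByKey pattern Felix describes looks roughly like
the sketch below when the per-key result can be built with an associative
combine (a per-key count is used here purely as an illustration; it doesn't
fit my algorithm, which is why I'm stuck with groupByKey):

  import org.apache.spark.SparkContext._   // pair-RDD implicits needed in Spark 1.x
  import org.apache.spark.rdd.RDD

  def countPerKey(records: RDD[(String, String)]): RDD[(String, Long)] =
    records
      .mapValues(_ => 1L)        // substitute whatever partial result applies
      .reduceByKey(_ + _)        // combines map-side, so far less data is shuffled
                                 // than with groupByKey followed by a local reduce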
