Sorry, I also have some follow-up questions about this:

"In general if a node in your cluster has C assigned cores and you run a job with X reducers then Spark will open C*X files in parallel and start writing."
Some questions came to mind just now:

1) It would be nice to have a brief overview of what these files are being used for.
2) Are these C*X files opened on each machine? Also, is C the total number of cores across all machines in the cluster?

Thanks,
-Matt Cheah

On Tue, Mar 11, 2014 at 4:35 PM, Matthew Cheah <matthew.c.ch...@gmail.com> wrote:
> Thanks. Just curious, is there a default number of reducers that are used?
>
> -Matt Cheah
>
>
> On Mon, Mar 10, 2014 at 7:22 PM, Patrick Wendell <pwend...@gmail.com> wrote:
>
>> Hey Matt,
>>
>> The best way is definitely just to increase the ulimit if possible;
>> this is sort of an assumption we make in Spark that clusters will be
>> able to move it around.
>>
>> You might be able to hack around this by decreasing the number of
>> reducers, but this could have some performance implications for your
>> job.
>>
>> In general if a node in your cluster has C assigned cores and you run
>> a job with X reducers then Spark will open C*X files in parallel and
>> start writing. Shuffle consolidation will help decrease the total
>> number of files created, but the number of file handles open at any
>> time doesn't change, so it won't help the ulimit problem.
>>
>> This means you'll have to use fewer reducers (e.g. pass reduceByKey an
>> explicit number of reducers) or use fewer cores on each machine.
>>
>> - Patrick
>>
>> On Mon, Mar 10, 2014 at 10:41 AM, Matthew Cheah
>> <matthew.c.ch...@gmail.com> wrote:
>> > Hi everyone,
>> >
>> > My team (cc'ed in this e-mail) and I are running a Spark reduceByKey
>> > operation on a cluster of 10 slaves where I don't have the privileges
>> > to set "ulimit -n" to a higher number. I'm running on a cluster where
>> > "ulimit -n" returns 1024 on each machine.
>> >
>> > When I attempt to run this job with the data originating from a text
>> > file, stored in an HDFS cluster running on the same nodes as the Spark
>> > cluster, the job crashes with the message "Too many open files".
>> >
>> > My question is, why are so many files being created, and is there a
>> > way to configure the Spark context to avoid spawning that many files?
>> > I am already setting spark.shuffle.consolidateFiles to true.
>> >
>> > I want to repeat - I can't change the maximum number of open file
>> > descriptors on the machines. This cluster is not owned by me, and the
>> > system administrator is responding quite slowly.
>> >
>> > Thanks,
>> >
>> > -Matt Cheah
>> >
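For concreteness, here is a minimal sketch of the workaround Patrick describes, assuming the Scala API: cap the reducer count passed to reduceByKey and keep shuffle consolidation on. The paths, per-node core count, and reducer count below are illustrative, not taken from the thread.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._ // pair-RDD operations such as reduceByKey

    object ReduceByKeyUlimitSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("reduceByKey-ulimit-sketch")
          // Consolidation lowers the total number of shuffle files created,
          // but, as noted above, not the number of handles open at once.
          .set("spark.shuffle.consolidateFiles", "true")
        val sc = new SparkContext(conf)

        // Per node, open shuffle files ~= (cores on that node) * (reducers).
        // With ulimit -n = 1024 and, say, 8 cores per node (illustrative),
        // keep reducers comfortably below 1024 / 8 = 128.
        val numReducers = 64 // illustrative value

        val counts = sc.textFile("hdfs:///path/to/input") // placeholder path
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1L))
          .reduceByKey(_ + _, numReducers) // explicit reducer count caps C*X per node

        counts.saveAsTextFile("hdfs:///path/to/output") // placeholder path
        sc.stop()
      }
    }

The same idea applies anywhere reduceByKey accepts a partition count: the per-node file-handle budget is roughly cores-per-node times reducers, so if ulimit can't go up, one of those two factors has to come down.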