Hey Andrew, Matei,

Thanks for responding.
For some more context, we were running into "Too many open files" issues that happened immediately after the Collect phase (about 30 seconds into a run) on a decently sized dataset (14 MM rows). The ulimit set in spark-env was 256,000, which we believe should have been enough, but even with it set at that number we were still seeing issues. Can you comment on what a "good" ulimit should be in these cases?

We believe what might have caused this is that some process got orphaned without cleaning up its open file handles. However, other than anecdotal evidence and some speculation, we don't have much to expand on this further.

We were also wondering if we could get some more information about how many files get opened during a shuffle. We discussed that it is going to be around N x M, where N is the number of tasks and M is the number of reducers. Does this sound about right? Are there any other considerations we should be aware of when setting consolidateFiles to true? (A rough sketch of these settings and the file-count arithmetic follows below the quoted thread.)

Thanks,

Zach Fry
Palantir | Developer Support Engineer
z...@palantir.com | 650.226.6338

On 11/3/14, 6:28:09 PM, "Matei Zaharia" <matei.zaha...@gmail.com> wrote:

>In Spark 1.1, the sort-based shuffle (spark.shuffle.manager=sort) will
>have better performance while creating fewer files. So I'd suggest trying
>that too.
>
>Matei
>
>> On Nov 3, 2014, at 6:12 PM, Andrew Or <and...@databricks.com> wrote:
>>
>> Hey Matt,
>>
>> There's some prior work that compares consolidation performance on a
>> medium-scale workload:
>>
>> http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf
>>
>> There we noticed about a 2x performance degradation in the reduce phase
>> on ext3. I am not aware of any other concrete numbers. Maybe others have
>> more experiences to add.
>>
>> -Andrew
>>
>> 2014-11-03 17:26 GMT-08:00 Matt Cheah <mch...@palantir.com>:
>>
>>> Hi everyone,
>>>
>>> I'm running into more and more cases where too many files are opened
>>> when spark.shuffle.consolidateFiles is turned off.
>>>
>>> I was wondering if this is a common scenario among the rest of the
>>> community, and if so, whether it is worth turning the setting on by
>>> default. From the documentation, it seems like performance could be
>>> hurt on ext3 file systems. However, what concrete performance
>>> degradation is typically seen? A 2x slowdown in the average job? 3x?
>>> Also, what causes the performance degradation on ext3 file systems
>>> specifically?
>>>
>>> Thanks,
>>>
>>> -Matt Cheah
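To make the file-count discussion above concrete, here is a minimal sketch of how the two settings mentioned in the thread (spark.shuffle.consolidateFiles and spark.shuffle.manager) might be applied via SparkConf in Spark 1.1, together with a back-of-envelope estimate of shuffle file counts. Only the property names come from the messages above; the task, reducer, and core numbers are made-up examples, and the per-executor cores x reducers figure for consolidation reflects a common understanding of the hash shuffle rather than anything stated in the thread.

    import org.apache.spark.SparkConf

    // Minimal sketch, not from the thread: shows where the shuffle settings
    // discussed above would be set, plus rough file-count arithmetic.
    object ShuffleFileSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("shuffle-config-sketch")
          // Hash-based shuffle with consolidation, as asked about above:
          .set("spark.shuffle.consolidateFiles", "true")
          // Alternatively, the sort-based shuffle Matei suggests:
          // .set("spark.shuffle.manager", "sort")

        // Hypothetical sizes -- substitute the values from your own job.
        val mapTasks         = 2000L  // N: number of map tasks
        val reducers         = 200L   // M: number of reduce partitions
        val coresPerExecutor = 8L     // concurrent task slots per executor (assumption)

        // Hash shuffle without consolidation: roughly one file per (map task, reducer) pair.
        val withoutConsolidation = mapTasks * reducers
        // With consolidation (assumed behavior): map tasks on the same core reuse a
        // file group, so roughly cores x reducers files per executor.
        val withConsolidationPerExecutor = coresPerExecutor * reducers

        println(s"~$withoutConsolidation shuffle files total without consolidation")
        println(s"~$withConsolidationPerExecutor shuffle files per executor with consolidation")
        println(s"spark.shuffle.consolidateFiles = ${conf.get("spark.shuffle.consolidateFiles")}")
      }
    }

Whatever the estimate, it still has to fit under the per-process ulimit alongside sockets and the other descriptors an executor keeps open, which may be part of why 256,000 was not enough in the case described above.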