In Spark 1.1, the sort-based shuffle (spark.shuffle.manager=sort) will have better performance while creating fewer files. So I'd suggest trying that too.
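For reference, here's a rough sketch of how you'd set this in a SparkConf (the app name is just a placeholder; spark.shuffle.consolidateFiles only matters if you stay on the hash-based shuffle):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("ShuffleConfigExample")  // hypothetical app name
      // Use the sort-based shuffle added in Spark 1.1; it creates far fewer
      // files per map task than the hash-based shuffle.
      .set("spark.shuffle.manager", "sort")
      // If you remain on the hash-based shuffle, consolidating files reduces
      // the number of open file handles, at some cost on ext3.
      .set("spark.shuffle.consolidateFiles", "true")

    val sc = new SparkContext(conf)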
Matei

> On Nov 3, 2014, at 6:12 PM, Andrew Or <and...@databricks.com> wrote:
>
> Hey Matt,
>
> There's some prior work that compares consolidation performance on a
> medium-scale workload:
> http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf
>
> There we noticed about a 2x performance degradation in the reduce phase on
> ext3. I am not aware of any other concrete numbers. Maybe others have more
> experiences to add.
>
> -Andrew
>
> 2014-11-03 17:26 GMT-08:00 Matt Cheah <mch...@palantir.com>:
>
>> Hi everyone,
>>
>> I'm running into more and more cases where too many files are opened when
>> spark.shuffle.consolidateFiles is turned off.
>>
>> I was wondering if this is a common scenario among the rest of the
>> community, and if so, whether it is worth turning the setting on by
>> default. From the documentation, it seems like performance could be hurt
>> on ext3 file systems. However, what concrete numbers of performance
>> degradation are typically seen? A 2x slowdown in the average job? 3x?
>> Also, what causes the performance degradation on ext3 file systems
>> specifically?
>>
>> Thanks,
>>
>> -Matt Cheah