In Spark 1.1, the sort-based shuffle (spark.shuffle.manager=sort) will have better performance while creating fewer files. So I'd suggest trying that too.
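For reference, here's a rough sketch of how you'd set this in a SparkConf (the app name is just a placeholder; spark.shuffle.consolidateFiles only matters if you stay on the hash-based shuffle):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("ShuffleConfigExample")  // hypothetical app name
      // Use the sort-based shuffle added in Spark 1.1; it creates far fewer
      // files per map task than the hash-based shuffle.
      .set("spark.shuffle.manager", "sort")
      // If you remain on the hash-based shuffle, consolidating files reduces
      // the number of open file handles, at some cost on ext3.
      .set("spark.shuffle.consolidateFiles", "true")

    val sc = new SparkContext(conf)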
Matei

> On Nov 3, 2014, at 6:12 PM, Andrew Or <and...@databricks.com> wrote:
>
> Hey Matt,
>
> There's some prior work that compares consolidation performance on a
> medium-scale workload:
> http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf
>
> There we noticed about a 2x performance degradation in the reduce phase on
> ext3. I am not aware of any other concrete numbers. Maybe others have more
> experiences to add.
>
> -Andrew
>
> 2014-11-03 17:26 GMT-08:00 Matt Cheah <mch...@palantir.com>:
>
>> Hi everyone,
>>
>> I'm running into more and more cases where too many files are opened when
>> spark.shuffle.consolidateFiles is turned off.
>>
>> I was wondering if this is a common scenario among the rest of the
>> community, and if so, whether it is worth turning the setting on by
>> default. From the documentation, it seems like performance could be hurt
>> on ext3 file systems. However, what concrete numbers of performance
>> degradation are typically seen? A 2x slowdown in the average job? 3x?
>> Also, what causes the performance degradation on ext3 file systems
>> specifically?
>>
>> Thanks,
>>
>> -Matt Cheah