Hey Andrew, Matei,

Thanks for responding.
For some more context, we were running into "Too many open files" issues that happened immediately after the Collect phase (about 30 seconds into a run) on a decently sized dataset (14 MM rows). The ulimit set in spark-env was 256,000, which we believe should have been enough, but even with it set at that number we were still seeing issues. Can you comment on what a "good" ulimit should be in these cases?

We believe what might have caused this is that some process got orphaned without cleaning up its open file handles. However, other than anecdotal evidence and some speculation, we don't have much to expand on this further.

We were also wondering if we could get some more information about how many files get opened during a shuffle. We discussed that it is going to be around N x M, where N is the number of tasks and M is the number of reducers. Does this sound about right? Are there any other considerations we should be aware of when setting consolidateFiles to true? (A rough sketch of these settings and the file-count arithmetic follows below the quoted thread.)

Thanks,

Zach Fry
Palantir | Developer Support Engineer
z...@palantir.com | 650.226.6338

On 11/3/14, 6:28:09 PM, "Matei Zaharia" <matei.zaha...@gmail.com> wrote:

>In Spark 1.1, the sort-based shuffle (spark.shuffle.manager=sort) will
>have better performance while creating fewer files. So I'd suggest trying
>that too.
>
>Matei
>
>> On Nov 3, 2014, at 6:12 PM, Andrew Or <and...@databricks.com> wrote:
>>
>> Hey Matt,
>>
>> There's some prior work that compares consolidation performance on a
>> medium-scale workload:
>>
>> http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf
>>
>> There we noticed about a 2x performance degradation in the reduce phase
>> on ext3. I am not aware of any other concrete numbers. Maybe others have
>> more experiences to add.
>>
>> -Andrew
>>
>> 2014-11-03 17:26 GMT-08:00 Matt Cheah <mch...@palantir.com>:
>>
>>> Hi everyone,
>>>
>>> I'm running into more and more cases where too many files are opened
>>> when spark.shuffle.consolidateFiles is turned off.
>>>
>>> I was wondering if this is a common scenario among the rest of the
>>> community, and if so, whether it is worth turning the setting on by
>>> default. From the documentation, it seems like performance could be
>>> hurt on ext3 file systems. However, what concrete performance
>>> degradation is typically seen? A 2x slowdown in the average job? 3x?
>>> Also, what causes the performance degradation on ext3 file systems
>>> specifically?
>>>
>>> Thanks,
>>>
>>> -Matt Cheah
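To make the file-count discussion above concrete, here is a minimal sketch of how the two settings mentioned in the thread (spark.shuffle.consolidateFiles and spark.shuffle.manager) might be applied via SparkConf in Spark 1.1, together with a back-of-envelope estimate of shuffle file counts. Only the property names come from the messages above; the task, reducer, and core numbers are made-up examples, and the per-executor cores x reducers figure for consolidation reflects a common understanding of the hash shuffle rather than anything stated in the thread.

    import org.apache.spark.SparkConf

    // Minimal sketch, not from the thread: shows where the shuffle settings
    // discussed above would be set, plus rough file-count arithmetic.
    object ShuffleFileSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("shuffle-config-sketch")
          // Hash-based shuffle with consolidation, as asked about above:
          .set("spark.shuffle.consolidateFiles", "true")
          // Alternatively, the sort-based shuffle Matei suggests:
          // .set("spark.shuffle.manager", "sort")

        // Hypothetical sizes -- substitute the values from your own job.
        val mapTasks         = 2000L  // N: number of map tasks
        val reducers         = 200L   // M: number of reduce partitions
        val coresPerExecutor = 8L     // concurrent task slots per executor (assumption)

        // Hash shuffle without consolidation: roughly one file per (map task, reducer) pair.
        val withoutConsolidation = mapTasks * reducers
        // With consolidation (assumed behavior): map tasks on the same core reuse a
        // file group, so roughly cores x reducers files per executor.
        val withConsolidationPerExecutor = coresPerExecutor * reducers

        println(s"~$withoutConsolidation shuffle files total without consolidation")
        println(s"~$withConsolidationPerExecutor shuffle files per executor with consolidation")
        println(s"spark.shuffle.consolidateFiles = ${conf.get("spark.shuffle.consolidateFiles")}")
      }
    }

Whatever the estimate, it still has to fit under the per-process ulimit alongside sockets and the other descriptors an executor keeps open, which may be part of why 256,000 was not enough in the case described above.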