Hi, where do I see GC time in the UI? I have set spark.yarn.executor.memoryOverhead to 3500, which I believe should be good enough. So you mean only GC could be the reason behind the timeout? I checked the YARN logs and did not see any GC error there. Please guide. Thanks much.
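(For anyone finding this thread later: per-task GC time appears as the "GC Time" column in each stage's task table in the web UI, and per-executor totals on the Executors tab. To see GC activity in the executor logs even when no OutOfMemoryError is thrown, verbose GC logging can be enabled via the executor JVM options. A rough sketch, where the jar name and memory values are placeholders, not recommendations:)

```shell
# Hypothetical spark-submit sketch: print GC details to each executor's
# stdout so long pauses are visible even without an OutOfMemoryError.
spark-submit \
  --master yarn-client \
  --conf spark.yarn.executor.memoryOverhead=3500 \
  --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  my_job.jar
```

(The GC output then shows up in each executor's stdout, reachable from the Executors tab or via the YARN log aggregation.)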
On Thu, Aug 20, 2015 at 8:14 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:

> Moving this back onto user@
>
> Regarding GC, can you look in the web UI and see whether the "GC time"
> metric dominates the amount of time spent on each task (or at least the
> tasks that aren't completing)?
>
> Also, have you tried bumping your spark.yarn.executor.memoryOverhead?
> YARN may be killing your executors for using too much off-heap space. You
> can see whether this is happening by looking in the Spark AM or YARN
> NodeManager logs.
>
> -Sandy
>
> On Thu, Aug 20, 2015 at 7:39 AM, Umesh Kacha <umesh.ka...@gmail.com>
> wrote:
>
>> Hi, thanks much for the response. Yes, I tried the default setting of
>> 0.2 too; it was also going into timeout. If it is spending time in GC,
>> then why is it not throwing a GC error? I don't see any such error, and
>> the YARN logs are not helpful at all. What is Tungsten and how do I use
>> it? I believe Spark is doing great: my job runs successfully and 60% of
>> tasks complete; only after the first executor gets lost do things start
>> going wrong.
>>
>> On Aug 20, 2015 7:59 PM, "Sandy Ryza" <sandy.r...@cloudera.com> wrote:
>>
>>> What sounds most likely is that you're hitting heavy garbage
>>> collection. Did you hit issues when the shuffle memory fraction was at
>>> its default of 0.2? A potential danger with setting the shuffle
>>> fraction to 0.7 is that it allows shuffle objects to get into the GC
>>> old generation, which triggers more stop-the-world garbage collections.
>>>
>>> Have you tried enabling Tungsten / unsafe?
>>>
>>> Unfortunately, Spark is still not that great at dealing with
>>> heavily-skewed shuffle data, because its reduce-side aggregation still
>>> operates on Java objects instead of binary data.
>>>
>>> -Sandy
>>>
>>> On Thu, Aug 20, 2015 at 7:21 AM, Umesh Kacha <umesh.ka...@gmail.com>
>>> wrote:
>>>
>>>> Hi Sandy, thanks much for the response.
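(The two changes Sandy suggests can be sketched roughly as below. The values and the application id are placeholders; spark.sql.unsafe.enabled is assumed here as the Spark 1.4-era flag for the Tungsten/unsafe code paths, and the flag name differs in other versions:)

```shell
# Hypothetical sketch: give the YARN container more off-heap headroom
# and enable the unsafe/Tungsten code paths (flag name is an assumption
# for Spark 1.4.x).
spark-submit \
  --master yarn-client \
  --conf spark.yarn.executor.memoryOverhead=4096 \
  --conf spark.sql.unsafe.enabled=true \
  my_job.jar

# If YARN killed the executor for exceeding container memory, the
# aggregated NodeManager logs will say so explicitly:
yarn logs -applicationId <application_id> | grep -i "running beyond"
```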
>>>> I am using Spark 1.4.1, and I have set spark.shuffle.memoryFraction to
>>>> 0.7 because my Spark job involves 4 groupBy queries executed using
>>>> hiveContext.sql. My data set is skewed, so there will be more
>>>> shuffling, I believe. I don't know what's wrong: the Spark job runs
>>>> fine for almost an hour, and when the shuffle read / shuffle write
>>>> column in the UI starts to show more than 10 GB, executors start
>>>> getting lost because of timeouts, and slowly the other executors start
>>>> getting lost too. Please guide.
>>>>
>>>> On Aug 20, 2015 7:38 PM, "Sandy Ryza" <sandy.r...@cloudera.com> wrote:
>>>>
>>>>> What version of Spark are you using? Have you set any shuffle
>>>>> configs?
>>>>>
>>>>> On Wed, Aug 19, 2015 at 11:46 AM, unk1102 <umesh.ka...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I have one Spark job which seems to run fine, but after an hour or
>>>>>> so executors start getting lost because of a timeout, with an error
>>>>>> like the following:
>>>>>>
>>>>>> cluster.yarnScheduler : Removing an executor 14 650000 timeout
>>>>>> exceeds 600000 seconds
>>>>>>
>>>>>> Because of the above error, a couple of chained errors start to
>>>>>> come, like FetchFailedException, Rpc client disassociated,
>>>>>> Connection reset by peer, IOException, etc.
>>>>>>
>>>>>> Please see the following UI page. I have noticed that when shuffle
>>>>>> read/write starts to increase to more than 10 GB, executors start
>>>>>> getting lost because of timeouts. How do I clear this stacked 10 GB
>>>>>> of memory in the shuffle read/write section? I don't cache anything,
>>>>>> so why is Spark not clearing that memory? Please guide.
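(A common stopgap for this symptom, while the underlying GC/skew problem is being addressed, is to raise the heartbeat-related timeouts so that a long GC pause alone does not get an executor declared dead. A hedged sketch; the values are illustrative, not recommendations:)

```shell
# Hypothetical sketch: tolerate longer pauses before the driver removes
# an executor. spark.network.timeout is the umbrella timeout; the
# heartbeat interval should stay well below it.
spark-submit \
  --master yarn-client \
  --conf spark.network.timeout=600s \
  --conf spark.executor.heartbeatInterval=60s \
  my_job.jar
```

(Note this only masks the symptom: with heavily skewed groupBy keys, the hot partitions still concentrate on a few executors, so the real fix is reducing the skew itself, e.g. by pre-aggregating or salting the hot keys.)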
>>>>>>
>>>>>> IMG_20150819_231418358.jpg
>>>>>> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n24345/IMG_20150819_231418358.jpg>
>>>>>>
>>>>>> --
>>>>>> View this message in context:
>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-avoid-executor-time-out-on-yarn-spark-while-dealing-with-large-shuffle-skewed-data-tp24345.html
>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>> Nabble.com.
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>>>> For additional commands, e-mail: user-h...@spark.apache.org