GC wouldn't necessarily result in errors - it could just be slowing down your job and causing the executor JVMs to stall. If you click on a stage in the UI, you should end up on a page with all the metrics concerning the tasks that ran in that stage. "GC Time" is one of these task metrics.
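If it helps, one way to corroborate the UI metric is to turn on GC logging for the executors. A minimal sketch, assuming a HotSpot JVM and the standard spark.executor.extraJavaOptions property (the GC output normally ends up in the executor stdout/stderr files collected by the YARN NodeManager):

    import org.apache.spark.SparkConf

    // Sketch only: emit GC details in the executor JVM logs so the UI's
    // "GC Time" task metric can be cross-checked against actual pause times.
    val conf = new SparkConf()
      .set("spark.executor.extraJavaOptions",
           "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")

The same flags can also be passed on the command line with --conf spark.executor.extraJavaOptions=... when submitting the job.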
-Sandy

On Thu, Aug 20, 2015 at 8:54 AM, Umesh Kacha <umesh.ka...@gmail.com> wrote:

> Hi, where do I see GC time in the UI? I have set
> spark.yarn.executor.memoryOverhead to 3500, which I believe should be
> good enough. So you mean only GC could be the reason behind the timeout?
> I checked the YARN logs and did not see any GC error there. Please
> guide. Thanks much.
>
> On Thu, Aug 20, 2015 at 8:14 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>
>> Moving this back onto user@
>>
>> Regarding GC, can you look in the web UI and see whether the "GC Time"
>> metric dominates the amount of time spent on each task (or at least the
>> tasks that aren't completing)?
>>
>> Also, have you tried bumping your spark.yarn.executor.memoryOverhead?
>> YARN may be killing your executors for using too much off-heap space.
>> You can see whether this is happening by looking in the Spark AM or
>> YARN NodeManager logs.
>>
>> -Sandy
>>
>> On Thu, Aug 20, 2015 at 7:39 AM, Umesh Kacha <umesh.ka...@gmail.com> wrote:
>>
>>> Hi, thanks much for the response. Yes, I tried the default setting of
>>> 0.2 too; it was also timing out. If it is spending time in GC, then
>>> why is it not throwing a GC error? I don't see any such error, and the
>>> YARN logs are not helpful at all. What is Tungsten and how do I use
>>> it? Spark is doing great, I believe: the job runs successfully until
>>> about 60% of tasks complete, and things only start going wrong after
>>> the first executor gets lost.
>>>
>>> On Aug 20, 2015 7:59 PM, "Sandy Ryza" <sandy.r...@cloudera.com> wrote:
>>>
>>>> What sounds most likely is that you're hitting heavy garbage
>>>> collection. Did you hit issues when the shuffle memory fraction was
>>>> at its default of 0.2? A potential danger with setting the shuffle
>>>> storage to 0.7 is that it allows shuffle objects to get into the GC
>>>> old generation, which triggers more stop-the-world garbage
>>>> collections.
>>>>
>>>> Have you tried enabling Tungsten / unsafe?
>>>>
>>>> Unfortunately, Spark is still not that great at dealing with
>>>> heavily-skewed shuffle data, because its reduce-side aggregation
>>>> still operates on Java objects instead of binary data.
>>>>
>>>> -Sandy
>>>>
>>>> On Thu, Aug 20, 2015 at 7:21 AM, Umesh Kacha <umesh.ka...@gmail.com> wrote:
>>>>
>>>>> Hi Sandy, thanks much for the response. I am using Spark 1.4.1 and I
>>>>> have set spark.shuffle.storage to 0.7, as my Spark job involves 4
>>>>> groupBy queries executed using hiveContext.sql. My data set is
>>>>> skewed, so I believe there will be more shuffling. I don't know
>>>>> what's wrong: the Spark job runs fine for almost an hour, and when
>>>>> the shuffle read/shuffle write column in the UI starts to show more
>>>>> than 10 GB, one executor starts getting lost because of a timeout,
>>>>> and slowly other executors start getting lost too. Please guide.
>>>>>
>>>>> On Aug 20, 2015 7:38 PM, "Sandy Ryza" <sandy.r...@cloudera.com> wrote:
>>>>>
>>>>>> What version of Spark are you using? Have you set any shuffle
>>>>>> configs?
>>>>>>
>>>>>> On Wed, Aug 19, 2015 at 11:46 AM, unk1102 <umesh.ka...@gmail.com> wrote:
>>>>>>
>>>>>>> I have one Spark job which seems to run fine, but after an hour or
>>>>>>> so executors start getting lost because of a timeout, with an
>>>>>>> error like the following:
>>>>>>>
>>>>>>> cluster.YarnScheduler: Removing an executor 14 650000 timeout
>>>>>>> exceeds 600000 seconds
>>>>>>>
>>>>>>> and because of the above error a couple of chained errors start to
>>>>>>> appear, like FetchFailedException, RPC client disassociated,
>>>>>>> Connection reset by peer, IOException, etc.
>>>>>>>
>>>>>>> Please see the attached UI screenshot. I have noticed that when
>>>>>>> shuffle read/write grows beyond 10 GB, executors start getting
>>>>>>> lost because of timeouts. How do I clear this 10 GB shown in the
>>>>>>> shuffle read/write section? I don't cache anything, so why is
>>>>>>> Spark not clearing that memory? Please guide.
>>>>>>>
>>>>>>> IMG_20150819_231418358.jpg
>>>>>>> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n24345/IMG_20150819_231418358.jpg>
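For reference, a hedged sketch of the knobs discussed in this thread, using Spark 1.4-era property names. The values are illustrative, not recommendations, and the Tungsten/unsafe flag in particular changed names between releases, so check the docs for your exact version:

    import org.apache.spark.SparkConf

    // Sketch only -- illustrative values, not tuning advice.
    val conf = new SparkConf()
      // Off-heap headroom per executor (MB); YARN kills containers that exceed it.
      .set("spark.yarn.executor.memoryOverhead", "3500")
      // Pre-1.6 memory fractions: 0.2 is the shuffle default, 0.6 the storage default.
      .set("spark.shuffle.memoryFraction", "0.2")
      .set("spark.storage.memoryFraction", "0.6")
      // Raise if executors are dropped during long shuffles (default is 120s).
      .set("spark.network.timeout", "600s")
    // Tungsten / unsafe: spark.sql.unsafe.enabled in 1.4, spark.sql.tungsten.enabled
    // in 1.5 (where it is on by default) -- verify against your release.

For the skewed groupBy queries themselves, a common workaround is two-stage aggregation over a salted key, so that no single reducer has to process an entire hot key. A minimal sketch against hiveContext.sql with hypothetical table and column names (my_table, key_col, value_col); it only applies to algebraic aggregates such as SUM/COUNT/MIN/MAX:

    // Stage 1: spread each key across 32 salt buckets and partially aggregate.
    val partial = hiveContext.sql(
      """SELECT key_col, salt, SUM(value_col) AS partial_sum
        |FROM (SELECT key_col, value_col,
        |             CAST(FLOOR(RAND() * 32) AS INT) AS salt
        |      FROM my_table) salted
        |GROUP BY key_col, salt""".stripMargin)
    partial.registerTempTable("partials")

    // Stage 2: combine the per-bucket partials into the final result.
    val result = hiveContext.sql(
      "SELECT key_col, SUM(partial_sum) AS total FROM partials GROUP BY key_col")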