Hi, where do I see GC time in the UI? I have set spark.yarn.executor.memoryOverhead to 3500, which I believe should be good enough. So you mean only GC could be the reason behind the timeout? I checked the YARN logs and did not see any GC error there. Please guide. Thanks much.
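(For anyone finding this thread later: per-task GC time appears as the "GC Time" column in each stage's task table in the web UI, and per-executor totals on the Executors tab. To see GC activity in the executor logs even when no OutOfMemoryError is thrown, verbose GC logging can be enabled via the executor JVM options. A rough sketch, where the jar name and memory values are placeholders, not recommendations:)

```shell
# Hypothetical spark-submit sketch: print GC details to each executor's
# stdout so long pauses are visible even without an OutOfMemoryError.
spark-submit \
  --master yarn-client \
  --conf spark.yarn.executor.memoryOverhead=3500 \
  --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  my_job.jar
```

(The GC output then shows up in each executor's stdout, reachable from the Executors tab or via the YARN log aggregation.)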
On Thu, Aug 20, 2015 at 8:14 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:

> Moving this back onto user@
>
> Regarding GC, can you look in the web UI and see whether the "GC time"
> metric dominates the amount of time spent on each task (or at least the
> tasks that aren't completing)?
>
> Also, have you tried bumping your spark.yarn.executor.memoryOverhead?
> YARN may be killing your executors for using too much off-heap space. You
> can see whether this is happening by looking in the Spark AM or YARN
> NodeManager logs.
>
> -Sandy
>
> On Thu, Aug 20, 2015 at 7:39 AM, Umesh Kacha <umesh.ka...@gmail.com>
> wrote:
>
>> Hi, thanks much for the response. Yes, I tried the default setting of
>> 0.2 too; it was also going into timeout. If it is spending time in GC,
>> then why is it not throwing a GC error? I don't see any such error, and
>> the YARN logs are not helpful at all. What is Tungsten and how do I use
>> it? I believe Spark is doing great: my job runs successfully and 60% of
>> tasks complete; only after the first executor gets lost do things start
>> going wrong.
>>
>> On Aug 20, 2015 7:59 PM, "Sandy Ryza" <sandy.r...@cloudera.com> wrote:
>>
>>> What sounds most likely is that you're hitting heavy garbage
>>> collection. Did you hit issues when the shuffle memory fraction was at
>>> its default of 0.2? A potential danger with setting the shuffle
>>> fraction to 0.7 is that it allows shuffle objects to get into the GC
>>> old generation, which triggers more stop-the-world garbage collections.
>>>
>>> Have you tried enabling Tungsten / unsafe?
>>>
>>> Unfortunately, Spark is still not that great at dealing with
>>> heavily-skewed shuffle data, because its reduce-side aggregation still
>>> operates on Java objects instead of binary data.
>>>
>>> -Sandy
>>>
>>> On Thu, Aug 20, 2015 at 7:21 AM, Umesh Kacha <umesh.ka...@gmail.com>
>>> wrote:
>>>
>>>> Hi Sandy, thanks much for the response.
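(The two changes Sandy suggests can be sketched roughly as below. The values and the application id are placeholders; spark.sql.unsafe.enabled is assumed here as the Spark 1.4-era flag for the Tungsten/unsafe code paths, and the flag name differs in other versions:)

```shell
# Hypothetical sketch: give the YARN container more off-heap headroom
# and enable the unsafe/Tungsten code paths (flag name is an assumption
# for Spark 1.4.x).
spark-submit \
  --master yarn-client \
  --conf spark.yarn.executor.memoryOverhead=4096 \
  --conf spark.sql.unsafe.enabled=true \
  my_job.jar

# If YARN killed the executor for exceeding container memory, the
# aggregated NodeManager logs will say so explicitly:
yarn logs -applicationId <application_id> | grep -i "running beyond"
```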
>>>> I am using Spark 1.4.1, and I have set spark.shuffle.memoryFraction to
>>>> 0.7 because my Spark job involves 4 groupBy queries executed using
>>>> hiveContext.sql. My data set is skewed, so there will be more
>>>> shuffling, I believe. I don't know what's wrong: the Spark job runs
>>>> fine for almost an hour, and when the shuffle read / shuffle write
>>>> column in the UI starts to show more than 10 GB, executors start
>>>> getting lost because of timeouts, and slowly the other executors start
>>>> getting lost too. Please guide.
>>>>
>>>> On Aug 20, 2015 7:38 PM, "Sandy Ryza" <sandy.r...@cloudera.com> wrote:
>>>>
>>>>> What version of Spark are you using? Have you set any shuffle
>>>>> configs?
>>>>>
>>>>> On Wed, Aug 19, 2015 at 11:46 AM, unk1102 <umesh.ka...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I have one Spark job which seems to run fine, but after an hour or
>>>>>> so executors start getting lost because of a timeout, with an error
>>>>>> like the following:
>>>>>>
>>>>>> cluster.yarnScheduler : Removing an executor 14 650000 timeout
>>>>>> exceeds 600000 seconds
>>>>>>
>>>>>> Because of the above error, a couple of chained errors start to
>>>>>> come, like FetchFailedException, Rpc client disassociated,
>>>>>> Connection reset by peer, IOException, etc.
>>>>>>
>>>>>> Please see the following UI page. I have noticed that when shuffle
>>>>>> read/write starts to increase to more than 10 GB, executors start
>>>>>> getting lost because of timeouts. How do I clear this stacked 10 GB
>>>>>> of memory in the shuffle read/write section? I don't cache anything,
>>>>>> so why is Spark not clearing that memory? Please guide.
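(A common stopgap for this symptom, while the underlying GC/skew problem is being addressed, is to raise the heartbeat-related timeouts so that a long GC pause alone does not get an executor declared dead. A hedged sketch; the values are illustrative, not recommendations:)

```shell
# Hypothetical sketch: tolerate longer pauses before the driver removes
# an executor. spark.network.timeout is the umbrella timeout; the
# heartbeat interval should stay well below it.
spark-submit \
  --master yarn-client \
  --conf spark.network.timeout=600s \
  --conf spark.executor.heartbeatInterval=60s \
  my_job.jar
```

(Note this only masks the symptom: with heavily skewed groupBy keys, the hot partitions still concentrate on a few executors, so the real fix is reducing the skew itself, e.g. by pre-aggregating or salting the hot keys.)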
>>>>>>
>>>>>> IMG_20150819_231418358.jpg
>>>>>> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n24345/IMG_20150819_231418358.jpg>
>>>>>>
>>>>>> --
>>>>>> View this message in context:
>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-avoid-executor-time-out-on-yarn-spark-while-dealing-with-large-shuffle-skewed-data-tp24345.html
>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>> Nabble.com.
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>>>> For additional commands, e-mail: user-h...@spark.apache.org