GC wouldn't necessarily result in errors - it could just be slowing down your job and causing the executor JVMs to stall. If you click on a stage in the UI, you should end up on a page with all the metrics concerning the tasks that ran in that stage. "GC Time" is one of these task metrics.
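If it helps, one way to corroborate the UI metric is to turn on GC logging for the executors. A minimal sketch, assuming a HotSpot JVM and the standard spark.executor.extraJavaOptions property (the GC output normally ends up in the executor stdout/stderr files collected by the YARN NodeManager):

    import org.apache.spark.SparkConf

    // Sketch only: emit GC details in the executor JVM logs so the UI's
    // "GC Time" task metric can be cross-checked against actual pause times.
    val conf = new SparkConf()
      .set("spark.executor.extraJavaOptions",
           "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")

The same flags can also be passed on the command line with --conf spark.executor.extraJavaOptions=... when submitting the job.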
-Sandy

On Thu, Aug 20, 2015 at 8:54 AM, Umesh Kacha <umesh.ka...@gmail.com> wrote:

> Hi, where do I see GC time in the UI? I have set
> spark.yarn.executor.memoryOverhead to 3500, which I believe should be
> good enough. So you mean only GC could be the reason behind the timeout?
> I checked the YARN logs and did not see any GC error there. Please
> guide. Thanks much.
>
> On Thu, Aug 20, 2015 at 8:14 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>
>> Moving this back onto user@
>>
>> Regarding GC, can you look in the web UI and see whether the "GC Time"
>> metric dominates the amount of time spent on each task (or at least the
>> tasks that aren't completing)?
>>
>> Also, have you tried bumping your spark.yarn.executor.memoryOverhead?
>> YARN may be killing your executors for using too much off-heap space.
>> You can see whether this is happening by looking in the Spark AM or
>> YARN NodeManager logs.
>>
>> -Sandy
>>
>> On Thu, Aug 20, 2015 at 7:39 AM, Umesh Kacha <umesh.ka...@gmail.com> wrote:
>>
>>> Hi, thanks much for the response. Yes, I tried the default setting of
>>> 0.2 too; it was also timing out. If it is spending time in GC, then
>>> why is it not throwing a GC error? I don't see any such error, and the
>>> YARN logs are not helpful at all. What is Tungsten and how do I use
>>> it? Spark is doing great, I believe: the job runs successfully until
>>> about 60% of tasks complete, and things only start going wrong after
>>> the first executor gets lost.
>>>
>>> On Aug 20, 2015 7:59 PM, "Sandy Ryza" <sandy.r...@cloudera.com> wrote:
>>>
>>>> What sounds most likely is that you're hitting heavy garbage
>>>> collection. Did you hit issues when the shuffle memory fraction was
>>>> at its default of 0.2? A potential danger with setting the shuffle
>>>> storage to 0.7 is that it allows shuffle objects to get into the GC
>>>> old generation, which triggers more stop-the-world garbage
>>>> collections.
>>>>
>>>> Have you tried enabling Tungsten / unsafe?
>>>>
>>>> Unfortunately, Spark is still not that great at dealing with
>>>> heavily-skewed shuffle data, because its reduce-side aggregation
>>>> still operates on Java objects instead of binary data.
>>>>
>>>> -Sandy
>>>>
>>>> On Thu, Aug 20, 2015 at 7:21 AM, Umesh Kacha <umesh.ka...@gmail.com> wrote:
>>>>
>>>>> Hi Sandy, thanks much for the response. I am using Spark 1.4.1 and I
>>>>> have set spark.shuffle.storage to 0.7, as my Spark job involves 4
>>>>> groupBy queries executed using hiveContext.sql. My data set is
>>>>> skewed, so I believe there will be more shuffling. I don't know
>>>>> what's wrong: the Spark job runs fine for almost an hour, and when
>>>>> the shuffle read/shuffle write column in the UI starts to show more
>>>>> than 10 GB, one executor starts getting lost because of a timeout,
>>>>> and slowly other executors start getting lost too. Please guide.
>>>>>
>>>>> On Aug 20, 2015 7:38 PM, "Sandy Ryza" <sandy.r...@cloudera.com> wrote:
>>>>>
>>>>>> What version of Spark are you using? Have you set any shuffle
>>>>>> configs?
>>>>>>
>>>>>> On Wed, Aug 19, 2015 at 11:46 AM, unk1102 <umesh.ka...@gmail.com> wrote:
>>>>>>
>>>>>>> I have one Spark job which seems to run fine, but after an hour or
>>>>>>> so executors start getting lost because of a timeout, with an
>>>>>>> error like the following:
>>>>>>>
>>>>>>> cluster.YarnScheduler: Removing an executor 14 650000 timeout
>>>>>>> exceeds 600000 seconds
>>>>>>>
>>>>>>> and because of the above error a couple of chained errors start to
>>>>>>> appear, like FetchFailedException, RPC client disassociated,
>>>>>>> Connection reset by peer, IOException, etc.
>>>>>>>
>>>>>>> Please see the attached UI screenshot. I have noticed that when
>>>>>>> shuffle read/write grows beyond 10 GB, executors start getting
>>>>>>> lost because of timeouts. How do I clear this 10 GB shown in the
>>>>>>> shuffle read/write section? I don't cache anything, so why is
>>>>>>> Spark not clearing that memory? Please guide.
>>>>>>>
>>>>>>> IMG_20150819_231418358.jpg
>>>>>>> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n24345/IMG_20150819_231418358.jpg>
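For reference, a hedged sketch of the knobs discussed in this thread, using Spark 1.4-era property names. The values are illustrative, not recommendations, and the Tungsten/unsafe flag in particular changed names between releases, so check the docs for your exact version:

    import org.apache.spark.SparkConf

    // Sketch only -- illustrative values, not tuning advice.
    val conf = new SparkConf()
      // Off-heap headroom per executor (MB); YARN kills containers that exceed it.
      .set("spark.yarn.executor.memoryOverhead", "3500")
      // Pre-1.6 memory fractions: 0.2 is the shuffle default, 0.6 the storage default.
      .set("spark.shuffle.memoryFraction", "0.2")
      .set("spark.storage.memoryFraction", "0.6")
      // Raise if executors are dropped during long shuffles (default is 120s).
      .set("spark.network.timeout", "600s")
    // Tungsten / unsafe: spark.sql.unsafe.enabled in 1.4, spark.sql.tungsten.enabled
    // in 1.5 (where it is on by default) -- verify against your release.

For the skewed groupBy queries themselves, a common workaround is two-stage aggregation over a salted key, so that no single reducer has to process an entire hot key. A minimal sketch against hiveContext.sql with hypothetical table and column names (my_table, key_col, value_col); it only applies to algebraic aggregates such as SUM/COUNT/MIN/MAX:

    // Stage 1: spread each key across 32 salt buckets and partially aggregate.
    val partial = hiveContext.sql(
      """SELECT key_col, salt, SUM(value_col) AS partial_sum
        |FROM (SELECT key_col, value_col,
        |             CAST(FLOOR(RAND() * 32) AS INT) AS salt
        |      FROM my_table) salted
        |GROUP BY key_col, salt""".stripMargin)
    partial.registerTempTable("partials")

    // Stage 2: combine the per-bucket partials into the final result.
    val result = hiveContext.sql(
      "SELECT key_col, SUM(partial_sum) AS total FROM partials GROUP BY key_col")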