I have a Spark job that seems to run fine, but after an hour or so executors start getting lost due to timeouts, with an error like the following:
    cluster.YarnScheduler: Removing executor 14: 650000 ms exceeds timeout 600000 ms

After that error a chain of follow-on errors starts to appear: FetchFailedException, "RPC client disassociated", "Connection reset by peer", IOException, etc. Please see the attached UI screenshot:

    http://apache-spark-user-list.1001560.n3.nabble.com/file/n24345/IMG_20150819_231418358.jpg

I have noticed that once the shuffle read/write grows beyond roughly 10 GB, executors start getting lost because of the timeout. How do I clear that 10 GB accumulated in the shuffle read/write section? I don't cache anything, so why is Spark not releasing that memory? Please guide.
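For reference, these are the kinds of timeout-related settings I have been experimenting with. The property names are from the Spark configuration docs, but the values below are placeholders rather than my exact job config, and the 600s network timeout is just my guess at where the 600000 ms in the log comes from:

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch of the settings I have been adjusting; values are placeholders.
    val conf = new SparkConf()
      .setAppName("shuffle-heavy-job")  // hypothetical app name
      // Give slow shuffle fetches more time before the executor is declared lost.
      .set("spark.network.timeout", "600s")
      // Heartbeat more often so a busy executor is less likely to miss the timeout.
      .set("spark.executor.heartbeatInterval", "60s")
      // Extra off-heap headroom (in MB) for the YARN containers during large shuffles.
      .set("spark.yarn.executor.memoryOverhead", "1024")

    val sc = new SparkContext(conf)

Even with the network timeout already at 600s, executors still get lost once the shuffle size passes ~10 GB, so I am not sure whether raising these values further is the right direction or whether I should instead be repartitioning the skewed keys.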