If it's not resilient at the Spark level, can't you just relaunch your job with your orchestration tool?
On 21 Dec 2017 at 09:34, "Georg Heiler" <georg.kf.hei...@gmail.com> wrote:

> Did you try to use the YARN shuffle service?
>
> chopinxb <chopi...@gmail.com> wrote on Thu, 21 Dec 2017 at 04:43:
>
>> In my experience with Spark applications (mostly Spark SQL), when there is a
>> complete node failure in the cluster, jobs that have shuffle blocks on that
>> node fail entirely after 4 task retries. It seems that data lineage
>> didn't work. What's more, our applications run multiple SQL statements for
>> data analysis. Having an entire application fail after a lengthy
>> computation because of a single job failure is unacceptable, so in some
>> ways we care more about stability than speed.
>>
>> --
>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
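For reference, a minimal sketch of the configuration the YARN shuffle service suggestion points at, plus the retry knobs behind the "4 task retries" mentioned above. The values shown (8 retries, the application name `my_app.py`) are illustrative assumptions, not recommendations from this thread:

```shell
# Sketch: enable the external shuffle service and raise retry limits.
# Prerequisite on each YARN NodeManager (yarn-site.xml):
#   yarn.nodemanager.aux-services includes spark_shuffle
#   yarn.nodemanager.aux-services.spark_shuffle.class =
#       org.apache.spark.network.yarn.YarnShuffleService
# with the spark-<version>-yarn-shuffle.jar on the NodeManager classpath.

# spark.task.maxFailures defaults to 4, matching the failure mode above.
# spark.stage.maxConsecutiveAttempts (Spark >= 2.2) caps stage retries
# after shuffle fetch failures.
spark-submit \
  --master yarn \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.task.maxFailures=8 \
  --conf spark.stage.maxConsecutiveAttempts=8 \
  my_app.py
```

Note that the external shuffle service serves shuffle blocks independently of executors, so it helps when an executor dies but its host survives; after a whole-node failure the shuffle files on that node are still lost, and Spark must recompute them via lineage within the retry limits above.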