FW: High Availability of Spark Driver

Ashish Rawat Thu, 27 Aug 2015 00:42:50 -0700

Hi Patrick,

As discussed in another thread, we are looking for a solution to the problem of 
lost state on Spark Driver failure. Could you please share Spark's long-term 
strategy for resolving this problem?


<-- Original Mail Content Below -->

We have come across the problem of Spark applications (on Yarn) requiring a 
restart when the Spark Driver (or Application Master) goes down. This is 
hugely inconvenient for long-running applications that maintain a large 
state in memory. Repopulating that state may itself require a downtime of 
many minutes, which is not acceptable for most live systems.

As you may have noticed, the Yarn community has acknowledged "long running 
services" as an important class of use cases, and has identified and removed 
problems in running such services on Yarn:
http://hortonworks.com/blog/support-long-running-services-hadoop-yarn-clusters/

It would be great if Spark, which is the most important processing engine on 
Yarn, also identified the issues in running long-running Spark applications 
and either published recommendations or made framework changes to remove them. 
The need to keep the application running through Driver and Application Master 
failures seems to be an important requirement from this perspective. The two 
most compelling use cases are:

  1.  Huge state of historical data in Spark Streaming, required for stream 
      processing
  2.  Very large cached tables in Spark SQL (very close to our use case, where 
      we periodically cache RDDs and query them using Spark SQL)

In our analysis, for both of these use cases, a working HA solution can be 
built by:

  1.  Preserving the state of executors (not killing them on driver failure)
  2.  Persisting the metadata required by Spark SQL and the Block Manager
  3.  Restarting the Driver and reconnecting it to the AM and executors

This would preserve the cached data, enabling the application to come back 
quickly. It could be further extended to preserve and recover the 
computation state.
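
To make the three steps concrete, here is a toy model of the idea. It is only 
an illustrative sketch in plain Python, not Spark code: the Executor and Driver 
classes, the JSON metadata file, and the block IDs are all hypothetical names 
chosen for this example, and a real implementation would have to deal with RPC 
reconnection, the AM, and far richer metadata.

```python
import json
import os
import tempfile


class Executor:
    """Toy executor: step 1, it outlives the driver and keeps its cached blocks."""

    def __init__(self, executor_id):
        self.executor_id = executor_id
        self.blocks = {}  # block_id -> cached data


class Driver:
    """Toy driver whose block-location metadata is also written to disk."""

    def __init__(self, meta_path, executors):
        self.meta_path = meta_path
        self.executors = {e.executor_id: e for e in executors}
        self.block_locations = {}  # block_id -> executor_id

    def cache_block(self, executor_id, block_id, data):
        self.executors[executor_id].blocks[block_id] = data
        self.block_locations[block_id] = executor_id
        # Step 2: persist the metadata outside the driver's memory.
        with open(self.meta_path, "w") as f:
            json.dump(self.block_locations, f)

    @classmethod
    def recover(cls, meta_path, executors):
        # Step 3: a restarted driver reloads the metadata and reconnects
        # to whichever executors are still alive.
        driver = cls(meta_path, executors)
        with open(meta_path) as f:
            saved = json.load(f)
        # Keep only entries whose executor survived and still holds the block.
        driver.block_locations = {
            block_id: ex_id
            for block_id, ex_id in saved.items()
            if ex_id in driver.executors
            and block_id in driver.executors[ex_id].blocks
        }
        return driver

    def read_block(self, block_id):
        return self.executors[self.block_locations[block_id]].blocks[block_id]


def demo():
    meta_path = os.path.join(tempfile.mkdtemp(), "block_meta.json")
    executors = [Executor("exec-1"), Executor("exec-2")]

    driver = Driver(meta_path, executors)
    driver.cache_block("exec-1", "rdd_0_0", [1, 2, 3])
    driver.cache_block("exec-2", "rdd_0_1", [4, 5, 6])

    del driver  # simulate a driver crash; the executors are left untouched

    recovered = Driver.recover(meta_path, executors)
    return recovered.read_block("rdd_0_0"), sorted(recovered.block_locations)
```

In this sketch the recovered driver can serve the cached blocks immediately, 
without recomputing or reloading them, which is the behaviour we are asking for.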

I would appreciate your thoughts on this issue and on possible future 
directions.

Regards,
Ashish
