If you use the same RDDs in both attempts to run the job, the stage outputs
(shuffle files) generated by the previous attempt will indeed be reused.
This applies to Spark Core, though. With DataFrames, depending on what you
do, the physical plan may be generated again, producing new RDDs and causing
all stages to be recomputed. Consider generating the RDD from the DataFrame
once and then running the job against that RDD.
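
For example, something along these lines (only a rough sketch; it assumes
the sqlContext that spark-shell provides in 1.x, and the paths and column
access are placeholders):

    // assumes spark-shell's sqlContext; paths are placeholders
    val df = sqlContext.read.parquet("/data/events")

    // Converting to an RDD once fixes the lineage; a retried job that
    // reuses this RDD can pick up the shuffle output of already-completed
    // stages instead of having the DataFrame planner build fresh RDDs.
    val rows = df.rdd
    val counts = rows.map(r => (r.getString(0), 1L)).reduceByKey(_ + _)
    counts.saveAsTextFile("/output/event_counts")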

Of course, you can use caching in both Core and DataFrames, which addresses
all of these concerns.
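
For instance (again only a sketch, with the same placeholder names):

    import org.apache.spark.storage.StorageLevel

    val df = sqlContext.read.parquet("/data/events")
    df.persist(StorageLevel.MEMORY_AND_DISK)   // or simply df.cache()
    df.count()                                 // first action materializes the cache

    // Later jobs over df (or df.rdd) read the cached blocks rather than
    // recomputing the upstream stages, whether you stay in DataFrames or
    // drop down to the RDD API.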

On Tue, Jul 28, 2015 at 1:03 PM, Alex Nastetsky <
alex.nastet...@vervemobile.com> wrote:

> Is it possible to restart the job from the last successful stage instead
> of from the beginning?
>
> For example, if your job has stages 0, 1 and 2 .. and stage 0 takes a long
> time and is successful, but the job fails on stage 1, it would be useful to
> be able to restart from the output of stage 0 instead of from the beginning.
>
> Note that I am NOT talking about Spark Streaming, just Spark Core (and
> DataFrames), not sure if the case would be different with Streaming.
>
> Thanks.
>
