Sorry, I sent this to the dev list instead of user.  Please ignore.  I'll
re-post to the correct list.

Regards,
Art



On Thu, Jul 24, 2014 at 11:09 AM, Art Peel <found...@gmail.com> wrote:

> Our system works with RDDs generated from Hadoop files. It processes each
> record in a Hadoop file and, for a subset of those records, generates
> output that is written to an external system via RDD.foreach. There are no
> dependencies between the records that are processed.
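>
> Roughly, the pattern looks like this (just a sketch; the input path,
> shouldWrite and writeToExternalSystem are stand-ins for our real logic):
>
>   // assuming an existing SparkContext named sc
>   val rdd = sc.textFile("hdfs://namenode/path/to/input")  // Hadoop file -> RDD of records
>   rdd.foreach { record =>
>     if (shouldWrite(record))          // only a subset of records produce output
>       writeToExternalSystem(record)   // may throw, depending on what is being written
>   }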
>
> If writing to the external system fails (due to a detail of what is being
> written) and throws an exception, I see the following behavior:
>
> 1. Spark retries the entire partition (thus wasting time and effort),
> reaches the problem record and fails again.
> 2. It repeats step 1 until the configured number of retries is reached
> (see the config sketch after this list) and then gives up. As a result,
> the rest of the records from that Hadoop file are not processed.
> 3. The executor where the final attempt occurred is marked as failed and
> told to shut down, so I lose a core for processing the remaining Hadoop
> files, which slows down the entire process.
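>
> (For reference, I believe the retry count in step 2 is what
> spark.task.maxFailures controls; e.g. something like:)
>
>   import org.apache.spark.{SparkConf, SparkContext}
>
>   val conf = new SparkConf()
>     .setAppName("hadoop-to-external-writer")   // placeholder app name
>     .set("spark.task.maxFailures", "4")        // task failures allowed before Spark gives up
>   val sc = new SparkContext(conf)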
>
> For this particular problem, I know how to prevent the underlying
> exception, but I'd still like to get a handle on error handling for future
> situations that I haven't yet encountered.
>
> My goal is this:
> Retry the problem record only (rather than starting over at the beginning
> of the partition) up to N times, then give up and move on to process the
> rest of the partition.
>
> As far as I can tell, I need to supply my own retry behavior, and if I
> want to process records after the problem record, I have to swallow
> exceptions inside the foreach block.
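>
> Something like the sketch below is what I have in mind
> (writeToExternalSystem is again a stand-in, and the retry and logging
> details are only illustrative):
>
>   // retry a single record up to maxAttempts times, then log and move on
>   def writeWithRetry(record: String, maxAttempts: Int = 3): Unit = {
>     var attempts = 0
>     var done = false
>     while (!done && attempts < maxAttempts) {
>       try {
>         writeToExternalSystem(record)   // stand-in for our real write call
>         done = true
>       } catch {
>         case e: Exception =>
>           attempts += 1
>           if (attempts >= maxAttempts)
>             // swallow the failure so the rest of the partition is still processed
>             System.err.println(s"Giving up on record after $maxAttempts attempts: ${e.getMessage}")
>       }
>     }
>   }
>
>   rdd.foreach { record => writeWithRetry(record) }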
>
> My 2 questions are:
> 1. Is there anything I can do to prevent the executor where the final
> retry occurred from being shut down when a failure occurs?
>
> 2. Are there ways Spark can help me get closer to my goal of retrying only
> the problem record, without writing my own retry code and swallowing
> exceptions?
>
> Regards,
> Art
>
>
