Note that speculation is off by default to avoid these kinds of unexpected issues.
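For reference, speculative execution is governed by the `spark.speculation` setting (off by default, as noted above). A minimal `spark-defaults.conf` fragment to make the default explicit might look like:

```
spark.speculation    false
```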
On Sat, Mar 28, 2015 at 6:21 AM, Steve Loughran <ste...@hortonworks.com> wrote:
>
> It's worth adding that there's no guarantee that re-evaluated work would
> be on the same host as before, and in the case of node failure, it is not
> guaranteed to be elsewhere.
>
> This means that things which depend on host-local information are going to
> generate different numbers even if there are no other side effects. Random
> number generation for seeding RDD.sample() would be a case in point here.
>
> There's also the fact that if you enable speculative execution, then
> operations may be repeated, even in the absence of any failure. If you are
> doing side-effect work, or don't have an output committer whose actions are
> guaranteed to be atomic, then you want to turn that option off.
>
>
> On 27 Mar 2015, at 19:46, Patrick Wendell <pwend...@gmail.com> wrote:
> >
> > If you invoke this, you will get at-least-once semantics on failure.
> > For instance, if a machine dies in the middle of executing the foreach
> > for a single partition, that will be re-executed on another machine.
> > It could even fully complete on one machine, but the machine dies
> > immediately before reporting the result back to the driver.
> >
> > This means you need to make sure the side effects are idempotent, or
> > use some transactional locking. Spark's own output operations, such as
> > saving to Hadoop, use such mechanisms. For instance, in the case of
> > Hadoop it uses the OutputCommitter classes.
> >
> > - Patrick
> >
> > On Fri, Mar 27, 2015 at 12:36 PM, Michal Klos <michal.klo...@gmail.com> wrote:
> >> Hi Spark group,
> >>
> >> We haven't been able to find clear descriptions of how Spark handles the
> >> resiliency of RDDs in relation to executing actions with side effects.
> >> If you do an `rdd.foreach(someSideEffect)`, then you are performing a
> >> side effect for each element in the RDD. If a partition goes down, the
> >> resiliency rebuilds the data, but does it keep track of how far it got in
> >> the partition's set of data, or will it start from the beginning again? So
> >> will it do at-least-once execution of the foreach closures or at-most-once?
> >>
> >> thanks,
> >> Michal
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
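To make the at-least-once point above concrete, here is a minimal sketch (plain Python, not the Spark API; the sink structures are hypothetical) of why side effects in a foreach must be idempotent. Under failure retry or speculation, the task for a partition may run more than once; a sink keyed by record id absorbs the duplicate run, while a naive append does not.

```python
# Simulating a partition's task being executed twice (retry/speculation).

def naive_append_sink(sink, records):
    # Non-idempotent: re-running the task duplicates the side effects.
    for r in records:
        sink.append(r)

def idempotent_upsert_sink(sink, records):
    # Idempotent: keyed writes make a re-run a no-op.
    for key, value in records:
        sink[key] = value

partition = [(0, "a"), (1, "b"), (2, "c")]

appended = []
naive_append_sink(appended, partition)
naive_append_sink(appended, partition)       # task re-executed
print(len(appended))                         # 6: side effects duplicated

upserted = {}
idempotent_upsert_sink(upserted, partition)
idempotent_upsert_sink(upserted, partition)  # re-run is harmless
print(len(upserted))                         # 3: same result as one run
```

This is the same idea behind Hadoop's OutputCommitter mentioned above: tasks write to a staging location, and only one successful attempt's output is atomically committed.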