Got it, thanks. Making sure everything is idempotent is definitely a critical piece for peace of mind.
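To make the idempotency point concrete, here is a minimal plain-Scala sketch (no Spark dependency; `IdempotentSink` and its `write` method are hypothetical names, not a Spark API): writes keyed by a stable record id are safe to replay, so a re-executed or speculative duplicate of the same partition leaves the final state unchanged.

```scala
import scala.collection.mutable

object IdempotentSink {
  // Writes keyed by a stable record id: replaying the same records is a no-op.
  val store = mutable.Map.empty[String, Int]

  def write(key: String, value: Int): Unit = store(key) = value

  def main(args: Array[String]): Unit = {
    val partition = Seq("a" -> 1, "b" -> 2, "c" -> 3)
    partition.foreach { case (k, v) => write(k, v) } // original task attempt
    partition.foreach { case (k, v) => write(k, v) } // replayed/speculative attempt
    println(store.size) // still 3: the duplicate attempt changed nothing
  }
}
```

This is the same idea Spark's Hadoop output path gets from OutputCommitter: duplicate attempts must not change the committed result.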
On Sat, Mar 28, 2015 at 1:47 PM, Aaron Davidson <ilike...@gmail.com> wrote:
> Note that speculation is off by default to avoid these kinds of unexpected
> issues.
>
> On Sat, Mar 28, 2015 at 6:21 AM, Steve Loughran <ste...@hortonworks.com>
> wrote:
>>
>> It's worth adding that there's no guarantee that re-evaluated work would
>> be on the same host as before, and in the case of node failure, it is not
>> guaranteed to be elsewhere.
>>
>> This means things that depend on host-local information are going to
>> generate different numbers even if there are no other side effects. Random
>> number generation for seeding RDD.sample() would be a case in point here.
>>
>> There's also the fact that if you enable speculative execution, then
>> operations may be repeated, even in the absence of any failure. If you are
>> doing side-effect work, or don't have an output committer whose actions are
>> guaranteed to be atomic, then you want to turn that option off.
>>
>> > On 27 Mar 2015, at 19:46, Patrick Wendell <pwend...@gmail.com> wrote:
>> >
>> > If you invoke this, you will get at-least-once semantics on failure.
>> > For instance, if a machine dies in the middle of executing the foreach
>> > for a single partition, that will be re-executed on another machine.
>> > It could even fully complete on one machine, but the machine dies
>> > immediately before reporting the result back to the driver.
>> >
>> > This means you need to make sure the side effects are idempotent, or
>> > use some transactional locking. Spark's own output operations, such as
>> > saving to Hadoop, use such mechanisms. For instance, in the case of
>> > Hadoop it uses the OutputCommitter classes.
>> >
>> > - Patrick
>> >
>> > On Fri, Mar 27, 2015 at 12:36 PM, Michal Klos <michal.klo...@gmail.com> wrote:
>> >> Hi Spark group,
>> >>
>> >> We haven't been able to find clear descriptions of how Spark handles the
>> >> resiliency of RDDs in relationship to executing actions with side effects.
>> >> If you do an `rdd.foreach(someSideEffect)`, then you are doing a side effect
>> >> for each element in the RDD. If a partition goes down, the resiliency
>> >> rebuilds the data, but did it keep track of how far it got in the
>> >> partition's set of data, or will it start from the beginning again? So will
>> >> it do at-least-once execution of foreach closures or at-most-once?
>> >>
>> >> thanks,
>> >> Michal
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: user-h...@spark.apache.org
>> >
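Patrick's at-least-once point above can be sketched in plain Scala without Spark (the `AtLeastOnce` object and its sinks are hypothetical, for illustration only): replaying a partition, as failure recovery or speculation would, double-counts with a blind append-style side effect but is harmless with an idempotent keyed one.

```scala
import scala.collection.mutable

object AtLeastOnce {
  val partition = Seq("a", "b", "c")

  // Replay the same partition `attempts` times, as recovery or speculation
  // would, and report what each kind of sink ends up holding.
  def run(attempts: Int): (Int, Int) = {
    val appendLog = mutable.ArrayBuffer.empty[String] // non-idempotent: blind append
    val upserted  = mutable.Set.empty[String]         // idempotent: keyed upsert
    (1 to attempts).foreach { _ =>
      partition.foreach { e =>
        appendLog += e
        upserted += e
      }
    }
    (appendLog.size, upserted.size)
  }

  def main(args: Array[String]): Unit = {
    val (appended, upserted) = run(attempts = 2)
    // append-only sink sees 6 records; keyed sink sees 3
    println(s"append-only sink: $appended records, keyed sink: $upserted records")
  }
}
```

So under `rdd.foreach`'s at-least-once semantics, the answer to Michal's question is that the closure can run more than once per element, and only the idempotent (or transactionally committed) variant gives the same final result either way.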