Got it, thanks. Making sure everything is idempotent is definitely a critical piece for peace of mind.
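To make the idempotency point concrete, here is a minimal plain-Scala sketch (no Spark dependency; `IdempotentSink` and its `write` method are hypothetical names, not a Spark API): writes keyed by a stable record id are safe to replay, so a re-executed or speculative duplicate of the same partition leaves the final state unchanged.

```scala
import scala.collection.mutable

object IdempotentSink {
  // Writes keyed by a stable record id: replaying the same records is a no-op.
  val store = mutable.Map.empty[String, Int]

  def write(key: String, value: Int): Unit = store(key) = value

  def main(args: Array[String]): Unit = {
    val partition = Seq("a" -> 1, "b" -> 2, "c" -> 3)
    partition.foreach { case (k, v) => write(k, v) } // original task attempt
    partition.foreach { case (k, v) => write(k, v) } // replayed/speculative attempt
    println(store.size) // still 3: the duplicate attempt changed nothing
  }
}
```

This is the same idea Spark's Hadoop output path gets from OutputCommitter: duplicate attempts must not change the committed result.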
On Sat, Mar 28, 2015 at 1:47 PM, Aaron Davidson <ilike...@gmail.com> wrote:
> Note that speculation is off by default to avoid these kinds of unexpected
> issues.
>
> On Sat, Mar 28, 2015 at 6:21 AM, Steve Loughran <ste...@hortonworks.com>
> wrote:
>>
>> It's worth adding that there's no guarantee that re-evaluated work would
>> be on the same host as before, and in the case of node failure, it is not
>> guaranteed to be elsewhere.
>>
>> This means things that depend on host-local information are going to
>> generate different numbers even if there are no other side effects. Random
>> number generation for seeding RDD.sample() would be a case in point here.
>>
>> There's also the fact that if you enable speculative execution, then
>> operations may be repeated, even in the absence of any failure. If you are
>> doing side-effect work, or don't have an output committer whose actions are
>> guaranteed to be atomic, then you want to turn that option off.
>>
>> > On 27 Mar 2015, at 19:46, Patrick Wendell <pwend...@gmail.com> wrote:
>> >
>> > If you invoke this, you will get at-least-once semantics on failure.
>> > For instance, if a machine dies in the middle of executing the foreach
>> > for a single partition, that will be re-executed on another machine.
>> > It could even fully complete on one machine, but the machine dies
>> > immediately before reporting the result back to the driver.
>> >
>> > This means you need to make sure the side effects are idempotent, or
>> > use some transactional locking. Spark's own output operations, such as
>> > saving to Hadoop, use such mechanisms. For instance, in the case of
>> > Hadoop it uses the OutputCommitter classes.
>> >
>> > - Patrick
>> >
>> > On Fri, Mar 27, 2015 at 12:36 PM, Michal Klos <michal.klo...@gmail.com> wrote:
>> >> Hi Spark group,
>> >>
>> >> We haven't been able to find clear descriptions of how Spark handles the
>> >> resiliency of RDDs in relationship to executing actions with side effects.
>> >> If you do an `rdd.foreach(someSideEffect)`, then you are doing a side effect
>> >> for each element in the RDD. If a partition goes down, the resiliency
>> >> rebuilds the data, but did it keep track of how far it got in the
>> >> partition's set of data, or will it start from the beginning again? So will
>> >> it do at-least-once execution of foreach closures or at-most-once?
>> >>
>> >> thanks,
>> >> Michal
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: user-h...@spark.apache.org
>> >
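Patrick's at-least-once point above can be sketched in plain Scala without Spark (the `AtLeastOnce` object and its sinks are hypothetical, for illustration only): replaying a partition, as failure recovery or speculation would, double-counts with a blind append-style side effect but is harmless with an idempotent keyed one.

```scala
import scala.collection.mutable

object AtLeastOnce {
  val partition = Seq("a", "b", "c")

  // Replay the same partition `attempts` times, as recovery or speculation
  // would, and report what each kind of sink ends up holding.
  def run(attempts: Int): (Int, Int) = {
    val appendLog = mutable.ArrayBuffer.empty[String] // non-idempotent: blind append
    val upserted  = mutable.Set.empty[String]         // idempotent: keyed upsert
    (1 to attempts).foreach { _ =>
      partition.foreach { e =>
        appendLog += e
        upserted += e
      }
    }
    (appendLog.size, upserted.size)
  }

  def main(args: Array[String]): Unit = {
    val (appended, upserted) = run(attempts = 2)
    // append-only sink sees 6 records; keyed sink sees 3
    println(s"append-only sink: $appended records, keyed sink: $upserted records")
  }
}
```

So under `rdd.foreach`'s at-least-once semantics, the answer to Michal's question is that the closure can run more than once per element, and only the idempotent (or transactionally committed) variant gives the same final result either way.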