Note that speculation is off by default to avoid these kinds of unexpected issues.
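For reference, speculative execution is governed by the `spark.speculation` setting (off by default, as noted above). A minimal `spark-defaults.conf` fragment to make the default explicit might look like:

```
spark.speculation    false
```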
On Sat, Mar 28, 2015 at 6:21 AM, Steve Loughran <ste...@hortonworks.com> wrote:
>
> It's worth adding that there's no guarantee that re-evaluated work would
> be on the same host as before, and in the case of node failure, it is not
> guaranteed to be elsewhere.
>
> This means that things which depend on host-local information are going to
> generate different numbers even if there are no other side effects. Random
> number generation for seeding RDD.sample() would be a case in point here.
>
> There's also the fact that if you enable speculative execution, then
> operations may be repeated, even in the absence of any failure. If you are
> doing side-effect work, or don't have an output committer whose actions are
> guaranteed to be atomic, then you want to turn that option off.
>
>
> On 27 Mar 2015, at 19:46, Patrick Wendell <pwend...@gmail.com> wrote:
> >
> > If you invoke this, you will get at-least-once semantics on failure.
> > For instance, if a machine dies in the middle of executing the foreach
> > for a single partition, that will be re-executed on another machine.
> > It could even fully complete on one machine, but the machine dies
> > immediately before reporting the result back to the driver.
> >
> > This means you need to make sure the side effects are idempotent, or
> > use some transactional locking. Spark's own output operations, such as
> > saving to Hadoop, use such mechanisms. For instance, in the case of
> > Hadoop it uses the OutputCommitter classes.
> >
> > - Patrick
> >
> > On Fri, Mar 27, 2015 at 12:36 PM, Michal Klos <michal.klo...@gmail.com> wrote:
> >> Hi Spark group,
> >>
> >> We haven't been able to find clear descriptions of how Spark handles the
> >> resiliency of RDDs in relation to executing actions with side effects.
> >> If you do an `rdd.foreach(someSideEffect)`, then you are performing a
> >> side effect for each element in the RDD. If a partition goes down, the
> >> resiliency rebuilds the data, but does it keep track of how far it got in
> >> the partition's set of data, or will it start from the beginning again? So
> >> will it do at-least-once execution of the foreach closures or at-most-once?
> >>
> >> thanks,
> >> Michal
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
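To make the at-least-once point above concrete, here is a minimal sketch (plain Python, not the Spark API; the sink structures are hypothetical) of why side effects in a foreach must be idempotent. Under failure retry or speculation, the task for a partition may run more than once; a sink keyed by record id absorbs the duplicate run, while a naive append does not.

```python
# Simulating a partition's task being executed twice (retry/speculation).

def naive_append_sink(sink, records):
    # Non-idempotent: re-running the task duplicates the side effects.
    for r in records:
        sink.append(r)

def idempotent_upsert_sink(sink, records):
    # Idempotent: keyed writes make a re-run a no-op.
    for key, value in records:
        sink[key] = value

partition = [(0, "a"), (1, "b"), (2, "c")]

appended = []
naive_append_sink(appended, partition)
naive_append_sink(appended, partition)       # task re-executed
print(len(appended))                         # 6: side effects duplicated

upserted = {}
idempotent_upsert_sink(upserted, partition)
idempotent_upsert_sink(upserted, partition)  # re-run is harmless
print(len(upserted))                         # 3: same result as one run
```

This is the same idea behind Hadoop's OutputCommitter mentioned above: tasks write to a staging location, and only one successful attempt's output is atomically committed.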