Re: Does DataSet job also use Barriers to ensure "exactly once."?

Stephan Ewen Thu, 09 Jul 2015 07:49:18 -0700

Any operator in a batch job will receive all of its elements in one
complete successful run.


The mapper starts its work immediately. On a failure, a fresh mapper is
used, and all of the data is replayed. You can think of it as if there was
only a single checkpoint at the very beginning (before any data was sent)
that they fall back to. For mapper-internal state, there can be no
duplicates.

For the interaction with the outside world, there can always be duplicates,
for example if the mapper inserts data into a database. The database would
have data from the initial run (that failed or was canceled) and the
recovery run.




On Thu, Jul 9, 2015 at 4:13 PM, 马国维 <maguo...@outlook.com> wrote:

> DataSet<String> result = in.rebalance()
>                            .map(new Mapper());In the case  does the 'map'
> receive all the data then begin to worker?Will rebalance operator failed
> cause some duplicate record if the above answer is false ?
> > Date: Thu, 9 Jul 2015 15:40:18 +0200
> > Subject: Re: Does DataSet job also use Barriers to ensure "exactly
> once."?
> > From: se...@apache.org
> > To: dev@flink.apache.org
> >
> > Currently, Flink restarts the entire job upon failure.
> >
> > There is WIP that restricts this to all tasks involved in the pipeline of
> > the failed task.
> >
> > Let's say we have pipelined MapReduce. If a mapper fails, the reducers
> that
> > have received some data already have to be restarted as well.
> >
> > In that case, pipelined exchange works like "speculatively" starting the
> > reducers early. It helps when no failure occurs.
> > When a failure occurs, the reducers do still not start later than in a
> > batch exchange mode, where they are started only once the mappers are
> done
> > (and no failure can occur any more).
> >
> >
> > On Thu, Jul 9, 2015 at 3:34 PM, 马国维 <maguo...@outlook.com> wrote:
> >
> > > DataExchangeMode is Piped
> > > If Two operators use Piped Mode to exchange the data , Failed
> partitions
> > > have  already send some data to the receiver before it failed.So Does
> > > Replaying all the failed partitions  cause some duplicate records ?
> > >
> > >
> > > > Date: Thu, 9 Jul 2015 14:47:29 +0200
> > > > Subject: Re: Does DataSet job also use Barriers to ensure "exactly
> > > once."?
> > > > From: ktzou...@apache.org
> > > > To: dev@flink.apache.org
> > > >
> > > > No, it doesn't; periodic snapshots are not needed in DataSet
> programs, as
> > > > DataSets are of finite size and failed partitions can be replayed
> > > > completely.
> > > >
> > > >
> > > > On Thu, Jul 9, 2015 at 2:43 PM, 马国维 <maguo...@outlook.com> wrote:
> > > >
> > > > > hi, everyoneThe doc say Flink Streaming use "Barriers" to  ensure
> > > > > "exactly once."Does the DataSet job use the same mechanism to ensue
> > > > > "exactly once"  if a map task is failed?thanks
> > > > >
> > >
> > >
>
>

Re: Does DataSet job also use Barriers to ensure "exactly once."?

Reply via email to