Any operator in a batch job will receive all of its elements in one complete successful run.
The mapper starts its work immediately. On a failure, a fresh mapper is used, and all of the data is replayed. You can think of it as if there was only a single checkpoint at the very beginning (before any data was sent) that they fall back to. For mapper-internal state, there can be no duplicates. For the interaction with the outside world, there can always be duplicates, for example if the mapper inserts data into a database. The database would have data from the initial run (that failed or was canceled) and the recovery run. On Thu, Jul 9, 2015 at 4:13 PM, 马国维 <maguo...@outlook.com> wrote: > DataSet<String> result = in.rebalance() > .map(new Mapper());In the case does the 'map' > receive all the data then begin to worker?Will rebalance operator failed > cause some duplicate record if the above answer is false ? > > Date: Thu, 9 Jul 2015 15:40:18 +0200 > > Subject: Re: Does DataSet job also use Barriers to ensure "exactly > once."? > > From: se...@apache.org > > To: dev@flink.apache.org > > > > Currently, Flink restarts the entire job upon failure. > > > > There is WIP that restricts this to all tasks involved in the pipeline of > > the failed task. > > > > Let's say we have pipelined MapReduce. If a mapper fails, the reducers > that > > have received some data already have to be restarted as well. > > > > In that case, pipelined exchange works like "speculatively" starting the > > reducers early. It helps when no failure occurs. > > When a failure occurs, the reducers do still not start later than in a > > batch exchange mode, where they are started only once the mappers are > done > > (and no failure can occur any more). > > > > > > On Thu, Jul 9, 2015 at 3:34 PM, 马国维 <maguo...@outlook.com> wrote: > > > > > DataExchangeMode is Piped > > > If Two operators use Piped Mode to exchange the data , Failed > partitions > > > have already send some data to the receiver before it failed.So Does > > > Replaying all the failed partitions cause some duplicate records ? > > > > > > > > > > Date: Thu, 9 Jul 2015 14:47:29 +0200 > > > > Subject: Re: Does DataSet job also use Barriers to ensure "exactly > > > once."? > > > > From: ktzou...@apache.org > > > > To: dev@flink.apache.org > > > > > > > > No, it doesn't; periodic snapshots are not needed in DataSet > programs, as > > > > DataSets are of finite size and failed partitions can be replayed > > > > completely. > > > > > > > > > > > > On Thu, Jul 9, 2015 at 2:43 PM, 马国维 <maguo...@outlook.com> wrote: > > > > > > > > > hi, everyoneThe doc say Flink Streaming use "Barriers" to ensure > > > > > "exactly once."Does the DataSet job use the same mechanism to ensue > > > > > "exactly once" if a map task is failed?thanks > > > > > > > > > > > > >