Re: Exactly once processing

2016-04-20 Thread Sabarish Sasidharan
Great, thanks all!

Regards
Sab

On Tue, Apr 19, 2016 at 5:00 AM, Yi Pan wrote:
> Hi, Sabarish Sasidharan,
>
> The key point is to make your KV-store update idempotent. So, if the offset
> associated with the aggregated value is written in the same row in RocksDB
> (i.e. atomicity is achieved he

Re: Exactly once processing

2016-04-18 Thread Yi Pan
Hi, Sabarish Sasidharan,

The key point is to make your KV-store update idempotent. So, if the offset associated with the aggregated value is written in the same row in RocksDB (i.e. atomicity is achieved here), I think that your approach would work. As Robert mentioned, offsets are always committ
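
The idempotent update described above can be illustrated with a minimal sketch. This is not Samza code: a plain Python dict stands in for the RocksDB-backed store, and the names Row, IdempotentAggregator, and process are hypothetical. It also assumes that all messages for a given key arrive from a single partition, so that their offsets only increase.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Row:
    """One store row: the aggregate and the offset it reflects, kept together."""
    total: int        # aggregated value
    last_offset: int  # offset of the last message folded into total


class IdempotentAggregator:
    """Hypothetical stand-in for a task that aggregates into a KV store."""

    def __init__(self) -> None:
        # A dict stands in for the RocksDB store (assumption for illustration).
        self.store: Dict[str, Row] = {}

    def process(self, key: str, value: int, offset: int) -> bool:
        """Fold value into the aggregate unless this offset was already applied.

        Returns True if the message was applied, False if it was skipped
        as a replayed duplicate.
        """
        row = self.store.get(key, Row(total=0, last_offset=-1))
        if offset <= row.last_offset:
            # Replayed message: the stored row already includes it.
            return False
        # Value and offset are replaced in one write, so they cannot diverge.
        self.store[key] = Row(total=row.total + value, last_offset=offset)
        return True
```

If the checkpoint lags behind the store after a failure, messages at offsets 0 and 1 may be delivered again; calling process with those offsets a second time leaves the row unchanged, so the aggregate is not double counted.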

Re: Exactly once processing

2016-04-15 Thread Robert Crim
Looking at: https://github.com/apache/samza/blob/f02386464d31b5a496bb0578838f51a0331bfffa/samza-core/src/main/scala/org/apache/samza/container/TaskInstance.scala#L171

The commit function, in order, does:
1. Flushes metrics
2. Flushes stores
3. Produces messages from the collectors
4. Write offset

Re: Exactly once processing

2016-04-15 Thread Sabarish Sasidharan
Hi Guozhang,

Thanks. Assuming the checkpoint would typically be behind the offset persisted in my store (+ changelog), when the messages are replayed starting from the checkpoint, I can very well skip those by comparing against the offset in my store, right? So I am not understanding why duplicates

Re: Exactly once processing

2016-04-15 Thread Guozhang Wang
Hi Sab,

For stateful processing where you have persistent state stores, you need to keep the checkpoint, which includes the committed offsets, and the flushed store in sync, but right now these two operations are not done atomically, and hence if you fail in between, you could still get d