Got it. Thanks!

On Wed, Jan 21, 2015 at 1:47 PM, Nathan Marz <[email protected]> wrote:

Spouts and bolts provide you an at-least-once guarantee, so it's completely up to you to figure out how to get your app to work with that. Storm is unable to provide you any help besides replaying the tuples.

Trident, on the other hand, does all state updates in the "State" abstraction and gives you a monotonically increasing batch id whenever state updates are to be applied. If you store that batch id with whatever state you're updating, you can detect whether you're seeing something that's already been successfully processed or whether it's brand new. This is described in that state doc I sent.

--
Twitter: @nathanmarz
http://nathanmarz.com
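To make the batch-id idea concrete, here is a minimal sketch of the pattern described above, written in plain Java against a made-up in-memory store (TxAwareCounterStore and TxValue are illustrative names, not Storm classes; Trident's transactional map state applies the same idea to a real database):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Stores each value together with the id of the batch that last wrote it.
public class TxAwareCounterStore {

    static class TxValue {
        final long txid;
        final long count;
        TxValue(long txid, long count) { this.txid = txid; this.count = count; }
    }

    private final Map<String, TxValue> store = new ConcurrentHashMap<>();

    // Apply a per-batch increment at most once. If the stored txid equals the
    // current batch id, this batch was already applied (we are seeing a replay),
    // so the update is skipped instead of double-counting.
    public void applyBatch(String key, long currentTxid, long delta) {
        TxValue existing = store.get(key);
        if (existing != null && existing.txid == currentTxid) {
            return; // replayed batch: already applied
        }
        long base = (existing == null) ? 0 : existing.count;
        store.put(key, new TxValue(currentTxid, base + delta));
    }

    public long get(String key) {
        TxValue v = store.get(key);
        return (v == null) ? 0 : v.count;
    }
}

This only yields exactly-once updates when a retried batch contains exactly the same tuples as the original attempt (the "transactional" case in the Trident state doc); the opaque variant additionally keeps the previous value so it can tolerate batch contents changing across retries.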
On Wed, Jan 21, 2015 at 4:15 PM, Shawn Bonnin <[email protected]> wrote:

Nathan, first, thanks a lot for the quick response. I read through the Trident guarantees. It seems like micro-batching will help with the exactly-once guarantees on the bolts that write to external data stores in the commit phase of a batch.

However, I have a clarifying question about what you said:

*Suppose for example your spout emits tuples A, B, C, D, E and tuple C fails. A spout like KestrelSpout would re-emit only tuple C. KafkaSpout, on the other hand, would also re-emit all tuples after the failed tuple. So it would re-emit C, D, and E, even if D and E were successfully processed.*

My question is: how does the Kafka spout know how many tuples were sent through after C? Does it rely on Zookeeper to get the offsets and just replay everything after that offset? If so, do we have to handle the repercussions of state corruption, etc., in our downstream bolts? Our downstream bolts will be looking for event-sequence-based patterns, so when they see the same event twice, they will need smarts to know whether that was due to a system failure and replay or an actual business occurrence.

It seems like these smarts will need to be built regardless of whether we do tuple-at-a-time processing or use Trident.

Am I correct in my assessment?

Thanks a lot!

On Wed, Jan 21, 2015 at 11:43 AM, Nathan Marz <[email protected]> wrote:

There's no such thing as a total order in a distributed system, as streams are processed in parallel. The ordering guarantee Storm provides is that tuples sent between tasks are received in the order they were sent.

Another part of your question is what kind of ordering guarantees you get during failures. With regular Storm, when a tuple fails it depends on the spout to determine what to re-emit. Suppose for example your spout emits tuples A, B, C, D, E and tuple C fails. A spout like KestrelSpout would re-emit only tuple C. KafkaSpout, on the other hand, would also re-emit all tuples after the failed tuple. So it would re-emit C, D, and E, even if D and E were successfully processed.

Trident provides stronger ordering guarantees, as it provides a total ordering among the commit phases for batches. So if a batch fails to commit, it will be retried indefinitely until it succeeds. See http://storm.apache.org/documentation/Trident-state.html and http://storm.apache.org/documentation/Trident-spouts.html for more info on this.

--
Twitter: @nathanmarz
http://nathanmarz.com
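One way to build those "smarts" in a plain (non-Trident) topology is to lean on the fact that Kafka offsets within a partition only increase. The sketch below is a hypothetical downstream bolt that assumes the spout attaches "partition" and "offset" fields to each tuple (you would have to add those yourself; the class and field names are made up). The imports use the backtype.storm packages current at the time; later Apache releases moved them to org.apache.storm. Note that the last-seen map lives only in worker memory, so this filters spout replays but does not survive a worker restart.

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import java.util.HashMap;
import java.util.Map;

// Drops tuples it has already seen, using the Kafka partition/offset attached
// upstream: offsets within a partition only increase, so "offset <= last seen"
// means the tuple is a replay rather than a new event.
public class DedupByOffsetBolt extends BaseRichBolt {
    private OutputCollector collector;
    private Map<Integer, Long> lastSeenOffset; // partition -> highest offset processed

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.lastSeenOffset = new HashMap<>();
    }

    @Override
    public void execute(Tuple tuple) {
        int partition = tuple.getIntegerByField("partition");
        long offset = tuple.getLongByField("offset");

        Long last = lastSeenOffset.get(partition);
        if (last != null && offset <= last) {
            collector.ack(tuple);   // replayed tuple: ack it but don't reprocess
            return;
        }

        lastSeenOffset.put(partition, offset);
        // Anchor the outgoing tuple to the incoming one so the at-least-once
        // guarantee stays intact for everything downstream of this bolt.
        collector.emit(tuple, new Values(tuple.getValueByField("event")));
        collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("event"));
    }
}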
On Wed, Jan 21, 2015 at 2:34 PM, Shawn Bonnin <[email protected]> wrote:

Trying to look for patterns in the input stream based on the arrival sequence. We can use something like Kafka on the input to guarantee order, but once the tuples enter the topology, how can we make sure that they are processed in the same order in which they arrived on Kafka?

On Wed, Jan 21, 2015 at 11:30 AM, Naresh Kosgi <[email protected]> wrote:

Also, more information about why you need a certain order for processing would help in recommending how to approach the problem.

On Wed, Jan 21, 2015 at 2:28 PM, Naresh Kosgi <[email protected]> wrote:

Storm as a framework does not guarantee order. You will have to code it yourself if you would like your tuples processed in a certain order.

On Wed, Jan 21, 2015 at 2:24 PM, Shawn Bonnin <[email protected]> wrote:

Resending...

Our use case requires the tuples to be processed in order across failures.

So we have Spout A sending data to Bolts B & C, and Bolt D is the last bolt, which aggregates data from B & C and writes to a database.

We want to make sure that whether we use tuple-at-a-time processing or the Trident API, the data always gets processed in the same order in which it was read by our spout. Given that between Bolts B & C there would be parallelism and intermittent failures, my question is the following:

How does Storm guarantee processing order of tuples?

Thanks in advance!

On Wed, Jan 21, 2015 at 10:57 AM, Shawn Bonnin <[email protected]> wrote:

Our use case requires the tuples to be processed in order across failures.

So we have Spout A sending data to Bolts B & C, and Bolt D is the last bolt, which aggregates data from B & C and writes to a database.

We want to make sure that whether we use tuple-at-a-time processing or the Trident API, the data always gets processed in the same order in which it was read by our spout. Given that between Bolts B & C there would be parallelism and intermittent failures, my question is the following:

How does Storm guarantee processing order of tuples?

Thanks in advance!
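For what it's worth, the closest plain Storm gets to the ordering asked about here follows from the per-task guarantee Nathan mentions: tuples sent between two tasks arrive in the order they were sent, so if every event for a given key is routed to the same task with a fields grouping, that task sees that key's events in emission order (under normal operation; failures and replays can still reorder things, as discussed above). A hypothetical wiring sketch, with made-up spout/bolt class names and a made-up "deviceId" key field:

import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

// KafkaEventSpout, PatternBoltB/C, and AggregatorBoltD are placeholders for
// your own components; a single spout task keeps Kafka read order intact.
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("events", new KafkaEventSpout(), 1);

// fieldsGrouping sends every tuple with the same deviceId to the same task,
// so each task sees its keys' events in the order the spout emitted them.
builder.setBolt("pattern-b", new PatternBoltB(), 4)
       .fieldsGrouping("events", new Fields("deviceId"));
builder.setBolt("pattern-c", new PatternBoltC(), 4)
       .fieldsGrouping("events", new Fields("deviceId"));

// Bolt D aggregates B's and C's output per key before writing to the database.
builder.setBolt("aggregate-d", new AggregatorBoltD(), 2)
       .fieldsGrouping("pattern-b", new Fields("deviceId"))
       .fieldsGrouping("pattern-c", new Fields("deviceId"));

This gives per-key ordering within each hop, not a global order: Bolt D still receives B's and C's streams interleaved nondeterministically, so any cross-stream sequencing has to be coded in D itself, as Naresh notes.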
