Got it. Thanks!

On Wed, Jan 21, 2015 at 1:47 PM, Nathan Marz <[email protected]> wrote:

> Spouts and bolts provide you an at-least-once guarantee, so it's
> completely up to you to figure out how to get your app to work with that.
> Storm can't give you any help beyond replaying the tuples.
>
> Trident, on the other hand, does all state updates through the "State"
> abstraction and gives you a monotonically increasing batch id whenever
> state updates are to be applied. If you store that batch id alongside whatever
> state you're updating, you can detect whether you're seeing something that's
> been successfully processed before or something brand new. This is
> described in that state doc I sent.
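
The batch-id trick described here can be sketched without any Storm dependency. Below is a minimal, hypothetical Java sketch (the `TxState` class, `applyBatch` method, and key names are illustrative, not Trident's actual API): an update is applied only if the stored batch id differs from the incoming one, so a replayed batch becomes a no-op.

```java
import java.util.HashMap;
import java.util.Map;

public class TxState {
    // key -> {last applied batch txid, current value}
    private final Map<String, long[]> store = new HashMap<>();

    // Apply an increment only if this batch txid hasn't already been applied
    // to this key; a replay of the same batch is skipped.
    public void applyBatch(long txid, String key, long delta) {
        long[] cur = store.get(key);
        if (cur != null && cur[0] == txid) {
            return; // batch already committed for this key: ignore the replay
        }
        long base = (cur == null) ? 0 : cur[1];
        store.put(key, new long[]{txid, base + delta});
    }

    public long get(String key) {
        long[] cur = store.get(key);
        return cur == null ? 0 : cur[1];
    }

    public static void main(String[] args) {
        TxState state = new TxState();
        state.applyBatch(1, "clicks", 10);
        state.applyBatch(1, "clicks", 10); // batch 1 replayed: ignored
        state.applyBatch(2, "clicks", 5);
        System.out.println(state.get("clicks")); // prints 15
    }
}
```

This only works because Trident's transactional batch ids are monotonically increasing and a given batch contains the same tuples on every replay; with an ordinary spout neither property holds.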
>
> On Wed, Jan 21, 2015 at 4:15 PM, Shawn Bonnin <[email protected]>
> wrote:
>
>> Nathan, first, thanks a lot for the quick response. I read through the
>> Trident guarantees. It seems like micro-batching will help with the
>> exactly-once guarantees on the bolts that write to external data stores
>> in the commit phase of a batch.
>>
>> However, I have a clarifying question about what you said -
>>
>> *Suppose for example your spout emits tuples A, B, C, D, E and tuple C
>> fails. A spout like KestrelSpout would re-emit only tuple C. KafkaSpout, on
>> the other hand, would also re-emit all tuples after the failed tuple. So it
>> would re-emit C, D, and E, even if D and E were successfully processed.*
>>
>> My question is: how does the KafkaSpout know how many tuples were sent
>> through after C? Does it rely on ZooKeeper to get the offsets and just
>> replay everything after that offset? If so, do we have to handle the
>> repercussions of state corruption etc. in our downstream bolts? Our
>> downstream bolts will be looking for event-sequence-based patterns, so when
>> they see the same event twice, they will need smarts to know whether that
>> was due to a system failure and replay or an actual business occurrence.
>>
>> It seems like these smarts will need to be built regardless of whether we
>> do tuple-at-a-time processing or use Trident.
>>
>> Am I correct in my assessment?
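
One naive form of those "smarts" is tracking unique event ids in the downstream bolt, so a replayed tuple can be told apart from a genuinely repeated business event. A hypothetical sketch (the `DedupBolt` class and method names are illustrative, and it assumes every event carries a unique id assigned at the source; a real implementation would need persistent, bounded state rather than an unbounded in-memory set):

```java
import java.util.HashSet;
import java.util.Set;

public class DedupBolt {
    // Ids of events already processed. In production this would have to be
    // persisted (to survive worker restarts) and evicted (to stay bounded).
    private final Set<String> seen = new HashSet<>();

    // Returns true if the event is new, false if it is a replayed duplicate.
    public boolean process(String eventId) {
        return seen.add(eventId);
    }

    public static void main(String[] args) {
        DedupBolt bolt = new DedupBolt();
        System.out.println(bolt.process("evt-1")); // true  (new)
        System.out.println(bolt.process("evt-2")); // true  (new)
        System.out.println(bolt.process("evt-1")); // false (replay)
    }
}
```

Note this distinguishes replays from new events only if business-level repeats get fresh event ids at the source; otherwise the batch-id approach Nathan describes is the safer route.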
>>
>>
>> Thanks a lot!
>>
>> On Wed, Jan 21, 2015 at 11:43 AM, Nathan Marz <[email protected]>
>> wrote:
>>
>>> There's no such thing as a total order in a distributed system, as
>>> streams are processed in parallel. The ordering guarantee Storm provides is
>>> that tuples sent between tasks are received in the order they were sent.
>>>
>>> Another part of your question is what kind of ordering guarantees you
>>> get during failures. With regular Storm, when a tuple fails it depends on
>>> the spout to determine what to re-emit. Suppose for example your spout
>>> emits tuples A, B, C, D, E and tuple C fails. A spout like KestrelSpout
>>> would re-emit only tuple C. KafkaSpout, on the other hand, would also
>>> re-emit all tuples after the failed tuple. So it would re-emit C, D, and E,
>>> even if D and E were successfully processed.
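
The offset-based replay behavior described above can be illustrated with a small, Storm-free sketch (assuming, as with KafkaSpout, that the spout tracks only a single committed offset, so a failure at offset i forces a replay of everything from i onward):

```java
import java.util.Arrays;
import java.util.List;

public class OffsetReplay {
    // A Kafka-style spout can only rewind to an offset, not skip over
    // individual failed tuples, so everything after the failure is re-emitted.
    public static List<String> replayFrom(List<String> log, int failedOffset) {
        return log.subList(failedOffset, log.size());
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList("A", "B", "C", "D", "E");
        // Tuple C (offset 2) failed: D and E are re-emitted as well.
        System.out.println(replayFrom(log, 2)); // prints [C, D, E]
    }
}
```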
>>>
>>> Trident provides stronger ordering guarantees, as it provides a total
>>> ordering among the commit phases for batches. So if a batch fails to commit
>>> it will be retried indefinitely until it succeeds. See
>>> http://storm.apache.org/documentation/Trident-state.html and
>>> http://storm.apache.org/documentation/Trident-spouts.html for more info
>>> on this.
>>>
>>> On Wed, Jan 21, 2015 at 2:34 PM, Shawn Bonnin <[email protected]>
>>> wrote:
>>>
>>>> We're trying to look for patterns in the input stream based on the
>>>> arrival sequence. We can use something like Kafka on the input to
>>>> guarantee order, but once the tuples enter the topology, how can we make
>>>> sure that they are processed in the same order in which they arrived on
>>>> Kafka?
>>>>
>>>> On Wed, Jan 21, 2015 at 11:30 AM, Naresh Kosgi <[email protected]>
>>>> wrote:
>>>>
>>>>> Also, more information about why you need a certain order for
>>>>> processing would help in recommending how to approach the problem.
>>>>>
>>>>> On Wed, Jan 21, 2015 at 2:28 PM, Naresh Kosgi <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Storm as a framework does not guarantee order. You will have to code
>>>>>> it yourself if you would like your tuples processed in a certain order.
>>>>>>
>>>>>> On Wed, Jan 21, 2015 at 2:24 PM, Shawn Bonnin <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Resending...
>>>>>>>
>>>>>>> Our use case requires that the tuples be processed in order across
>>>>>>> failures.
>>>>>>>
>>>>>>> So we have Spout A sending data to Bolts B & C, and Bolt D is the last
>>>>>>> bolt, which aggregates data from B & C and writes to a database.
>>>>>>>
>>>>>>> We want to make sure that, whether we use tuple-at-a-time processing
>>>>>>> or the Trident API, the data always gets processed in the same order
>>>>>>> as it was read by our spout. Given that between Bolts B & C there would
>>>>>>> be parallelism and intermittent failures, my question is the following -
>>>>>>>
>>>>>>> How does Storm guarantee processing order of tuples?
>>>>>>>
>>>>>>> Thanks in advance!
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Twitter: @nathanmarz
>>> http://nathanmarz.com
>>>
>>
>>
>
>
> --
> Twitter: @nathanmarz
> http://nathanmarz.com
>
