Processing in batches is not the same thing as being transactional. Storm, for example, will skip tuples that were already applied to a state, to avoid counting the same tuple twice. Spark doesn't come with such a facility, so you could end up counting things twice.
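[The "skip tuples that were already applied" idea can be sketched in plain Python. This is a hypothetical toy, not Storm's or Trident's actual API: the state remembers the id of the last batch applied to it, so a replayed batch is ignored instead of being counted twice.]

```python
# Toy sketch (NOT Storm's real API): transactional state that remembers
# the last applied batch/transaction id, so a replayed batch is skipped
# instead of being double counted.

class TransactionalCount:
    def __init__(self):
        self.count = 0
        self.last_txid = None  # id of the last batch applied to this state

    def apply_batch(self, txid, tuples):
        if txid == self.last_txid:
            return  # batch already applied: skip to avoid double counting
        self.count += len(tuples)
        self.last_txid = txid

state = TransactionalCount()
state.apply_batch(1, ["a", "b"])
state.apply_batch(2, ["c"])
state.apply_batch(2, ["c"])  # replay after a failure: skipped
```

[With a naive counter, the replayed batch would give a count of 4; here the count stays at 3 because the state itself tracks what was already applied.]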
On Wed, Jun 17, 2015 at 2:09 PM, Ashish Soni <[email protected]> wrote:

> A stream can also be processed in micro-batches / batches, which is the main idea behind Spark Streaming, so what is the difference?
>
> Ashish
>
> On Wed, Jun 17, 2015 at 9:04 AM, Enno Shioji <[email protected]> wrote:
>
>> PS just to elaborate on my first sentence: the reason Spark (not Streaming) can offer exactly-once semantics is that its update operation is idempotent. This is easy to do in a batch context because the input is finite, but it's harder in a streaming context.
>>
>> On Wed, Jun 17, 2015 at 2:00 PM, Enno Shioji <[email protected]> wrote:
>>
>>> So Spark (not Streaming) does offer exactly-once. Spark Streaming, however, can only offer exactly-once semantics *if the update operation is idempotent*. updateStateByKey's update operation is idempotent, because it completely replaces the previous state.
>>>
>>> So as long as you use Spark Streaming, you must somehow make the update operation idempotent. Replacing the entire state is the easiest way to do it, but it's obviously expensive.
>>>
>>> The alternative is to do something similar to what Storm does. At that point, you'll have to ask whether just using Storm is easier.
>>>
>>> On Wed, Jun 17, 2015 at 1:50 PM, Ashish Soni <[email protected]> wrote:
>>>
>>>> As per my best understanding, Spark Streaming offers exactly-once processing. Is this achieved only through updateStateByKey, or is there another way to do the same?
>>>>
>>>> Ashish
>>>>
>>>> On Wed, Jun 17, 2015 at 8:48 AM, Enno Shioji <[email protected]> wrote:
>>>>
>>>>> In that case I assume you need exactly-once semantics. There's no out-of-the-box way to do that in Spark.
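[The idempotence point in the quoted discussion can be made concrete with a plain-Python sketch (this is not Spark code; `increment` and `replace` are hypothetical names): an increment-style update double counts when a micro-batch is replayed, while a replace-the-whole-state update, as described for updateStateByKey above, yields the same result no matter how many times it runs.]

```python
# Sketch: replaying the same micro-batch under two update styles.
events = ["a", "a", "b"]  # one micro-batch of events

# 1) Increment-style update: NOT idempotent.
counts = {}
def increment(batch, state):
    for e in batch:
        state[e] = state.get(e, 0) + 1
    return state

increment(events, counts)
increment(events, counts)  # replay: "a" is now counted 4 times, not 2

# 2) Replace-style update (the idea behind replacing the entire state):
#    idempotent, because the old state is discarded wholesale.
def replace(batch, _old_state):
    new_state = {}
    for e in batch:
        new_state[e] = new_state.get(e, 0) + 1
    return new_state  # previous state fully replaced

state = replace(events, {})
state = replace(events, state)  # replay: same result as applying once
```

[This is also why the replace approach is expensive: it rebuilds (and, at checkpoint time, rewrites) the whole state rather than just the delta.]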
>>>>> There is updateStateByKey, but it's not practical for your use case, as the state is too large (it will try to dump the entire intermediate state on every checkpoint, which would be prohibitively expensive).
>>>>>
>>>>> So either you have to implement something yourself, or you can use Storm Trident (or the transactional low-level API).
>>>>>
>>>>> On Wed, Jun 17, 2015 at 1:26 PM, Ashish Soni <[email protected]> wrote:
>>>>>
>>>>>> My use case is below.
>>>>>>
>>>>>> We are going to receive a lot of events as a stream (basically a Kafka stream), and then we need to process and compute.
>>>>>>
>>>>>> Consider that you have a phone contract with AT&T, and every call / SMS / data usage is an event. The bill then needs to be calculated on a real-time basis, so that when you log in to your account you can see all those variables: how much you have used, how much is left, and what your bill is to date. There are also different rules which need to be considered when calculating the total bill; one simple rule would be that 0-500 minutes are free, but above that it is $1 a minute.
>>>>>>
>>>>>> How do I maintain shared state (total amount, total minutes, total data, etc.) so that I know how much has accumulated at any given point, given that events for the same phone can go to any node / executor?
>>>>>>
>>>>>> Can someone please tell me how I can achieve this in Spark? In Storm I can have a bolt which does this.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> On Wed, Jun 17, 2015 at 4:52 AM, Enno Shioji <[email protected]> wrote:
>>>>>>
>>>>>>> I guess both. In terms of syntax, I was comparing it with Trident.
>>>>>>>
>>>>>>> If you are joining, Spark Streaming actually does offer a windowed join out of the box. We couldn't use this, though, as our event streams can grow "out-of-sync", so we had to implement something on top of Storm.
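[The billing rule in the use case above can be sketched in plain Python. All names here (`update_usage`, `bill`, the event fields) are hypothetical illustrations, not Spark or Storm code: per-phone state accumulates usage, and the bill is derived from it with the first 500 minutes free and $1 per minute beyond that.]

```python
# Hypothetical sketch of the per-phone state and tiered billing rule
# described above (0-500 minutes free, $1 per minute beyond that).

FREE_MINUTES = 500
RATE_PER_MIN = 1.0

def update_usage(state, event):
    """Fold one usage event into a phone's accumulated state."""
    state = dict(state)  # copy, so an update never mutates old state in place
    state["total_min"] = state.get("total_min", 0) + event.get("minutes", 0)
    state["total_data"] = state.get("total_data", 0) + event.get("mb", 0)
    return state

def bill(state):
    """Derive the current bill from accumulated minutes."""
    billable = max(0, state.get("total_min", 0) - FREE_MINUTES)
    return billable * RATE_PER_MIN

state = {}
for ev in [{"minutes": 300}, {"minutes": 250}, {"mb": 100}]:
    state = update_usage(state, ev)
```

[After these three events the state holds 550 minutes, so the bill is $50. The hard part the thread is debating is not this fold itself, but keeping `state` consistent when the events for one phone land on different nodes and batches may be replayed.]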
>>>>>>> If your event streams don't become out of sync, you may find the built-in join in Spark Streaming useful. Storm also has a join keyword, but its semantics are different.
>>>>>>>
>>>>>>> > Also, what do you mean by "No Back Pressure"?
>>>>>>>
>>>>>>> When a topology is overloaded, Storm is designed so that it will stop reading from the source. Spark, on the other hand, will keep reading from the source and spilling internally. This may be fine, in fairness, but it does mean you have to worry about persistent store usage in the processing cluster, whereas with Storm you don't, because the messages simply remain in the data store.
>>>>>>>
>>>>>>> Spark came up with the idea of rate limiting, but I don't feel this is as nice as back pressure, because it's very difficult to tune the limit such that you don't cap the cluster's processing power, yet still prevent the persistent storage from getting used up.
>>>>>>>
>>>>>>> On Wed, Jun 17, 2015 at 9:33 AM, Spark Enthusiast <[email protected]> wrote:
>>>>>>>
>>>>>>>> When you say Storm, did you mean Storm with Trident or plain Storm?
>>>>>>>>
>>>>>>>> My use case does not have simple transformations. There are complex events that need to be generated by joining the incoming event streams.
>>>>>>>>
>>>>>>>> Also, what do you mean by "No Back Pressure"?
>>>>>>>>
>>>>>>>> On Wednesday, 17 June 2015 11:57 AM, Enno Shioji <[email protected]> wrote:
>>>>>>>>
>>>>>>>> We've evaluated Spark Streaming vs. Storm and ended up sticking with Storm.
>>>>>>>>
>>>>>>>> Some of the important drawbacks are:
>>>>>>>>
>>>>>>>> - Spark has no back pressure (the receiver rate limit can alleviate this to a certain point, but it's far from ideal).
>>>>>>>> - There is also no exactly-once semantics.
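[The back-pressure distinction discussed above can be sketched as a toy model in plain Python (not Storm's or Spark's API): with a bounded buffer the system stops pulling from the source when full (back pressure, messages stay at the source), whereas with no bound the internal buffer grows without limit while the slow consumer falls behind.]

```python
# Toy model of back pressure vs. unbounded buffering. A producer offers
# 20 messages per tick; the consumer drains only 10 per tick.

def run(ticks, produced_per_tick, consumed_per_tick, capacity=None):
    buffered = 0        # messages held inside the processing cluster
    left_at_source = 0  # messages not read because the buffer was full
    for _ in range(ticks):
        if capacity is None:
            buffered += produced_per_tick            # no back pressure: keep reading
        else:
            taken = min(capacity - buffered, produced_per_tick)
            left_at_source += produced_per_tick - taken  # back pressure: stop at capacity
            buffered += taken
        buffered -= min(buffered, consumed_per_tick)
    return buffered, left_at_source

no_bp, _ = run(ticks=10, produced_per_tick=20, consumed_per_tick=10)
bp, at_source = run(ticks=10, produced_per_tick=20, consumed_per_tick=10, capacity=30)
```

[Without back pressure the internal buffer ends at 100 messages and keeps growing with more ticks; with a capacity of 30 it stays bounded at 20, with 80 messages simply left at the source, which is the behavior the thread attributes to Storm.]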
>>>>>>>> (updateStateByKey can achieve these semantics, but it is not practical if you have any significant amount of state, because it works by dumping the entire state on every checkpoint.)
>>>>>>>>
>>>>>>>> There are also some minor drawbacks that I'm sure will be fixed quickly, like no task timeout, not being able to read from Kafka using multiple nodes, and a data-loss hazard with Kafka.
>>>>>>>>
>>>>>>>> It's also not possible to attain very low latency in Spark, if that's what you need.
>>>>>>>>
>>>>>>>> The pro for Spark is the concise and, IMO, more intuitive syntax, especially if you compare it with Storm's Java API.
>>>>>>>>
>>>>>>>> I admit I might be a bit biased towards Storm, though, as I'm more familiar with it.
>>>>>>>>
>>>>>>>> Also, you can do some processing with Kinesis. If all you need is a straightforward transformation and you are reading from Kinesis to begin with, it might be easier to just do the transformation in Kinesis.
>>>>>>>>
>>>>>>>> On Wed, Jun 17, 2015 at 7:15 AM, Sabarish Sasidharan <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Whatever you write in bolts would be the logic you want to apply to your events. In Spark, that logic would be coded in map() or similar transformations and/or actions. Spark doesn't enforce a structure for capturing your processing logic the way Storm does.
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> Sab
>>>>>>>>
>>>>>>>> Probably overloading the question a bit:
>>>>>>>>
>>>>>>>> In Storm, Bolts have the functionality of getting triggered on events. Is that kind of functionality possible with Spark Streaming?
>>>>>>>> During each phase of the data processing, the transformed data is stored to the database, and this transformed data should then be sent to a new pipeline for further processing.
>>>>>>>>
>>>>>>>> How can this be achieved using Spark?
>>>>>>>>
>>>>>>>> On Wed, Jun 17, 2015 at 10:10 AM, Spark Enthusiast <[email protected]> wrote:
>>>>>>>>
>>>>>>>> I have a use case where a stream of incoming events has to be aggregated and joined to create complex events. The aggregation will have to happen at an interval of 1 minute (or less).
>>>>>>>>
>>>>>>>> The pipeline is:
>>>>>>>>
>>>>>>>> Upstream services --(send events)--> KAFKA --(enrich event)--> Event Stream Processor --> Complex Event Processor --> Elastic Search
>>>>>>>>
>>>>>>>> From what I understand, Storm will make a very good ESP and Spark Streaming will make a good CEP.
>>>>>>>>
>>>>>>>> But we are also evaluating Storm with Trident.
>>>>>>>>
>>>>>>>> How does Spark Streaming compare with Storm with Trident?
>>>>>>>>
>>>>>>>> Sridhar Chellappa
>>>>>>>>
>>>>>>>> On Wednesday, 17 June 2015 10:02 AM, ayan guha <[email protected]> wrote:
>>>>>>>>
>>>>>>>> I have a similar scenario where we need to bring data from Kinesis to HBase. Data velocity is 20k per 10 minutes. A little manipulation of the data will be required, but that's regardless of the tool, so we will be writing that piece as a Java POJO.
>>>>>>>>
>>>>>>>> The whole environment is on AWS. HBase is on a long-running EMR cluster, and Kinesis is on a separate cluster.
>>>>>>>>
>>>>>>>> TIA.
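[The store-then-forward question above can be sketched as a toy micro-batch pipeline in plain Python (hypothetical names; in real Spark Streaming the analogous pieces would be transformations such as map() plus an output action, as the advice in this thread indicates): each stage transforms a batch, persists the result, and hands it to the next stage.]

```python
# Toy micro-batch pipeline sketch (NOT Spark's API; the "database" is a dict).
# Stage 1 enriches a batch and stores it; stage 2 aggregates the stored output.

database = {"enriched": [], "aggregated": []}

def enrich(batch):
    return [{"event": e, "enriched": True} for e in batch]

def aggregate(batch):
    return {"count": len(batch)}

def process_micro_batch(batch):
    # Stage 1: transform, persist, then forward to the next stage.
    enriched = enrich(batch)
    database["enriched"].extend(enriched)
    # Stage 2: further processing on the forwarded data, also persisted.
    summary = aggregate(enriched)
    database["aggregated"].append(summary)
    return summary

# Two micro-batches arriving from the stream:
process_micro_batch(["call", "sms"])
process_micro_batch(["data"])
```

[Each invocation plays the role a bolt would in Storm: it is triggered per batch of events, writes its output, and feeds the downstream stage.]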
>>>>>>>> Best
>>>>>>>> Ayan
>>>>>>>>
>>>>>>>> On 17 Jun 2015 12:13, "Will Briggs" <[email protected]> wrote:
>>>>>>>>
>>>>>>>> The programming models for the two frameworks are conceptually rather different; I haven't worked with Storm for quite some time, but based on my old experience with it, I would equate Spark Streaming more with Storm's Trident API than with the raw Bolt API. Even then there are significant differences, but it's a bit closer.
>>>>>>>>
>>>>>>>> If you can share your use case, we might be able to provide better guidance.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Will
>>>>>>>>
>>>>>>>> On June 16, 2015, at 9:46 PM, [email protected] wrote:
>>>>>>>>
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> I am evaluating Spark vs. Storm (Spark Streaming), and I am not able to see what the equivalent of a Bolt in Storm is inside Spark.
>>>>>>>>
>>>>>>>> Any help on this will be appreciated.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Ashish
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> To unsubscribe, e-mail: [email protected]
>>>>>>>> For additional commands, e-mail: [email protected]
