Xichen_tju,

I recently evaluated Storm over a period of months (using 3 2U servers, 2.4GHz 
CPU, 24GB RAM) and was not able to achieve a realistic scale for my business 
domain needs.  Storm is really only a framework that lets you plug in code to 
do whatever you need in a distributed system, so it’s completely flexible and 
distributable, but that comes at a price.  In Storm, one of the biggest 
performance hits came down to how the “acks” work within the tuple trees.  You 
can let the framework ack messages between spouts and/or bolts by default, but 
in the end you will most likely want to manage acks yourself, depending on how 
much reliability your system needs (to replay messages…).  All this means is 
that if you don’t have massive amounts of data that you need to process within 
a few seconds (which I do), then Storm may work well for you, but your 
performance will diminish as you add more and more business rules (unless of 
course you add more servers for processing).  If you need to ingest 1GBps or 
more, then you may want to reevaluate, since your server scale may not mesh 
with your overall processing needs.
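
To make the ack cost concrete, here is a minimal sketch of the per-tuple-tree 
bookkeeping, written in plain Python rather than against Storm's API.  It 
assumes the XOR scheme described in Storm's documentation: each tuple id is 
XORed into a running value once when it is emitted and once when it is acked, 
and the tree counts as fully processed when the value returns to zero.  The 
class and method names are hypothetical.

```python
class AckTracker:
    """Toy model of tuple-tree ack bookkeeping (hypothetical, not Storm's API)."""

    def __init__(self):
        # root tuple id -> XOR of every tuple id still outstanding in its tree
        self.pending = {}

    def emit_root(self, root_id):
        # A spout emits a root tuple; its own id is the first outstanding id.
        self.pending[root_id] = root_id

    def anchor(self, root_id, child_id):
        # A bolt emits a child tuple anchored to the tree: XOR its id in.
        self.pending[root_id] ^= child_id

    def ack(self, root_id, tuple_id):
        # Acking XORs the same id in a second time, which cancels it out.
        self.pending[root_id] ^= tuple_id
        if self.pending[root_id] == 0:
            del self.pending[root_id]
            return True  # every tuple in the tree has been acked
        return False


if __name__ == "__main__":
    tracker = AckTracker()
    tracker.emit_root(1)
    tracker.anchor(1, 2)
    tracker.anchor(1, 4)
    for tuple_id in (1, 2, 4):
        print(tuple_id, tracker.ack(1, tuple_id))
```

Every emit and every ack is an extra message to track, which is why deeper 
tuple trees (more bolts, more business rules) add overhead.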

I just recently started using Spark Streaming with Kafka and have been quite 
impressed by the level of performance being achieved.  I particularly like 
the fact that Spark isn’t just a framework: it also provides simple tools 
through convenient API methods.  Some of those features are reduceByKey 
(map/reduce-style aggregation), sliding windows, aggregation over sub-windows 
of time, etc.  Also, in my environment, I believe it’s going to be a great fit 
since we already use Hadoop and Spark should fit into that environment well.
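
As a rough illustration of what a keyed reduce over a sliding window computes, 
here is a minimal sketch in plain Python.  It does not use Spark; Spark 
Streaming exposes the same idea on pair DStreams as reduceByKeyAndWindow.  The 
function names below are hypothetical, and the window and slide are counted in 
micro-batches rather than seconds to keep the example small.

```python
from collections import deque


def reduce_by_key(pairs, func):
    """Combine all values that share a key using func."""
    out = {}
    for key, value in pairs:
        out[key] = func(out[key], value) if key in out else value
    return out


def sliding_reduce_by_key(batches, func, window, slide):
    """Yield one keyed reduction every `slide` batches over the last `window` batches.

    `batches` is a list of micro-batches, each a list of (key, value) pairs.
    """
    recent = deque(maxlen=window)  # old batches fall out of the window automatically
    for i, batch in enumerate(batches, start=1):
        recent.append(batch)
        if i % slide == 0:
            flat = [pair for b in recent for pair in b]
            yield reduce_by_key(flat, func)


if __name__ == "__main__":
    batches = [
        [("a", 1), ("b", 1)],
        [("a", 1)],
        [("b", 1), ("b", 1)],
        [("a", 1)],
    ]
    for counts in sliding_reduce_by_key(batches, lambda x, y: x + y, window=3, slide=2):
        print(counts)
```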

You should look into both Storm and Spark Streaming, but in the end it just 
depends on your needs.  If you are not looking for streaming, then Spark on 
Hadoop is a great option, since Spark can cache the dataset in memory across 
queries, which will be much faster than running Hive/Pig on top of Hadoop.  
I’m assuming you need some sort of streaming system for data flow, but if it 
doesn’t need to be real-time or near real-time, you may want to simply look at 
Hadoop, which you could always use Spark on top of for real-time queries.

Hope this helps…

Dan

 
On Jul 8, 2014, at 7:25 PM, Shao, Saisai <saisai.s...@intel.com> wrote:

> You can find performance comparison results in the Spark Streaming paper and 
> meetup slides; just google them.
> Actually, performance comparison is case by case and depends on your workload 
> design and your hardware and software configuration. There is no single 
> winner across all scenarios.
>  
> Thanks
> Jerry
>  
> From: xichen_tju@126 [mailto:xichen_...@126.com] 
> Sent: Wednesday, July 09, 2014 9:17 AM
> To: user@spark.apache.org
> Subject: Spark Streaming and Storm
>  
> hi all
> I am a newbie to Spark Streaming, and have used Storm before. Have you tested 
> the performance of both, and which one is better?
>  
> xichen_tju@126
