From: Haopu Wang
Date: 2015-06-19 18:47
To: Enno Shioji; Tathagata Das
CC: prajod.vettiyat...@wipro.com; Cody Koeninger; bit1...@163.com;
Jordan Pilat; Will Briggs; Ashish Soni; ayan guha; user@spark.apache.org;
Sateesh Kavuri; Spark Enthusiast; Sabarish Sasidharan
Subject: RE: RE: Spark or Storm

My question is not directly related: about the "exactly-once semantic",
the document (copied below) said Spark Streaming gives exactly-once
semantics, but actually from my test result, with checkpoint [...]
To: Tathagata Das
Cc: prajod.vettiyat...@wipro.com; Cody Koeninger; bit1...@163.com;
Jordan Pilat; Will Briggs; Ashish Soni; ayan guha; user@spark.apache.org;
Sateesh Kavuri; Spark Enthusiast; Sabarish Sasidharan
Subject: Re: RE: Spark or Storm

Fair enough, on second thought, just saying that it should be idempotent [...]

> ... use of checkpoints to persist the Kafka offsets in Spark Streaming
> itself, and not in ZooKeeper.
>
> Also this statement: ".. This allows one to build a Spark Streaming +
> Kafka pipelines with end-to-end exactly-once semantics (if your updates to
> downstream systems are idempotent or transactional)."
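For anyone wanting to try it, here is a minimal sketch of the direct-stream setup that quote describes, against the Spark 1.3+ Kafka direct API; broker list, topic name and checkpoint path are placeholders:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.kafka.KafkaUtils
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("direct-kafka-sketch")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // Offsets are tracked by the direct stream and recovered from the
    // checkpoint directory, not from ZooKeeper. (A production job would
    // build the context inside StreamingContext.getOrCreate so a restart
    // actually recovers from this checkpoint.)
    ssc.checkpoint("hdfs:///checkpoints/direct-kafka-sketch")

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))

    stream.map(_._2).print()
    ssc.start()
    ssc.awaitTermination()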
From: Cody Koeninger [mailto:c...@koeninger.org]
Sent: 18 June 2015 19:38
To: bit1...@163.com
Cc: Prajod S Vettiyattil (WT01 - BAS); jrpi...@gmail.com; eshi...@gmail.com;
wrbri...@gmail.com; asoni.le...@gmail.com; ayan guha; user;
sateesh.kav...@gmail.com; sparkenthusi...@yahoo.in;
sabarish.sasidha...@manthan.com
Subject: Re: RE: Spark or Storm

That general description is accurate, but not really a specific [...]
From: prajod.vettiyat...@wipro.com
Date: 2015-06-18 16:56
To: jrpi...@gmail.com; eshi...@gmail.com
CC: wrbri...@gmail.com; asoni.le...@gmail.com; guha.a...@gmail.com;
user@spark.apache.org; sateesh.kav...@gmail.com; sparkenthusi...@yahoo.in;
sabarish.sasidha...@manthan.com
Subject: RE: Spark or Storm

>> not being able to read from Kafka using multiple nodes [...]
[...]; Spark Enthusiast; Sabarish Sasidharan
Subject: Re: Spark or Storm

> not being able to read from Kafka using multiple nodes

Kafka is plenty capable of doing this, by clustering together multiple
consumer instances into a consumer group.

If your topic is sufficiently partitioned, the consumer group can consume
the topic in a parallelized fashion.

If it isn't, you [...]
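To make the consumer-group point concrete, a sketch using the newer Kafka consumer API; broker, group and topic names are placeholders. Run the same program on several nodes with the same group.id and Kafka balances the topic's partitions across them:

    import java.util.Properties
    import scala.collection.JavaConverters._
    import org.apache.kafka.clients.consumer.KafkaConsumer

    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")
    props.put("group.id", "demo-group")  // same group id on every node
    props.put("key.deserializer",
      "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer",
      "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(java.util.Arrays.asList("events"))

    while (true) {
      // Each instance only sees the partitions assigned to it.
      for (r <- consumer.poll(1000L).asScala)
        println(s"partition=${r.partition} offset=${r.offset} value=${r.value}")
    }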
To add more information beyond what Matei said and answer the original
question, here are other things to consider when comparing between Spark
Streaming and Storm.

* Unified programming model and semantics - On most occasions you have to
process the same data again in batch jobs. If you have two separate [...]
The major difference is that in Spark Streaming, there's no *need* for a
TridentState for state inside your computation. All the stateful operations
(reduceByWindow, updateStateByKey, etc) automatically handle exactly-once
processing, keeping updates in order, etc. Also, you don't need to run a [...]
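As an illustration of those stateful operations, a minimal updateStateByKey sketch; `events` is a hypothetical DStream of (key, 1) pairs, `ssc` an existing StreamingContext, and the checkpoint path a placeholder:

    // updateStateByKey requires checkpointing to be enabled.
    ssc.checkpoint("hdfs:///checkpoints/state-sketch")

    // The function returns the complete new state for the key; each
    // batch is applied to the state exactly once.
    def updateCount(batch: Seq[Int], state: Option[Long]): Option[Long] =
      Some(state.getOrElse(0L) + batch.sum)

    val counts = events.updateStateByKey(updateCount)
    counts.print()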
[...] known Parallel Programming Patterns especially suitable for
streaming data

From: Matei Zaharia [mailto:matei.zaha...@gmail.com]
Sent: Wednesday, June 17, 2015 7:14 PM
To: Enno Shioji
Cc: Ashish Soni; ayan guha; Sabarish Sasidharan; Spark Enthusiast; Will
Briggs; user; Sateesh Kavuri
Subject: Re: Spark or Storm
Hi Matei,

Ah, can't get more accurate than from the horse's mouth... If you don't
mind helping me understand it correctly...

From what I understand, Storm Trident does the following (when used with
Kafka):
1) Sit on Kafka Spout and create batches
2) Assign global sequential ID to the batches
3) [...]
This documentation is only for writes to an external system, but all the
counting you do within your streaming app (e.g. if you use
reduceByKeyAndWindow to keep track of a running count) is exactly-once.
When you write to a storage system, no matter which streaming framework
you use, you'll have [...]
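For concreteness, a sketch of such a running count; `pairs` is a hypothetical DStream of (key, Long) pairs, and the inverse function makes the windowed count incremental:

    import org.apache.spark.streaming.Minutes

    // Requires ssc.checkpoint(...) because of the inverse-reduce form.
    val running = pairs.reduceByKeyAndWindow(
      (a: Long, b: Long) => a + b,  // values entering the window
      (a: Long, b: Long) => a - b,  // values leaving the window
      Minutes(10),                  // window length
      Minutes(1))                   // slide interval
    running.print()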
Again, by Storm, you mean Storm Trident, correct?
On Wednesday, 17 June 2015 10:09 PM, Michael Segel
wrote:
Actually the reverse.

Spark Streaming is really a micro-batch system where the smallest window is
1/2 a second (500ms). So for CEP, it's not really a good idea.

So in terms of options…. Spark Streaming, Storm, Samza, Akka and others…

Storm is probably the easiest to pick up, Spark Streaming [...]
The thing is, even with that improvement, you still have to make updates
idempotent or transactional yourself. If you read
http://spark.apache.org/docs/latest/streaming-programming-guide.html#fault-tolerance-semantics
which refers to the latest version, it says:

Semantics of output operations

Output operations (like foreachRDD) have at-least-once semantics [...]
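What that section of the docs asks of you boils down to something like this sketch of an idempotent write inside foreachRDD; `KVStore` is a hypothetical storage client, and the point is that replaying a batch overwrites the same keys with the same values:

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { part =>
        // One connection per partition, created on the executor side.
        val store = KVStore.connect("store-host:6379")  // hypothetical client
        part.foreach { case (key, value) =>
          store.put(key, value)  // upsert on a deterministic key: replays are harmless
        }
        store.close()
      }
    }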
@Enno
As per the latest version and documentation, Spark Streaming does offer
exactly-once semantics using the improved Kafka integration. Note I have
not tested it yet.

Any feedback will be helpful if anyone has tried the same.

http://koeninger.github.io/kafka-exactly-once/#7
https://databricks.com/blog [...]
AFAIK KCL is *supposed* to provide fault tolerance and load balancing (plus
additionally, elastic scaling, unlike Storm), with Kinesis providing the
coordination. My understanding is that it's like a naked Storm worker
process that can consequently only do map.

I haven't really used it tho, so can't really [...]
Thanks for this. It's a KCL-based Kinesis application. But because it's
just a Java application, we are thinking to use Spark on EMR or Storm for
fault tolerance and load balancing. Is it a correct approach?

On 17 Jun 2015 23:07, "Enno Shioji" wrote: [...]
Processing stuff in batch is not the same thing as being transactional. If
you look at Storm, it will e.g. skip tuples that were already applied to a
state to avoid counting stuff twice etc. Spark doesn't come with such a
facility, so you could end up counting twice etc.
Streams can also be processed in micro-batches, which is the main reason
behind Spark Streaming, so what is the difference?

Ashish

On Wed, Jun 17, 2015 at 9:04 AM, Enno Shioji wrote: [...]
Hi Ayan,

Admittedly I haven't done much with Kinesis, but if I'm not mistaken you
should be able to use their "processor" interface for that. In this
example, it's incrementing a counter:
https://github.com/awslabs/amazon-kinesis-data-visualization-sample/blob/master/src/main/java/com/amazonaws/se [...]
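Roughly what that processor interface looks like, sketched in Scala against the KCL 1.x record-processor interface. Package paths and signatures vary across KCL versions, so treat this as an assumption-laden outline rather than exact API:

    import com.amazonaws.services.kinesis.clientlibrary.interfaces.{
      IRecordProcessor, IRecordProcessorCheckpointer}
    import com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownReason
    import com.amazonaws.services.kinesis.model.Record

    class CountingProcessor extends IRecordProcessor {
      private var count = 0L

      override def initialize(shardId: String): Unit = ()

      override def processRecords(records: java.util.List[Record],
                                  checkpointer: IRecordProcessorCheckpointer): Unit = {
        count += records.size()    // the "incrementing a counter" part
        checkpointer.checkpoint()  // record progress in the KCL lease table
      }

      override def shutdown(checkpointer: IRecordProcessorCheckpointer,
                            reason: ShutdownReason): Unit =
        if (reason == ShutdownReason.TERMINATE) checkpointer.checkpoint()
    }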
PS just to elaborate on my first sentence, the reason Spark (not streaming)
can offer exactly-once semantics is because its update operation is
idempotent. This is easy to do in a batch context because the input is
finite, but it's harder in a streaming context.

On Wed, Jun 17, 2015 at 2:00 PM, Enno Shioji wrote: [...]
So Spark (not streaming) does offer exactly-once. Spark Streaming, however,
can only do exactly-once semantics *if the update operation is idempotent*.
updateStateByKey's update operation is idempotent, because it completely
replaces the previous state.

So as long as you use Spark Streaming, you must [...]
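In other words, the update function hands back the whole new state rather than a delta, as in this sketch; `readings` is a hypothetical DStream of (deviceId, value) pairs where only the latest value matters:

    // The returned Option is the entire new state for the key, so applying
    // the same batch twice produces the same state (replacement, not a delta).
    def keepLatest(batch: Seq[Double], state: Option[Double]): Option[Double] =
      batch.lastOption.orElse(state)

    val latest = readings.updateStateByKey(keepLatest)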
As per my best understanding, Spark Streaming offers exactly-once
processing. Is this achieved only through updateStateByKey, or is there
another way to do the same?

Ashish

On Wed, Jun 17, 2015 at 8:48 AM, Enno Shioji wrote: [...]
In that case I assume you need exactly-once semantics. There's no
out-of-the-box way to do that in Spark. There is updateStateByKey, but it's
not practical with your use case as the state is too large (it'll try to
dump the entire intermediate state on every checkpoint, which would be
prohibitively expensive).
Great discussion!!

One question about a comment: "Also, you can do some processing with
Kinesis. If all you need to do is straightforward transformation and you
are reading from Kinesis to begin with, it might be an easier option to
just do the transformation in Kinesis."

- Do you mean a KCL application [...]
My use case is below.

We are going to receive a lot of events as a stream (basically a Kafka
stream), and then we need to process and compute.

Consider: you have a phone contract with AT&T, and every call / SMS / data
usage you make is an event; your bill then needs to be calculated on a
real-time basis, so [...]
I guess both. In terms of syntax, I was comparing it with Trident.

If you are joining, Spark Streaming actually does offer windowed join out
of the box. We couldn't use this though, as our event streams can grow
"out-of-sync", so we had to implement something on top of Storm. If your
event streams [...]
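The out-of-the-box windowed join mentioned above looks roughly like this; `clicks` and `impressions` are hypothetical DStreams of case classes with an `id` field:

    import org.apache.spark.streaming.Minutes

    // Window both sides, then join within the overlapping windows.
    val left   = clicks.map(c => (c.id, c)).window(Minutes(5), Minutes(1))
    val right  = impressions.map(i => (i.id, i)).window(Minutes(5), Minutes(1))
    val joined = left.join(right)  // DStream[(id, (click, impression))]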
When you say Storm, did you mean Storm with Trident or Storm?

My use case does not have simple transformations. There are complex events
that need to be generated by joining the incoming event stream.

Also, what do you mean by "No Back Pressure"?

On Wednesday, 17 June 2015 11:57 AM, Enno Shioji wrote: [...]
We've evaluated Spark Streaming vs. Storm and ended up sticking with Storm.

Some of the important drawbacks are:

Spark has no back pressure (the receiver rate limit can alleviate this to a
certain point, but it's far from ideal).

There is also no exactly-once semantics. (updateStateByKey can achieve
this, but it's not practical when the state is large, as it dumps the
entire intermediate state on every checkpoint.)
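The receiver rate limit referred to is just a configuration knob; a sketch, with placeholder values:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("rate-limited-stream")
      // Cap records/sec per receiver. Blunt, but keeps a bursting source
      // from overwhelming the app; it is not true back pressure.
      .set("spark.streaming.receiver.maxRate", "10000")
      // The equivalent knob for the Kafka direct stream, per partition.
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")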
Whatever you write in bolts would be the logic you want to apply to your
events. In Spark, that logic would be coded in map() or similar such
transformations and/or actions. Spark doesn't enforce a structure for
capturing your processing logic the way Storm does.

Regards
Sab
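To make that concrete: what a bolt does imperatively in execute() is just an ordinary function passed to a transformation in Spark; `events`, `enrich` and `saveToDb` here are hypothetical:

    // Storm: logic lives in a Bolt's execute(tuple).
    // Spark: the same logic is a function in a transformation chain.
    val enriched = events.map(e => enrich(e))          // transformation
    enriched.foreachRDD(rdd => rdd.foreach(saveToDb))  // output action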
Probably overloading the question a bit.

In Storm, Bolts have the functionality of getting triggered on events. Is
that kind of functionality possible with Spark Streaming? During each phase
of the data processing, the transformed data is stored to the database, and
this transformed data should then [...]
I have a use-case where a stream of incoming events has to be aggregated
and joined to create complex events. The aggregation will have to happen at
an interval of 1 minute (or less).

The pipeline is: send events -> enrich -> [...]
I have a similar scenario where we need to bring data from Kinesis to
HBase. Data velocity is 20k per 10 mins. A little manipulation of data will
be required, but that's regardless of the tool, so we will be writing that
piece as Java POJOs.

The whole environment is on AWS. HBase is on a long-running EMR, and
Kinesis [...]
The programming models for the two frameworks are conceptually rather
different; I haven't worked with Storm for quite some time, but based on my
old experience with it, I would equate Spark Streaming more with Storm's
Trident API, rather than with the raw Bolt API. Even then, there are
significant [...]