From: Haopu Wang
Date: 2015-06-19 18:47
To: Enno Shioji; Tathagata Das
CC: prajod.vettiyat...@wipro.com; Cody Koeninger; bit1...@163.com;
Jordan Pilat; Will Briggs; Ashish Soni; ayan guha; user@spark.apache.org;
Sateesh Kavuri; Spark Enthusiast; Sabarish Sasidharan
Subject: RE: RE: Spark or Storm

My question is not directly related: about the "exactly-once semantic",
the document (copied below) said Spark Streaming gives exactly-once
semantics, but actually from my test result, with checkpoint [...]
To: Tathagata Das
Cc: prajod.vettiyat...@wipro.com; Cody Koeninger; bit1...@163.com;
Jordan Pilat; Will Briggs; Ashish Soni; ayan guha; user@spark.apache.org;
Sateesh Kavuri; Spark Enthusiast; Sabarish Sasidharan
Subject: Re: RE: Spark or Storm

Fair enough, on second thought, just saying that it should be idempotent [...]

> ... use of checkpoints to persist the Kafka offsets in Spark Streaming
> itself, and not in ZooKeeper.
>
> Also this statement: ".. This allows one to build a Spark Streaming +
> Kafka pipelines with end-to-end exactly-once semantics (if your updates to
> downstream systems are idempotent or transactional)."
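For anyone wanting to try it, here is a minimal sketch of the direct-stream setup that quote describes, against the Spark 1.3+ Kafka direct API; broker list, topic name and checkpoint path are placeholders:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.kafka.KafkaUtils
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("direct-kafka-sketch")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // Offsets are tracked by the direct stream and recovered from the
    // checkpoint directory, not from ZooKeeper. (A production job would
    // build the context inside StreamingContext.getOrCreate so a restart
    // actually recovers from this checkpoint.)
    ssc.checkpoint("hdfs:///checkpoints/direct-kafka-sketch")

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))

    stream.map(_._2).print()
    ssc.start()
    ssc.awaitTermination()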
From: Cody Koeninger [mailto:c...@koeninger.org]
Sent: 18 June 2015 19:38
To: bit1...@163.com
Cc: Prajod S Vettiyattil (WT01 - BAS); jrpi...@gmail.com; eshi...@gmail.com;
wrbri...@gmail.com; asoni.le...@gmail.com; ayan guha; user;
sateesh.kav...@gmail.com; sparkenthusi...@yahoo.in;
sabarish.sasidha...@manthan.com
Subject: Re: RE: Spark or Storm

That general description is accurate, but not really a specific [...]
From: prajod.vettiyat...@wipro.com
Date: 2015-06-18 16:56
To: jrpi...@gmail.com; eshi...@gmail.com
CC: wrbri...@gmail.com; asoni.le...@gmail.com; guha.a...@gmail.com;
user@spark.apache.org; sateesh.kav...@gmail.com; sparkenthusi...@yahoo.in;
sabarish.sasidha...@manthan.com
Subject: RE: Spark or Storm

>> not being able to read from Kafka using multiple nodes [...]
[...]; Spark Enthusiast; Sabarish Sasidharan
Subject: Re: Spark or Storm

> not being able to read from Kafka using multiple nodes

Kafka is plenty capable of doing this, by clustering together multiple
consumer instances into a consumer group.

If your topic is sufficiently partitioned, the consumer group can consume
the topic in a parallelized fashion.

If it isn't, you [...]
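To make the consumer-group point concrete, a sketch using the newer Kafka consumer API; broker, group and topic names are placeholders. Run the same program on several nodes with the same group.id and Kafka balances the topic's partitions across them:

    import java.util.Properties
    import scala.collection.JavaConverters._
    import org.apache.kafka.clients.consumer.KafkaConsumer

    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")
    props.put("group.id", "demo-group")  // same group id on every node
    props.put("key.deserializer",
      "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer",
      "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(java.util.Arrays.asList("events"))

    while (true) {
      // Each instance only sees the partitions assigned to it.
      for (r <- consumer.poll(1000L).asScala)
        println(s"partition=${r.partition} offset=${r.offset} value=${r.value}")
    }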
To add more information beyond what Matei said and answer the original
question, here are other things to consider when comparing between Spark
Streaming and Storm.

* Unified programming model and semantics - On most occasions you have to
process the same data again in batch jobs. If you have two separate [...]
The major difference is that in Spark Streaming, there's no *need* for a
TridentState for state inside your computation. All the stateful operations
(reduceByWindow, updateStateByKey, etc) automatically handle exactly-once
processing, keeping updates in order, etc. Also, you don't need to run a [...]
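As an illustration of those stateful operations, a minimal updateStateByKey sketch; `events` is a hypothetical DStream of (key, 1) pairs, `ssc` an existing StreamingContext, and the checkpoint path a placeholder:

    // updateStateByKey requires checkpointing to be enabled.
    ssc.checkpoint("hdfs:///checkpoints/state-sketch")

    // The function returns the complete new state for the key; each
    // batch is applied to the state exactly once.
    def updateCount(batch: Seq[Int], state: Option[Long]): Option[Long] =
      Some(state.getOrElse(0L) + batch.sum)

    val counts = events.updateStateByKey(updateCount)
    counts.print()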
[...] known Parallel Programming Patterns especially suitable for
streaming data

From: Matei Zaharia [mailto:matei.zaha...@gmail.com]
Sent: Wednesday, June 17, 2015 7:14 PM
To: Enno Shioji
Cc: Ashish Soni; ayan guha; Sabarish Sasidharan; Spark Enthusiast; Will
Briggs; user; Sateesh Kavuri
Subject: Re: Spark or Storm
Hi Matei,

Ah, can't get more accurate than from the horse's mouth... If you don't
mind helping me understand it correctly...

From what I understand, Storm Trident does the following (when used with
Kafka):
1) Sit on Kafka Spout and create batches
2) Assign global sequential ID to the batches
3) [...]
This documentation is only for writes to an external system, but all the
counting you do within your streaming app (e.g. if you use
reduceByKeyAndWindow to keep track of a running count) is exactly-once.
When you write to a storage system, no matter which streaming framework
you use, you'll have [...]
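For concreteness, a sketch of such a running count; `pairs` is a hypothetical DStream of (key, Long) pairs, and the inverse function makes the windowed count incremental:

    import org.apache.spark.streaming.Minutes

    // Requires ssc.checkpoint(...) because of the inverse-reduce form.
    val running = pairs.reduceByKeyAndWindow(
      (a: Long, b: Long) => a + b,  // values entering the window
      (a: Long, b: Long) => a - b,  // values leaving the window
      Minutes(10),                  // window length
      Minutes(1))                   // slide interval
    running.print()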
Again, by Storm, you mean Storm Trident, correct?
On Wednesday, 17 June 2015 10:09 PM, Michael Segel
wrote:
Actually the reverse.

Spark Streaming is really a micro-batch system where the smallest window is
1/2 a second (500ms). So for CEP, it's not really a good idea.

So in terms of options…. Spark Streaming, Storm, Samza, Akka and others…

Storm is probably the easiest to pick up, Spark Streaming [...]
The thing is, even with that improvement, you still have to make updates
idempotent or transactional yourself. If you read
http://spark.apache.org/docs/latest/streaming-programming-guide.html#fault-tolerance-semantics
which refers to the latest version, it says:

Semantics of output operations

Output operations (like foreachRDD) have at-least-once semantics [...]
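What that section of the docs asks of you boils down to something like this sketch of an idempotent write inside foreachRDD; `KVStore` is a hypothetical storage client, and the point is that replaying a batch overwrites the same keys with the same values:

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { part =>
        // One connection per partition, created on the executor side.
        val store = KVStore.connect("store-host:6379")  // hypothetical client
        part.foreach { case (key, value) =>
          store.put(key, value)  // upsert on a deterministic key: replays are harmless
        }
        store.close()
      }
    }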
@Enno
As per the latest version and documentation, Spark Streaming does offer
exactly-once semantics using the improved Kafka integration. Note I have
not tested it yet.

Any feedback will be helpful if anyone has tried the same.

http://koeninger.github.io/kafka-exactly-once/#7
https://databricks.com/blog [...]
AFAIK KCL is *supposed* to provide fault tolerance and load balancing (plus
additionally, elastic scaling, unlike Storm), with Kinesis providing the
coordination. My understanding is that it's like a naked Storm worker
process that can consequently only do map.

I haven't really used it tho, so can't really [...]
Thanks for this. It's a KCL-based Kinesis application. But because it's
just a Java application, we are thinking to use Spark on EMR or Storm for
fault tolerance and load balancing. Is it a correct approach?

On 17 Jun 2015 23:07, "Enno Shioji" wrote: [...]
Processing stuff in batch is not the same thing as being transactional. If
you look at Storm, it will e.g. skip tuples that were already applied to a
state to avoid counting stuff twice etc. Spark doesn't come with such a
facility, so you could end up counting twice etc.
Streams can also be processed in micro-batches, which is the main reason
behind Spark Streaming, so what is the difference?

Ashish

On Wed, Jun 17, 2015 at 9:04 AM, Enno Shioji wrote: [...]
Hi Ayan,

Admittedly I haven't done much with Kinesis, but if I'm not mistaken you
should be able to use their "processor" interface for that. In this
example, it's incrementing a counter:
https://github.com/awslabs/amazon-kinesis-data-visualization-sample/blob/master/src/main/java/com/amazonaws/se [...]
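Roughly what that processor interface looks like, sketched in Scala against the KCL 1.x record-processor interface. Package paths and signatures vary across KCL versions, so treat this as an assumption-laden outline rather than exact API:

    import com.amazonaws.services.kinesis.clientlibrary.interfaces.{
      IRecordProcessor, IRecordProcessorCheckpointer}
    import com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownReason
    import com.amazonaws.services.kinesis.model.Record

    class CountingProcessor extends IRecordProcessor {
      private var count = 0L

      override def initialize(shardId: String): Unit = ()

      override def processRecords(records: java.util.List[Record],
                                  checkpointer: IRecordProcessorCheckpointer): Unit = {
        count += records.size()    // the "incrementing a counter" part
        checkpointer.checkpoint()  // record progress in the KCL lease table
      }

      override def shutdown(checkpointer: IRecordProcessorCheckpointer,
                            reason: ShutdownReason): Unit =
        if (reason == ShutdownReason.TERMINATE) checkpointer.checkpoint()
    }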
PS just to elaborate on my first sentence, the reason Spark (not streaming)
can offer exactly-once semantics is because its update operation is
idempotent. This is easy to do in a batch context because the input is
finite, but it's harder in a streaming context.

On Wed, Jun 17, 2015 at 2:00 PM, Enno Shioji wrote: [...]
So Spark (not streaming) does offer exactly-once. Spark Streaming, however,
can only do exactly-once semantics *if the update operation is idempotent*.
updateStateByKey's update operation is idempotent, because it completely
replaces the previous state.

So as long as you use Spark Streaming, you must [...]
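In other words, the update function hands back the whole new state rather than a delta, as in this sketch; `readings` is a hypothetical DStream of (deviceId, value) pairs where only the latest value matters:

    // The returned Option is the entire new state for the key, so applying
    // the same batch twice produces the same state (replacement, not a delta).
    def keepLatest(batch: Seq[Double], state: Option[Double]): Option[Double] =
      batch.lastOption.orElse(state)

    val latest = readings.updateStateByKey(keepLatest)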
As per my best understanding, Spark Streaming offers exactly-once
processing. Is this achieved only through updateStateByKey, or is there
another way to do the same?

Ashish

On Wed, Jun 17, 2015 at 8:48 AM, Enno Shioji wrote: [...]
In that case I assume you need exactly-once semantics. There's no
out-of-the-box way to do that in Spark. There is updateStateByKey, but it's
not practical with your use case as the state is too large (it'll try to
dump the entire intermediate state on every checkpoint, which would be
prohibitively expensive).
Great discussion!!

One question about a comment: "Also, you can do some processing with
Kinesis. If all you need to do is straightforward transformation and you
are reading from Kinesis to begin with, it might be an easier option to
just do the transformation in Kinesis."

- Do you mean a KCL application [...]
My use case is below.

We are going to receive a lot of events as a stream (basically a Kafka
stream), and then we need to process and compute.

Consider: you have a phone contract with AT&T, and every call / SMS / data
usage you make is an event; your bill then needs to be calculated on a
real-time basis, so [...]
I guess both. In terms of syntax, I was comparing it with Trident.

If you are joining, Spark Streaming actually does offer windowed join out
of the box. We couldn't use this though, as our event streams can grow
"out-of-sync", so we had to implement something on top of Storm. If your
event streams [...]
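The out-of-the-box windowed join mentioned above looks roughly like this; `clicks` and `impressions` are hypothetical DStreams of case classes with an `id` field:

    import org.apache.spark.streaming.Minutes

    // Window both sides, then join within the overlapping windows.
    val left   = clicks.map(c => (c.id, c)).window(Minutes(5), Minutes(1))
    val right  = impressions.map(i => (i.id, i)).window(Minutes(5), Minutes(1))
    val joined = left.join(right)  // DStream[(id, (click, impression))]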
When you say Storm, did you mean Storm with Trident or Storm?

My use case does not have simple transformations. There are complex events
that need to be generated by joining the incoming event stream.

Also, what do you mean by "No Back Pressure"?

On Wednesday, 17 June 2015 11:57 AM, Enno Shioji wrote: [...]
We've evaluated Spark Streaming vs. Storm and ended up sticking with Storm.

Some of the important drawbacks are:

Spark has no back pressure (the receiver rate limit can alleviate this to a
certain point, but it's far from ideal).

There is also no exactly-once semantics. (updateStateByKey can achieve
this, but it's not practical when the state is large, as it dumps the
entire intermediate state on every checkpoint.)
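The receiver rate limit referred to is just a configuration knob; a sketch, with placeholder values:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("rate-limited-stream")
      // Cap records/sec per receiver. Blunt, but keeps a bursting source
      // from overwhelming the app; it is not true back pressure.
      .set("spark.streaming.receiver.maxRate", "10000")
      // The equivalent knob for the Kafka direct stream, per partition.
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")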
Whatever you write in bolts would be the logic you want to apply to your
events. In Spark, that logic would be coded in map() or similar such
transformations and/or actions. Spark doesn't enforce a structure for
capturing your processing logic the way Storm does.

Regards
Sab
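To make that concrete: what a bolt does imperatively in execute() is just an ordinary function passed to a transformation in Spark; `events`, `enrich` and `saveToDb` here are hypothetical:

    // Storm: logic lives in a Bolt's execute(tuple).
    // Spark: the same logic is a function in a transformation chain.
    val enriched = events.map(e => enrich(e))          // transformation
    enriched.foreachRDD(rdd => rdd.foreach(saveToDb))  // output action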
Probably overloading the question a bit.

In Storm, Bolts have the functionality of getting triggered on events. Is
that kind of functionality possible with Spark Streaming? During each phase
of the data processing, the transformed data is stored to the database, and
this transformed data should then [...]
I have a use-case where a stream of incoming events has to be aggregated
and joined to create complex events. The aggregation will have to happen at
an interval of 1 minute (or less).

The pipeline is: send events -> enrich -> [...]
I have a similar scenario where we need to bring data from Kinesis to
HBase. Data velocity is 20k per 10 mins. A little manipulation of data will
be required, but that's regardless of the tool, so we will be writing that
piece as Java POJOs.

The whole environment is on AWS. HBase is on a long-running EMR, and
Kinesis [...]
The programming models for the two frameworks are conceptually rather
different; I haven't worked with Storm for quite some time, but based on my
old experience with it, I would equate Spark Streaming more with Storm's
Trident API, rather than with the raw Bolt API. Even then, there are
significant [...]