Re: Kafka Streams vs Spark Streaming : reduce by window

2017-06-20 Thread Guozhang Wang

Re: Kafka Streams vs Spark Streaming : reduce by window

2017-06-18 Thread Paolo Patierno
From: Michal Borowiecki Sent: Sunday, June 18, 2017 9:34 AM To: d...@kafka.apache.org; Jay Kreps Cc: users@kafka.apache.org; Matthias J. Sax Subject: Re: Kafka Streams vs Spark

Re: Kafka Streams vs Spark Streaming : reduce by window

2017-06-18 Thread Michal Borowiecki
From: Eno Thereska <mailto:eno.there...@gmail.com> Sent: Thursday, June 15, 2017 3:57 PM To: users@kafka.apache.org <mailto:user

Re: Kafka Streams vs Spark Streaming : reduce by window

2017-06-16 Thread Matthias J. Sax
your article! Thanks! >>> Paolo

Re: Kafka Streams vs Spark Streaming : reduce by window

2017-06-16 Thread Jay Kreps

Re: Kafka Streams vs Spark Streaming : reduce by window

2017-06-16 Thread Jay Kreps

Re: Kafka Streams vs Spark Streaming : reduce by window

2017-06-16 Thread Michal Borowiecki
From: Eno Thereska Sent: Thursday, June 15, 2017 1:45 PM To: users@kafka.apache.org Subject: Re: Kafka Streams vs Spark Streaming : reduce by window Hi Pa

Re: Kafka Streams vs Spark Streaming : reduce by window

2017-06-15 Thread Matthias J. Sax
> From: Eno Thereska > Sent: Thursday, June 15, 2017 3:57 PM > To: users@ka

Re: Kafka Streams vs Spark Streaming : reduce by window

2017-06-15 Thread Paolo Patierno
From: Eno Thereska Sent: Thursday, June 15, 2017 3:57 PM To: users@kafka.apache.org Su

Re: Kafka Streams vs Spark Streaming : reduce by window

2017-06-15 Thread Eno Thereska
> From: Eno Thereska > Sent: Thursday, June 15, 2017 1:45 PM > To: users@kafka.apache.org > Subject: Re: Kafka Streams vs Spark Streaming : reduce by window >

Re: Kafka Streams vs Spark Streaming : reduce by window

2017-06-15 Thread Paolo Patierno
From: Eno Thereska Sent: Thursday, June 15, 2017 1:45 PM To: users@kafka.apache.org Subject: Re: Kafka Streams vs Spark Streaming : reduce by window Hi Paolo, That is indeed correct. We don’t believe in cl

Re: Kafka Streams vs Spark Streaming : reduce by window

2017-06-15 Thread Eno Thereska
Hi Paolo, That is indeed correct. We don’t believe in closing windows in Kafka Streams. You could reduce the number of downstream records by using record caches: http://docs.confluent.io/current/streams/developer-guide.html#record-caches-in-the-dsl
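
The effect of a record cache on the number of downstream records can be sketched with a small simulation. This is plain Python, not the Kafka Streams API: the function name `forward_with_cache` and the `flush_every` parameter are hypothetical, and a flush after a fixed number of updates merely stands in for a flush on the commit interval or when the cache fills up.

```python
from collections import OrderedDict


def forward_with_cache(updates, flush_every):
    """Simulate a record cache (illustrative only, not Kafka Streams code).

    Keeps only the latest value per key and forwards the cached entries
    downstream every `flush_every` updates.
    """
    cache = OrderedDict()
    downstream = []
    for i, (key, value) in enumerate(updates, start=1):
        cache[key] = value            # a newer update overwrites the older one
        if i % flush_every == 0:      # stand-in for a commit-interval flush
            downstream.extend(cache.items())
            cache.clear()
    downstream.extend(cache.items())  # final flush of whatever is left
    return downstream


# Six updates to two window keys (hypothetical data).
updates = [("w1", 3), ("w1", 7), ("w1", 9), ("w2", 1), ("w2", 4), ("w1", 12)]

print(len(forward_with_cache(updates, flush_every=1)))  # every update is forwarded
print(forward_with_cache(updates, flush_every=3))       # only the latest value per key per flush
```

The window itself is never closed in this model: a later update for `w1` is still forwarded after an earlier flush, only less often.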

Re: Kafka Streams vs Spark Streaming : reduce by window

2017-06-15 Thread Tom Bentley
It sounds like you want a tumbling time window, rather than a sliding window https://kafka.apache.org/documentation/streams#streams_dsl_windowing On 15 June 2017 at 14:38, Paolo Patierno wrote: > Hi, > > > using the streams library I noticed a difference (or there is a lack of > knowledge on my
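
As a rough illustration of the tumbling-window suggestion, the sketch below assigns each record to exactly one non-overlapping 5-second window and keeps the maximum per window. It is plain Python rather than the Kafka Streams DSL; the function name `tumbling_max` and the sample records are assumptions made for the example.

```python
def tumbling_max(records, size_ms):
    """Return {window_start_ms: max value} for non-overlapping windows.

    Each (timestamp_ms, value) record falls into exactly one window,
    whose start is the timestamp rounded down to a multiple of size_ms.
    """
    maxima = {}
    for ts, value in records:
        start = ts - (ts % size_ms)
        maxima[start] = max(value, maxima.get(start, value))
    return maxima


# Hypothetical (timestamp_ms, value) records.
records = [(0, 4), (1200, 9), (4999, 2), (5000, 7), (9100, 5), (10000, 1)]
print(tumbling_max(records, 5000))
```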

Kafka Streams vs Spark Streaming : reduce by window

2017-06-15 Thread Paolo Patierno
Hi, using the streams library I noticed a difference (or there is a lack of knowledge on my side) with Apache Spark. Imagine the following scenario ... I have a source topic where numeric values come in and I want to check the maximum value in the latest 5 seconds but ... putting the max value in

Re: Kafka Streams vs Spark Streaming

2017-03-01 Thread Matthias J. Sax
independently. >>>>>> >>>>>> Then you can run cluster of stream threads (same and multiple machines), >>>>>> each processing a partition. >>>>>> >>>>>> Having said this, we however run into lot of issues of frequent s

Re: Kafka Streams vs Spark Streaming

2017-02-28 Thread Matthias J. Sax
a single machine. >>>>> Now we don't know if this is some bad VM configuration issue or some >>>>> problem with kafka streams/rocks db integration, we are still working on >>>>> that. >>>>> >>>>> So I would suggest if yo

Re: Kafka Streams vs Spark Streaming

2017-02-28 Thread Steven Schlansker
> Also make sure not to create big time windows and set a not so long >>>> retention time, so that state stores size is limited. >>>> >>>> We use a sliding 5 minutes window of size 10 minutes and retention of 30 >>>> minutes and see overall performance
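
To see why window size and advance affect state-store size, here is a minimal sketch that lists the windows a single record belongs to. It assumes the "sliding 5 minutes window of size 10 minutes" above means a hopping window of size 10 minutes that advances every 5 minutes, with window starts aligned to multiples of the advance; the function name `hopping_windows` is hypothetical and this is plain Python, not Kafka Streams code.

```python
MIN = 60_000  # one minute in milliseconds


def hopping_windows(ts_ms, size_ms, advance_ms):
    """Return the start times of all hopping windows that contain ts_ms.

    Assumes window starts are multiples of advance_ms and never negative.
    """
    start = (max(0, ts_ms - size_ms + advance_ms) // advance_ms) * advance_ms
    starts = []
    while start <= ts_ms:
        starts.append(start)
        start += advance_ms
    return starts


# A record at minute 12 falls into the windows starting at minute 5 and minute 10.
print([s // MIN for s in hopping_windows(12 * MIN, 10 * MIN, 5 * MIN)])
```

Under these assumptions every record updates size / advance = 2 windows, so a larger size, a smaller advance, or a longer retention all mean more window entries kept per key.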

Re: Kafka Streams vs Spark Streaming

2017-02-28 Thread Matthias J. Sax
Tianji, Streams provides at-least-once processing guarantees. Thus, all flush/commits must be aligned -- otherwise, this guarantee might break. -Matthias On 2/28/17 6:40 AM, Damian Guy wrote: > Hi Tianji, > > The changelogs are flushed on the commit interval. It isn't currently > possible to

Re: Kafka Streams vs Spark Streaming

2017-02-28 Thread Matthias J. Sax
tate stores). Other tools such as Flink or Spark work in >>> a >>>> similar fashion, there's no free lunch. >>>> >>>> One option, which you brought up above, is to disable the fault tolerance >>>> functionality for state by disabling th

Re: Kafka Streams vs Spark Streaming

2017-02-28 Thread Steven Schlansker
ross the >>> network (from your app's state store changelogs to the Kafka cluster and >>> vice versa), though you may need to tune some parameters in your >> situation >>> because your key space has high cardinality and message volume per key is >>> re

Re: Kafka Streams vs Spark Streaming

2017-02-28 Thread Damian Guy
Hi Tianji, The changelogs are flushed on the commit interval. It isn't currently possible to change this. Thanks, Damian On Tue, 28 Feb 2017 at 14:00 Tianji Li wrote: > Hi Guys, > > Thanks very much for your help. > > A final question, is it possible to use different commit intervals for > st

Re: Kafka Streams vs Spark Streaming

2017-02-28 Thread Tianji Li
Hi Guys, Thanks very much for your help. A final question: is it possible to use different commit intervals for state-store changelog topics and for sink topics? Thanks Tianji

Re: Kafka Streams vs Spark Streaming

2017-02-28 Thread Michael Noll
he "changelog" feature) of state > > stores: > > http://docs.confluent.io/current/streams/developer- > > guide.html#enable-disable-state-store-changelogs > > > > > I do have a Spark Cluster, but I am not convince how Spark Streaming > can > > do this

Re: Kafka Streams vs Spark Streaming

2017-02-27 Thread Sachin Mittal
fferently. > > Guozhang, could you comment anything regarding Kafka Streams vs Spark > Streaming, especially > > in terms of aggregations/groupbys/joins implementation logic? > > As you are hinting at yourself, if you want fault-tolerant state, then this > fault tolerance comes at

Re: Kafka Streams vs Spark Streaming

2017-02-27 Thread Guozhang Wang
Kohki, Thanks for the explanation, it's very helpful. As we discussed in another email thread you started, originally I thought the motivation to use "explicit triggers" (i.e. what it achieves with your watermark) was due to application logic, i.e. whenever you have received a record that trigg

Re: Kafka Streams vs Spark Streaming

2017-02-27 Thread Guozhang Wang
nt/streams/developer- > guide.html#enable-disable-state-store-changelogs > > > I do have a Spark Cluster, but I am not convince how Spark Streaming can > do this differently. > > Guozhang, could you comment anything regarding Kafka Streams vs Spark > Streaming, especially

Re: Kafka Streams vs Spark Streaming

2017-02-27 Thread Kohki Nishio
Guozhang, It's a bit difficult to explain, but let me try ... the basic idea is that we can assume most messages have the same clock (per partition at least), so if an offset carries metadata about the target time of the offset, fail-over works. Offset = 1 Metadata Time = 2/

Re: Kafka Streams vs Spark Streaming

2017-02-27 Thread Michael Noll
> I do have a Spark Cluster, but I am not convinced how Spark Streaming can do this differently. > Guozhang, could you comment anything regarding Kafka Streams vs Spark Streaming, especially > in terms of aggregations/groupbys/joins implementation logic? As you are hinting at yourself, if y

Re: Kafka Streams vs Spark Streaming

2017-02-27 Thread Tianji Li
, so that the state stores are synced to brokers slower than default. I do have a Spark Cluster, but I am not convinced how Spark Streaming can do this differently. Guozhang, could you comment anything regarding Kafka Streams vs Spark Streaming, especially in terms of aggregations/groupbys/joins

Re: Kafka Streams vs Spark Streaming

2017-02-26 Thread Guozhang Wang
Hello Kohki, Given your data traffic and the state volume I cannot think of a better solution than using a large number of partitioned local states. I'm wondering how a "per partition watermark" can help with your traffic? Guozhang On Sun, Feb 26, 2017 at 10:45 AM, Kohki Nishio wrote:

Re: Kafka Streams vs Spark Streaming

2017-02-26 Thread Guozhang Wang
Hello Tianji, As Kohki mentioned, in Streams joins and aggregations are always done pre-partitioned, and hence locally. So there won't be any inter-node communications needed to execute the join / aggregations. Also they can be hosted as persistent local state stores so you don't need to keep them
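
The idea that pre-partitioned aggregations run locally can be sketched as follows. This is a plain-Python simulation, not Kafka Streams code: `partition_for` and `local_counts` are hypothetical names, and CRC32 is only a stand-in for the partitioner Kafka actually uses. The point is that all records with the same key land in the same partition, so each task can aggregate its keys without talking to other nodes.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    """Map a key to a partition (CRC32 is a stand-in hash for this sketch)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def local_counts(records, num_partitions):
    """Count records per key, with one independent state dict per partition."""
    tasks = [defaultdict(int) for _ in range(num_partitions)]
    for key, _value in records:
        tasks[partition_for(key, num_partitions)][key] += 1
    return tasks


# Hypothetical (key, value) records.
records = [("sensor-a", 1), ("sensor-b", 5), ("sensor-a", 3),
           ("sensor-c", 2), ("sensor-b", 8), ("sensor-a", 4)]

for partition, state in enumerate(local_counts(records, num_partitions=3)):
    print(partition, dict(state))
```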

Re: Kafka Streams vs Spark Streaming

2017-02-26 Thread Kohki Nishio
Tianji, KStream is indeed Append mode as long as I do stateless processing, but an aggregation is a stateful operation: it turns the stream into a KTable, which works in Update mode. In regard to your aggregation, I believe Kafka's aggregation works for a single partition not over multiple partiti

Re: Kafka Streams vs Spark Streaming

2017-02-26 Thread Kohki Nishio
Guozhang, Let me explain what I'm trying to do. The message volume is large (TB per Day) and that is coming to a topic. Now I want to do per minute aggregation(Windowed) and send the output to the downstream (a topic) (Topic1 - Large Volume) -> [Stream App] -> (Topic2 - Large Volume) I assume th

Re: Kafka Streams vs Spark Streaming

2017-02-25 Thread Tianji Li
Hi Kohki, Thanks very much for providing your investigation results. Regarding 'append' mode with Kafka Streams, isn't KStream the thing you want? Hi Guozhang, Thanks for the pointers to the two blogs. I read one of them before and just had a look at the other one. What I am hoping to do i

Re: Kafka Streams vs Spark Streaming

2017-02-25 Thread Guozhang Wang
Hello Kohki, Thanks for the email. I'd like to learn what your concern is about the size of the state store. From your description it's a bit hard to figure out, but I'd guess you have lots of state stores while each of them is relatively small? Hello Tianji, Regarding your question about maturity a

Re: Kafka Streams vs Spark Streaming

2017-02-25 Thread Kohki Nishio
I did a bit of research on that matter recently. The comparison is between Spark Structured Streaming (SSS) and Kafka Streams. Both are relatively new (~1y) and trying to solve similar problems; however, if you go with Spark, you have to go with a cluster. If your environment already has a cluster,

Kafka Streams vs Spark Streaming

2017-02-24 Thread Tianji Li
Hi there, Can anyone give a good explanation of the cases in which Kafka Streams is preferred, and the cases in which Spark Streaming is better? Thanks Tianji