Re: Kafka Streams vs Spark Streaming

Tianji Li Sat, 25 Feb 2017 13:12:28 -0800

Hi Kohki,

Thanks very much for providing your investigation results. Regarding 'append' mode with Kafka Streams, isn't KStream the thing you want?


Hi Guozhang,

Thanks for the pointers to the two blogs. I read one of them before and just had a look at the other one.

What I am hoping to do is described below. Can you help me decide whether Kafka Streams is a good fit?

We have a few data sources, and we are hoping to correlate these sources and then do aggregations, as *a stream in real-time*.

The number of aggregations is around 100, which means that, if using Kafka Streams, we need to maintain around 100 state stores with 100 change-log topics behind the scenes for the joins and aggregations.

The number of unique entries in each of these state stores is expected to be below 100M. Each record is around 1K bytes, so each state store is expected to be ~100G bytes in size. The total number of bytes across all these state stores is thus around 10T bytes.

If all these stores are kept in memory, this translates into around 50 machines with 256G bytes each for this purpose alone.
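
The sizing above can be checked with a quick back-of-the-envelope calculation. This is only a sketch using the figures quoted in this email; decimal units (1K = 1,000 bytes) and the variable names are assumptions made for illustration.

```python
# Back-of-the-envelope sizing using the figures quoted in this email.
# Decimal units (1 KB = 1_000 bytes) are an assumption; the email does not say.
KB, GB, TB = 10**3, 10**9, 10**12

num_stores = 100                 # ~100 aggregations -> ~100 state stores
entries_per_store = 100_000_000  # upper bound: < 100M unique entries
record_bytes = 1 * KB            # ~1K bytes per record
machine_ram_bytes = 256 * GB     # RAM per machine

store_bytes = entries_per_store * record_bytes
total_bytes = num_stores * store_bytes

# Ceiling division: machines needed if RAM held nothing but state.
min_machines = -(-total_bytes // machine_ram_bytes)

print(f"per store: {store_bytes / GB:.0f} GB")
print(f"total:     {total_bytes / TB:.0f} TB")
print(f"machines:  {min_machines} at 100% RAM utilisation")
```

Under these assumptions the floor is 40 machines, so a figure of around 50 corresponds to leaving roughly 20% of each machine's memory for everything other than state.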

Plus, the incoming raw data rate could reach 10M records per second in peak hours. So, during aggregation, data movement between Kafka Streams instances will be heavy, i.e., 10M records per second in the cluster for joins and aggregations.

Is Kafka Streams good for this? My gut feeling is Kafka Streams is fine. But I'd like to run this by you.

Also, I am hoping to minimize data movement (to save bandwidth) during joins/groupBys. If I partition the raw data by the minimum subset of aggregation keys (say K1 and K2), will the subsequent joins/groupBys (say on keys K1, K2, K3, K4) happen on local data when using the DSL?
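
To make the locality question concrete, here is a minimal sketch of the data-placement property it relies on: if the partition is chosen from (K1, K2) alone, all records sharing the finer key (K1, K2, K3, K4) necessarily sit on the same partition. The partitioner, partition count, and records are hypothetical stand-ins, not Kafka's own (Kafka's default partitioner hashes the serialized key with murmur2). Whether the DSL takes advantage of this placement is a separate question that the sketch does not answer.

```python
# Illustrative only: a stand-in partitioner, not Kafka's actual one.
import zlib
from collections import defaultdict

NUM_PARTITIONS = 8  # hypothetical partition count


def partition_for(k1: str, k2: str) -> int:
    """Pick a partition from the coarse key (K1, K2) only."""
    return zlib.crc32(f"{k1}|{k2}".encode()) % NUM_PARTITIONS


# Hypothetical records: (K1, K2, K3, K4)
records = [
    ("us", "web", "chrome", "v1"),
    ("us", "web", "chrome", "v1"),
    ("us", "web", "firefox", "v2"),
    ("eu", "app", "ios", "v3"),
    ("eu", "app", "ios", "v3"),
]

# For each fine-grained group, collect the partitions its records land on.
partitions_by_group = defaultdict(set)
for k1, k2, k3, k4 in records:
    partitions_by_group[(k1, k2, k3, k4)].add(partition_for(k1, k2))

# Every (K1, K2, K3, K4) group sits entirely on one partition.
all_local = all(len(parts) == 1 for parts in partitions_by_group.values())
print(all_local)
```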


Thanks
Tianji


On 2017-02-25 13:49 (-0500), Guozhang Wang <w...@gmail.com> wrote:
> Hello Kohki,
>
> Thanks for the email. I'd like to learn what your concern is about the size of
> the state store. From your description it's a bit hard to figure out, but
> I'd guess you have lots of state stores while each of them is relatively
> small?
>
> Hello Tianji,
>
> Regarding your question about the maturity and users of Streams, you can take a
> look at some of the blog posts written about Streams usage in
> production, for example:
>
> http://engineering.skybettingandgaming.com/2017/01/23/streaming-architectures/
>
> http://developers.linecorp.com/blog/?p=3960
>
> Guozhang
>
>
> On Sat, Feb 25, 2017 at 7:52 AM, Kohki Nishio <ta...@gmail.com> wrote:
>
> > I did a bit of research on that matter recently. The comparison is between
> > Spark Structured Streaming (SSS) and Kafka Streams.
> >
> > Both are relatively new (~1y) and try to solve similar problems. However,
> > if you go with Spark, you have to go with a cluster. If your environment
> > already has a cluster, then that's good. However, our team doesn't do any
> > Spark, so the initial cost would be very high. On the other hand, Kafka
> > Streams is a Java library; since we have a service framework, doing streaming
> > inside a service is super easy.
> >
> > However, for some reason, people see SSS as more mature and Kafka Streams as
> > not so mature (like beta). But the old-fashioned stream APIs are both mature
> > enough (in my opinion); I didn't see any difference between DStream (Spark)
> > and KStream (Kafka).
> >
> > DataFrame (Structured Streaming) and KTable, though, I found quite different.
> > Kafka's model is more like a change log, which means you need to see the
> > latest entry to make a final decision. I would call this the 'Update' model,
> > whereas Spark uses the 'Append' model and doesn't support the 'Update' model
> > yet (it's coming in 2.2).
> >
> > http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-modes
> >
> > I wanted to have the 'Append' model with Kafka, but it seems that's not an
> > easy thing to do. Also, Kafka Streams uses an internal topic to keep state
> > changes for fail-over scenarios, but I'm dealing with lots of tiny pieces of
> > information and I have a big concern about the size of the state store /
> > topic, so my decision is to go with my own handling of the Kafka API.
> >
> > If you do stateless operations and don't have a Spark cluster, yeah, Kafka
> > Streams is perfect.
> > If you do complicated stateful operations and happen to have a Spark
> > cluster, give Spark a try.
> > Otherwise, you have to write code that is optimized for your use case.
> >
> >
> > thanks
> > -Kohki
> >
> >
> > On Fri, Feb 24, 2017 at 6:22 PM, Tianji Li <sk...@gmail.com> wrote:
> >
> > > Hi there,
> > >
> > > Can anyone give a good explanation of the cases in which Kafka Streams is
> > > preferred, and the cases in which Spark Streaming is better?
> > >
> > > Thanks
> > > Tianji
> > >
> >
> > --
> > Kohki Nishio
> >
>
> --
> -- Guozhang
>
