Re: Kafka Streams vs Spark Streaming

Kohki Nishio Sun, 26 Feb 2017 10:57:40 -0800

Tianji,
KStream is indeed Append mode as long as I do stateless processing, but
an aggregation is a stateful operation: it turns the stream into a KTable,
and that works in Update mode.
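
To make the distinction concrete, here is a minimal sketch in plain Python.
It is not the Kafka Streams API; the function names and the sample records
are made up for illustration. The stateless step emits one new row per input
(Append), while the running aggregation emits the latest total per key, so
each row supersedes the earlier rows for that key (Update).

```python
from collections import defaultdict


def stateless_append(records):
    """Stateless step (KStream-like): every input yields one new output row."""
    return [(key, value * 2) for key, value in records]


def stateful_update(records):
    """Running sum per key (KTable-like): every input emits the key's latest total."""
    totals = defaultdict(int)
    changelog = []
    for key, value in records:
        totals[key] += value
        # This row supersedes any earlier row emitted for the same key.
        changelog.append((key, totals[key]))
    return changelog, dict(totals)


# Hypothetical sample input: (key, value) pairs.
records = [("a", 1), ("b", 2), ("a", 3)]
appended = stateless_append(records)
changelog, latest = stateful_update(records)
```

A consumer of the changelog has to look at the latest entry per key to get
the final answer, which is the point about the Update model.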


Regarding your aggregations, I believe Kafka's aggregation works within a
single partition, not across multiple partitions. Are you doing 100
different aggregations against the same record key? If so, you should keep a
single data object holding those 100 values. Anyway, it sounds like we have
a similar problem.
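
A rough sketch of what I mean by a single data object, again in plain Python
rather than the Kafka Streams API. Everything here is an assumption for
illustration: the partition count, the CRC32 hash (a stand-in, not Kafka's
actual partitioner), and the three aggregates standing in for your 100.

```python
import zlib
from dataclasses import dataclass

NUM_PARTITIONS = 4  # hypothetical partition count


def partition_for(key: str) -> int:
    """Stable hash partitioner: the same key always maps to the same partition."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


@dataclass
class Aggregates:
    """One object per key holding several aggregates, updated in a single pass."""
    count: int = 0
    total: float = 0.0
    maximum: float = float("-inf")

    def update(self, value: float) -> None:
        self.count += 1
        self.total += value
        self.maximum = max(self.maximum, value)


def aggregate(records):
    """One store per partition; each store keeps one Aggregates object per key."""
    stores = [dict() for _ in range(NUM_PARTITIONS)]
    for key, value in records:
        store = stores[partition_for(key)]
        store.setdefault(key, Aggregates()).update(value)
    return stores


# Hypothetical sample input: (key, value) pairs.
stores = aggregate([("k1", 1.0), ("k2", 5.0), ("k1", 3.0)])
k1 = stores[partition_for("k1")]["k1"]
```

All records for a key land in one partition, so one lookup and one write per
record update every aggregate for that key, instead of touching one store
per aggregation.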

-Kohki





On Sat, Feb 25, 2017 at 1:11 PM, Tianji Li <skyah...@gmail.com> wrote:

> Hi Kohki,
>
> Thanks very much for providing your investigation results. Regarding
> 'append' mode with Kafka Streams, isn't KStream the thing you want?
>
> Hi Guozhang,
>
> Thanks for the pointers to the two blogs. I read one of them before and
> just had a look at the other one.
>
> What I am hoping to do is described below. Can you help me decide whether
> Kafka Streams is a good fit?
>
> We have a few data sources, and we are hoping to correlate these sources,
> and then do aggregations, as *a stream in real-time*.
>
> The number of aggregations is around 100, which means that with Kafka
> Streams we need to maintain around 100 state stores, with 100 change-log
> topics behind the scenes, for the joins and aggregations.
>
> The number of unique entries in each of these state stores is expected to
> be at the level of < 100M. The size of each record is around 1K bytes and
> so, each state is expected to be ~100G bytes in size. The total number of
> bytes in all these state stores is thus around 10T bytes.
>
> Keeping all these stores in memory translates into around 50 machines
> with 256 GB each for this purpose alone.
>
> Plus, the incoming raw data rate could reach 10M records per second in
> peak hours. So, during aggregation, data movement between Kafka Streams
> instances will be heavy, i.e., 10M records per second in the cluster for
> joining and aggregations.
>
> Is Kafka Streams good for this? My gut feeling is Kafka Streams is fine.
> But I'd like to run this by you.
>
> Also, I am hoping to minimize data movement (to save bandwidth) during
> joins/groupBys. If I partition the raw data by the minimum subset of
> aggregation keys (say K1 and K2), will the subsequent joins/groupBys
> (say on keys K1, K2, K3, K4) happen on local data when using the DSL?
>
> Thanks
> Tianji
>
>
> On 2017-02-25 13:49 (-0500), Guozhang Wang <w...@gmail.com> wrote:
> > Hello Kohki,
> >
> > Thanks for the email. I'd like to learn what your concern is about the
> > size of the state store. From your description it's a bit hard to figure
> > out, but I'd guess you have lots of state stores, each of them relatively
> > small?
> >
> > Hello Tianji,
> >
> > Regarding your question about the maturity and users of Streams, you can
> > take a look at some of the blog posts written about Streams usage in
> > production, for example:
> >
> > http://engineering.skybettingandgaming.com/2017/01/23/streaming-architectures/
> >
> > http://developers.linecorp.com/blog/?p=3960
> >
> > Guozhang
> >
> >
> > On Sat, Feb 25, 2017 at 7:52 AM, Kohki Nishio <ta...@gmail.com> wrote:
> >
> > > I did a bit of research on this recently, comparing Spark Structured
> > > Streaming (SSS) and Kafka Streams.
> > >
> > > Both are relatively new (~1y) and try to solve similar problems.
> > > However, if you go with Spark, you have to go with a cluster. If your
> > > environment already has a cluster, that's fine, but our team doesn't do
> > > any Spark, so the initial cost would be very high. Kafka Streams, on the
> > > other hand, is a Java library, and since we have a service framework,
> > > doing stream processing inside a service is super easy.
> > >
> > > For some reason, though, people see SSS as more mature and Kafka Streams
> > > as not so mature (like beta). In my opinion, the old-fashioned stream
> > > abstractions are both mature enough: I didn't see any difference between
> > > DStream (Spark) and KStream (Kafka).
> > >
> > > DataFrame (Structured Streaming) and KTable, however, I found quite
> > > different. Kafka's model is more like a change log, which means you need
> > > to see the latest entry to make a final decision. I would call this the
> > > 'Update' model, whereas Spark uses the 'Append' model and doesn't support
> > > the 'Update' model yet (it's coming in 2.2).
> > >
> > > http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-modes
> > >
> > > I wanted the 'Append' model with Kafka, but it doesn't seem easy to do.
> > > Also, Kafka Streams uses an internal topic to keep state changes for
> > > fail-over, and since I'm dealing with lots of tiny pieces of
> > > information, I have a big concern about the size of the state store /
> > > topic. So my decision is to go with my own handling of the Kafka API.
> > >
> > > If you do stateless operations and don't have a Spark cluster, yeah,
> > > Kafka Streams is perfect.
> > > If you do complicated stateful operations and happen to have a Spark
> > > cluster, give Spark a try.
> > > Otherwise, you have to write code that is optimized for your use case.
> > >
> > > thanks
> > > -Kohki
> > >
> > > On Fri, Feb 24, 2017 at 6:22 PM, Tianji Li <sk...@gmail.com> wrote:
> > >
> > > > Hi there,
> > > >
> > > > Can anyone give a good explanation of the cases in which Kafka
> > > > Streams is preferred, and the cases in which Spark Streaming is
> > > > better?
> > > >
> > > > Thanks
> > > > Tianji
> > > >
> > >
> > > --
> > > Kohki Nishio
> > >
> >
> >
> >
> > --
> > -- Guozhang
> >
>



-- 
Kohki Nishio
