Tianji, KStream is indeed Append mode as long as I do stateless processing, but when you do aggregation that is a stateful operation and it turns to KTable and that does Update mode.
In regard to your aggregation, I believe Kafka's aggregation works for a single partition not over multiple partitions, are you doing 100 different aggregation against record key ? Then you should have a single data object for those 100 values, anyway it sounds like we have similar problem .. -Kohki On Sat, Feb 25, 2017 at 1:11 PM, Tianji Li <skyah...@gmail.com> wrote: > Hi Kohki, > > Thanks very much for providing your investigation results. Regarding > 'append' mode with Kafka Streams, isn't KStream the thing you want? > > Hi Guozhang, > > Thanks for the pointers to the two blogs. I read one of them before and > just had a look at the other one. > > What I am hoping to do is below, can you help me decide if Kafka Stream is > a good fit? > > We have a few data sources, and we are hoping to correlate these sources, > and then do aggregations, as *a stream in real-time*. > > The number of aggregations is around 100 which means, if using Kafka > Streams, we need to maintain around 100 state stores with 100 change-log > topics behind > the scene when joining and aggregations. > > The number of unique entries in each of these state stores is expected to > be at the level of < 100M. The size of each record is around 1K bytes and > so, > each state is expected to be ~100G bytes in size. The total number of > bytes in all these state stores is thus around 10T bytes. > > If keeping all these stores in memory, this translates into around 50 > machines with 256Gbytes for this purpose alone. > > Plus, the incoming raw data rate could reach 10M records per second in > peak hours. So, during aggregation, data movement between Kafka Streams > instances > will be heavy, i.e., 10M records per second in the cluster for joining and > aggregations. > > Is Kafka Streams good for this? My gut feeling is Kafka Streams is fine. > But I'd like to run this by you. > > And, I am hoping to minimize data movement (to saving bandwidth) during > joins/groupBys. If I partition the raw data with the minimum subset of > aggregation keys (say K1 and K2), then I wonder if the following > joins/groupBys (say on keys K1, K2, K3, K4) happen on local data, if using > DSL? > > Thanks > Tianji > > > On 2017-02-25 13:49 (-0500), Guozhang Wang <w...@gmail.com> wrote: > > Hello Kohki,> > > > > Thanks for the email. I'd like to learn what's your concern of the size > of> > > the state store? From your description it's a bit hard to figure out but> > > I'd guess you have lots of state stores while each of them are > relatively> > > small?> > > > > Hello Tianji,> > > > > Regarding your question about maturity and users of Streams, you can > take a> > > look at a bunch of the blog posts written about their Streams usage in> > > production, for example:> > > > > http://engineering.skybettingandgaming.com/2017/01/23/ > streaming-architectures/> > > > > http://developers.linecorp.com/blog/?p=3960> > > > > Guozhang> > > > > > > On Sat, Feb 25, 2017 at 7:52 AM, Kohki Nishio <ta...@gmail.com> wrote:> > > > > > I did a bit of research on that matter recently, the comparison is > between> > > > Spark Structured Streaming(SSS) and Kafka Streams,> > > >> > > > Both are relatively new (~1y) and trying to solve similar problems, > however> > > > if you go with Spark, you have to go with a cluster, if your > environment> > > > already have a cluster, then it's good. However our team doesn't do > any> > > > Spark, so the initial cost would be very high. On the other hand, > Kafka> > > > Streams is a java library, since we have a service framework, doing > stream> > > > inside a service is super easy.> > > >> > > > However for some reason, people see SSS is more mature and Kafka > Streams is> > > > not so mature (like Beta). But old fashion stream is both mature > enough (in> > > > my opinion), I didn't see any difference in DStream(Spark) and> > > > KStream(Kafka)> > > >> > > > DataFrame (Structured Streaming) and KTable, I found it quite > different.> > > > Kafka's model is more like a change log, that means you need to see > the> > > > latest entry to make a final decision. I would call this as 'Update' > model,> > > > whereas Spark does 'Append' model and it doesn't support 'Update' > model> > > > yet. (it's coming to 2.2)> > > >> > > > http://spark.apache.org/docs/latest/structured-streaming-pro> > > > gramming-guide.html#output-modes> > > >> > > > I wanted to have 'Append' model with Kafka, but it seems it's not easy> > > > thing to do, also Kafka Streams uses an internal topic to keep state> > > > changes for fail-over scenario, but I'm dealing with a lots of tiny> > > > information and I have a big concern about the size of the state store > /> > > > topic, so my decision is that I'm going with my own handling of Kafka > API> > > > ..> > > >> > > > If you do stateless operation and don't have a spark cluster, yeah > Kafka> > > > Streams is perfect.> > > > If you do stateful complicated operation and happen to have a spark> > > > cluster, give Spark a try> > > > else you have to write a code which is optimized for your use case> > > >> > > >> > > > thanks> > > > -Kohki> > > >> > > >> > > >> > > >> > > > On Fri, Feb 24, 2017 at 6:22 PM, Tianji Li <sk...@gmail.com> wrote:> > > >> > > > > Hi there,> > > > >> > > > > Can anyone give a good explanation in what cases Kafka Streams is> > > > > preferred, and in what cases Sparking Streaming is better?> > > > >> > > > > Thanks> > > > > Tianji> > > > >> > > >> > > >> > > >> > > > --> > > > Kohki Nishio> > > >> > > > > > > > > -- > > > -- Guozhang> > > > -- Kohki Nishio