Arvind, your use cases sound very similar to what Bobby Calderwood presented at Strange Loop this year:
Commander: Better Distributed Applications through CQRS, Event Sourcing, and Immutable Logs
https://speakerdeck.com/bobbycalderwood/commander-better-distributed-applications-through-cqrs-event-sourcing-and-immutable-logs
(The link also includes pointers to the recorded talk and example code.)

See also http://www.confluent.io/blog/event-sourcing-cqrs-stream-processing-apache-kafka-whats-connection/ for a higher-level introduction to such architectures.

Hope this helps!
Michael

On Mon, Sep 26, 2016 at 11:02 AM, Michael Noll <mich...@confluent.io> wrote:

> PS: Arvind, I think such questions should typically be sent to the kafka-user mailing list (not to kafka-dev).
>
> On Mon, Sep 26, 2016 at 3:41 AM, Matthias J. Sax <matth...@confluent.io> wrote:
>
>> Hi Arvind,
>>
>> Short answer: Kafka Streams can definitely help you!
>>
>> Long answer: Kafka Streams offers two layers for programming your stream processing job, the low-level Processor API and the high-level DSL. Please check the documentation for further details: http://docs.confluent.io/3.0.1/streams/index.html
>>
>> With the Processor API you can do anything -- at the cost of a lower abstraction level and thus more coding. I would guess this is the best way to program your Kafka Streams application for your use case.
>>
>> The DSL is easier to use and provides higher-level abstractions -- however, I am not sure it covers what you need in your use case. But maybe it's worth trying it out before falling back to the Processor API...
>>
>> For your second question, I would recommend using the Processor API, attaching a state store to a processor node, and writing to your data warehouse whenever a state is "complete" (see http://docs.confluent.io/3.0.1/streams/developer-guide.html#defining-a-state-store).
>>
>> One more hint: you can actually mix the DSL and the Processor API (e.g., by using process() or transform() within the DSL).
>>
>> Hope this gives you some initial pointers. Please follow up if you have more questions.
>>
>> -Matthias
>>
>> On 09/24/2016 11:25 PM, Arvind Kalyan wrote:
>> > Hello there!
>> >
>> > I read about Kafka Streams recently -- it's pretty interesting how it solves the stream processing problem in a cleaner way, with less overhead and complexity.
>> >
>> > I work as a software engineer at a startup, and we are in the design stage of building a stream processing pipeline (if you will) for the millions of events we get every day. We have been using a Kafka cluster as the log aggregation layer in production for 5-6 months already and are very happy with it.
>> >
>> > I went through a few Confluent blog posts (by Jay, Neha) on how Kafka Streams handles stateful event processing; maybe I missed the whole point, but I have some doubts.
>> >
>> > We have use cases like the following:
>> >
>> > There is an event E1, which is the base event, after which we have a lot of sub-events E2, E3, ..., En enriching E1 with extra properties (with considerable delay, say 30-40 minutes).
>> >
>> > Eg. 1: An order event comes in when a user orders an item on our website (this is the base event). After, say, 30-40 minutes, we get events like packaging_time, shipping_time, delivered_time or cancelled_time related to that order (these are the sub-events).
>> >
>> > So before we get the whole event into the warehouse, we need to collect all of these (ordered, packaged, shipped, cancelled/delivered); whenever I get a cancelled or delivered event for an order, I know that completes the lifecycle for that order, and I can put it into the warehouse.
>> >
>> > Eg. 2: User login events -- if we are capturing events like User-Logged-In and User-Logged-Out, I need them in the warehouse as a single row. E.g., the row would have the columns *user_id, login_time, logout_time*. So as and when I receive a logout event (and if I have the login event stored in some store), there would be a trigger which combines both and sends the result across to the warehouse.
>> >
>> > All of these involve storing the state of the events and acting as and when another event (one that completes the lifecycle) occurs, after which you have a trigger for further steps (warehouse or anything else).
>> >
>> > Does Kafka Streams help me do this? If not, how should I go about solving this problem?
>> >
>> > Also, I wanted some advice as to whether it is standard practice to aggregate like this and *then* store to the warehouse, or whether I should append each event to the warehouse and do sort of an ELT on it using the warehouse (run a query to re-structure the data in the database and store it off as a separate table)?
>> >
>> > Eagerly waiting for your reply,
>> > Arvind
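
Below is a minimal sketch, in Java, of the Processor-API-plus-state-store approach Matthias describes above, written against the 0.10.x API covered by the linked 3.0.1 docs. It buffers the base order event and its sub-events per order ID in a local state store and only forwards a combined record once a terminal event (delivered or cancelled) closes the lifecycle. The topic/store names, the order-ID keys, the opaque string (e.g. JSON) values, and the isTerminal() check are assumptions made purely for illustration; none of them come from the thread.

import org.apache.kafka.streams.processor.Processor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;

// Buffers the base order event plus its sub-events in a local state store,
// keyed by order ID, and forwards one combined record downstream once a
// terminal event (delivered or cancelled) completes the order's lifecycle.
public class OrderLifecycleProcessor implements Processor<String, String> {

    private ProcessorContext context;
    private KeyValueStore<String, String> store;

    @Override
    @SuppressWarnings("unchecked")
    public void init(final ProcessorContext context) {
        this.context = context;
        // Must match the store name registered on the topology (see the wiring sketch below).
        this.store = (KeyValueStore<String, String>) context.getStateStore("order-lifecycle-store");
    }

    @Override
    public void process(final String orderId, final String event) {
        final String soFar = store.get(orderId);
        final String merged = (soFar == null) ? event : soFar + "," + event;

        if (isTerminal(event)) {
            store.delete(orderId);              // lifecycle complete: clean up ...
            context.forward(orderId, merged);   // ... and emit the combined record
            context.commit();
        } else {
            store.put(orderId, merged);         // still waiting for more sub-events
        }
    }

    @Override
    public void punctuate(final long timestamp) {
        // Could be used to expire orders that never receive a terminal event.
    }

    @Override
    public void close() {}

    // Purely illustrative: how you detect a terminal event depends on your schema.
    private static boolean isTerminal(final String event) {
        return event.contains("delivered") || event.contains("cancelled");
    }
}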
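
And one way the surrounding topology might be wired up, again only a sketch with assumed topic and store names. The same kind of processor can alternatively be hooked into the DSL via the process() and transform() methods Matthias mentions; the output topic here ("completed-orders") is where a connector or downstream consumer could pick up finished lifecycles and load them into the warehouse.

import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.processor.TopologyBuilder;
import org.apache.kafka.streams.state.Stores;

public class OrderLifecycleApp {

    public static void main(final String[] args) {
        final TopologyBuilder builder = new TopologyBuilder();

        // Source topic carrying both base events and sub-events, keyed by order ID.
        builder.addSource("OrderEvents", "order-events");
        builder.addProcessor("LifecycleJoin", OrderLifecycleProcessor::new, "OrderEvents");

        // The state store the processor looks up in init(), attached to that node.
        builder.addStateStore(
            Stores.create("order-lifecycle-store")
                  .withKeys(Serdes.String())
                  .withValues(Serdes.String())
                  .persistent()
                  .build(),
            "LifecycleJoin");

        // Completed lifecycles are written here for loading into the warehouse.
        builder.addSink("CompletedOrders", "completed-orders", "LifecycleJoin");

        final Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-lifecycle-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
        props.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());

        final KafkaStreams streams = new KafkaStreams(builder, props);
        streams.start();
    }
}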