Hello there! I recently read about Kafka Streams; it's interesting how it solves the stream processing problem in a cleaner way, with less overhead and complexity.
I work as a Software Engineer at a startup, and we are in the design stage of building a stream processing pipeline (if you will) for the millions of events we get every day. We have been using a Kafka cluster as the log aggregation layer in production for 5-6 months already and are very happy with it. I went through a few Confluent blog posts (by Jay and Neha) on how Kafka Streams handles stateful event processing, and maybe I missed the whole point, but I still have some doubts.

We have use cases like the following: there is an event E1, which is sort of a base event, after which we get a lot of sub-events E2, E3 ... En that enrich E1 with extra properties (with considerable delay, say 30-40 minutes).

Eg. 1: An order event comes in when a user orders an item on our website (this is the base event). After 30-40 minutes, we get events like packaging_time, shipping_time, delivered_time or cancelled_time related to that order (these are the sub-events). Before we send the whole record to the warehouse, we need to collect all of these (ordered, packaged, shipped, cancelled/delivered). Whenever I get a cancelled or delivered event for an order, I know that completes the lifecycle for that order, and I can put it in the warehouse.

Eg. 2: User login events. If we capture events like User-Logged-In and User-Logged-Out, I need them in the warehouse as a single row, e.g. a row with the columns *user_id, login_time, logout_time*. So as and when I receive a logout event (and if I have the matching login event stored somewhere), a trigger would combine both and send the result to the warehouse.

All of these involve storing event state and acting as and when another event (one that completes the lifecycle) occurs, which then triggers the next step (warehouse load or anything else). Does Kafka Streams help me do this? If not, how should I go about solving this problem?
Also, I wanted some advice on whether it is standard practice to aggregate like this and *then* load into the warehouse, or whether I should append each raw event to the warehouse and do a sort of ELT there (run a query that restructures the data in the database and stores it off as a separate table).

Eagerly waiting for your reply,
Arvind