Hi Arvind,

Short answer: yes, Kafka Streams can definitely help you!

Long answer: Kafka Streams offers two layers for programming your stream
processing job: the low-level Processor API and the high-level DSL.
Please check the documentation for further details:
http://docs.confluent.io/3.0.1/streams/index.html

With the Processor API you are able to do anything -- at the cost of a
lower level of abstraction and thus more coding. I guess this would be
the best way to program your Kafka Streams application for your use case.

The DSL is easier to use and provides higher-level abstractions --
however, I am not sure if it covers what you need in your use case. But
maybe it's worth trying it out before falling back to the Processor API...

For your second question, I would recommend using the Processor API,
attaching a state store to a processor node, and writing to your data
warehouse whenever a state is "complete" (see
http://docs.confluent.io/3.0.1/streams/developer-guide.html#defining-a-state-store).
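
To make this more concrete, here is a minimal sketch against the
0.10.0.x / Confluent 3.0.1 Processor API. The topic names
("order-events", "completed-orders"), the store name ("order-store"),
the class names, and the naive string-concatenation "merge" are just
assumptions to illustrate your order example -- you would plug in your
own types and merge logic:

  import java.util.Properties;

  import org.apache.kafka.common.serialization.Serdes;
  import org.apache.kafka.streams.KafkaStreams;
  import org.apache.kafka.streams.StreamsConfig;
  import org.apache.kafka.streams.processor.Processor;
  import org.apache.kafka.streams.processor.ProcessorContext;
  import org.apache.kafka.streams.processor.TopologyBuilder;
  import org.apache.kafka.streams.state.KeyValueStore;
  import org.apache.kafka.streams.state.Stores;

  public class OrderLifecycleExample {

      // buffers sub-events per order in a local state store and emits the
      // merged record once a terminal event (delivered/cancelled) arrives
      public static class OrderLifecycleProcessor implements Processor<String, String> {
          private ProcessorContext context;
          private KeyValueStore<String, String> store;

          @Override
          @SuppressWarnings("unchecked")
          public void init(ProcessorContext context) {
              this.context = context;
              this.store = (KeyValueStore<String, String>) context.getStateStore("order-store");
          }

          @Override
          public void process(String orderId, String event) {
              String buffered = store.get(orderId);
              String merged = (buffered == null) ? event : buffered + "|" + event; // naive merge
              if (merged.contains("delivered") || merged.contains("cancelled")) {  // lifecycle complete
                  context.forward(orderId, merged); // hand the full record to the downstream sink node
                  store.delete(orderId);
              } else {
                  store.put(orderId, merged);       // keep waiting for more sub-events
              }
          }

          @Override
          public void punctuate(long timestamp) { } // could be used to expire stale orders

          @Override
          public void close() { }
      }

      public static void main(String[] args) {
          TopologyBuilder builder = new TopologyBuilder();
          builder.addSource("Source", "order-events")
                 .addProcessor("Lifecycle", OrderLifecycleProcessor::new, "Source")
                 .addStateStore(
                     Stores.create("order-store").withStringKeys().withStringValues()
                           .persistent().build(),
                     "Lifecycle")
                 .addSink("Sink", "completed-orders", "Lifecycle");

          Properties props = new Properties();
          props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-lifecycle-app");
          props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
          props.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
          props.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());

          new KafkaStreams(builder, props).start();
      }
  }

You could then let a connector (or any consumer) move the
"completed-orders" topic into your warehouse, or skip the sink node and
write to the warehouse directly inside process() when the state is complete.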

One more hint: you can actually mix the DSL and the Processor API, by
using process() or transform() within the DSL.
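
For example (again just a sketch, reusing the hypothetical names and the
processor class from above), you could do the simple record-wise work in
the DSL and then drop down to the custom processor:

  import org.apache.kafka.common.serialization.Serdes;
  import org.apache.kafka.streams.kstream.KStreamBuilder;
  import org.apache.kafka.streams.state.Stores;

  KStreamBuilder builder = new KStreamBuilder();

  // register the store on the builder so process() can be connected to it
  builder.addStateStore(
      Stores.create("order-store").withStringKeys().withStringValues().persistent().build());

  builder.stream(Serdes.String(), Serdes.String(), "order-events")
         .filter((orderId, event) -> event != null)   // stay in the DSL...
         .process(OrderLifecycleExample.OrderLifecycleProcessor::new,  // ...then use the Processor API
                  "order-store");

Keep in mind that process() is a terminal operation in the DSL, so a
processor plugged in this way has no downstream node to forward to; it
would write completed records out itself (e.g. straight to the warehouse).
If you want to continue the DSL chain afterwards, use transform() instead.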


Hope this gives you some initial pointers. Please follow up if you have
more questions.

-Matthias


On 09/24/2016 11:25 PM, Arvind Kalyan wrote:
> Hello there!
> 
> I read about Kafka Streams recently; it's pretty interesting how it solves
> the stream processing problem in a cleaner way with less overhead and
> complexity.
> 
> I work as a Software Engineer at a startup, and we are in the design stage
> of building a stream processing pipeline (if you will) for the millions of
> events we get every day. We have been running Kafka (a cluster) as the log
> aggregation layer in production for 5-6 months already and are very happy with it.
> 
> I went through a few Confluent blogs (by Jay, Neha) on how KStreams
> solves stateful event processing; maybe I missed the whole point in this
> regard, but I have some doubts.
> 
> We have use-cases like the following:
> 
> There is an event E1, which is sort of the base event, after which we have a
> lot of sub-events E2,E3..En enriching E1 with a lot of extra properties
> (with considerable delay, say 30-40 mins).
> 
> Eg. 1: An order event comes in when the user orders an item on our
> website (this is the base event). After, say, 30-40 minutes, we get events
> like packaging_time, shipping_time, delivered_time or cancelled_time, etc.,
> related to that order (these are the sub-events).
> 
> So before we get the whole event to a warehouse, we need to collect all
> these (ordered, packaged, shipped, cancelled/delivered), and whenever I get
> a cancelled or delivered event for an order, I know that completes the
> lifecycle for that order, and can put it in the warehouse.
> 
> Eg. 2: User login events - if we are to capture events like User-Logged-In
> and User-Logged-Out, I need them to be in the warehouse as a single row. E.g.
> the row would have the columns *user_id, login_time, logout_time*. So as and
> when I receive a logout event (and if I have the login event stored in some
> store), there would be a trigger which combines both and sends the result
> across to the warehouse.
> 
> All of these involve storing the state of the events and acting as and when
> another event (the one that completes the lifecycle) occurs, after which there
> is a trigger for further steps (warehouse or anything else).
> 
> Does KStream help me do this? If not, how should I go about solving this
> problem?
> 
> Also, I wanted some advice as to whether it is standard practice to
> aggregate like this and *then* store to the warehouse, or should I append each
> event to the warehouse and do sort of an ELT on it using the warehouse?
> (Run a query to re-structure the data in the database and store it off as a
> separate table.)
> 
> Eagerly waiting for your reply,
> Arvind
> 
