Hi -

Let's say one day a company wants to start doing all of this awesome data
integration/near-real-time stream processing stuff, so they start sending
their user activity events (e.g., pageviews, ad impressions, etc.) to Kafka.
Then they hook up Camus to copy new events from Kafka to HDFS every hour.
They use the default Kafka log retention period of 7 days. So after a few
months, Kafka holds only the last 7 days of events, while HDFS has all
events except the newest ones not yet transferred by Camus (at most the
last hour's worth, since Camus runs hourly).
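For reference, that retention period corresponds to the stock broker
setting (168 hours is the default; shown here explicitly):

    # server.properties on the Kafka brokers
    log.retention.hours=168   # 7 days; log segments older than this are deleted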

Then the company wants to build out a system that uses Samza to process the
user activity events from Kafka and output them to some queryable data
store. If standard Samza reprocessing [1] is used, then only the last 7
days of events in Kafka get processed and put into the data store. All
future events, of course, also get seamlessly processed by the Samza jobs
and put into the data store, which is awesome.
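For concreteness, the jobs I have in mind are plain StreamTasks along these
lines (the class name and the "store"/"user-activity-out" system and stream
names are made up):

    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.system.OutgoingMessageEnvelope;
    import org.apache.samza.system.SystemStream;
    import org.apache.samza.task.MessageCollector;
    import org.apache.samza.task.StreamTask;
    import org.apache.samza.task.TaskCoordinator;

    public class UserActivityTask implements StreamTask {
      // Hypothetical names for the downstream queryable data store's
      // system and stream
      private static final SystemStream OUTPUT =
          new SystemStream("store", "user-activity-out");

      public void process(IncomingMessageEnvelope envelope,
          MessageCollector collector, TaskCoordinator coordinator)
          throws Exception {
        // Real transformation logic would go here; pass-through for brevity
        collector.send(new OutgoingMessageEnvelope(OUTPUT, envelope.getMessage()));
      }
    }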

But let's say this company needs all of the historical events to be
processed by Samza and put into the data store (i.e. the events older than
7 days that are in HDFS but no longer in Kafka). It's a Business Critical
thing and absolutely must happen. How should this company achieve this?

I'm sure there are many potential solutions to this problem (one naive
sketch is below), but has anyone actually done this? What approach did you
take?
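One approach I can imagine (a sketch only, not something I've actually run)
is a one-off program that replays the old HDFS files into a separate Kafka
topic, runs the Samza jobs over that topic first, and then switches them to
the live topic. Something like the following, where the broker address,
HDFS path, and topic name are all made up, and where I'm assuming for
simplicity that the files are plain text with one event per line (real
Camus output would need the appropriate reader):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.util.Properties;

    public class HdfsToKafkaReplay {
      public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");
        props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);

        FileSystem fs = FileSystem.get(new Configuration());
        // Walk the Camus output directory (layout is deployment-specific)
        for (FileStatus f : fs.listStatus(new Path("/camus/user-activity"))) {
          try (BufferedReader r = new BufferedReader(
              new InputStreamReader(fs.open(f.getPath())))) {
            String line;
            while ((line = r.readLine()) != null) {
              // Send each historical event into the replay topic
              producer.send(new ProducerRecord<>("user-activity-replay", line));
            }
          }
        }
        producer.close();
      }
    }

But that leaves open questions like ordering, and duplicates at the
boundary between the replayed topic and the live topic, which is why I'd
love to hear from anyone who has actually done this.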

Any experiences or thoughts would be hugely appreciated.

Thanks,
Zach

[1] http://samza.apache.org/learn/documentation/0.9/jobs/reprocessing.html
