Hi - Let's say one day a company wants to start doing all of this awesome data integration/near-real-time stream processing stuff, so they start sending their user activity events (e.g. pageviews, ad impressions, etc.) to Kafka. Then they hook up Camus to copy new events from Kafka to HDFS every hour. They use the default Kafka log retention period of 7 days. So after a few months, Kafka has the last 7 days of events, and HDFS has all events except the newest ones not yet transferred by Camus.
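(For concreteness, by "sending user activity events to Kafka" I just mean a plain Kafka producer along the lines of the sketch below - the topic name, key, and payload are made up for illustration, not an actual schema.)

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class UserActivityProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-broker:9092");  // hypothetical broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);

        // One pageview event, keyed by user id; real events would follow the company's own schema.
        producer.send(new ProducerRecord<>("user-activity", "user-123",
            "{\"type\":\"pageview\",\"page\":\"/home\",\"ts\":1430000000000}"));

        producer.close();
    }
}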
Then the company wants to build out a system that uses Samza to process the user activity events from Kafka and output them to some queryable data store. If standard Samza reprocessing [1] is used, then only the last 7 days of events in Kafka get processed and put into the data store. Of course, all future events also seamlessly get processed by the Samza jobs and put into the data store, which is awesome. But let's say this company needs all of the historical events to be processed by Samza and put into the data store (i.e. the events older than 7 days that are in HDFS but no longer in Kafka). It's a Business Critical thing and absolutely must happen. How should this company achieve this?

I'm sure there are many potential solutions to this problem, but has anyone actually done this? What approach did you take? Any experiences or thoughts would be hugely appreciated.

Thanks,
Zach

[1] http://samza.apache.org/learn/documentation/0.9/jobs/reprocessing.html
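P.S. To make the question concrete: the kind of job I have in mind is just a plain StreamTask along the lines of the sketch below (the class name and output stream are made up), and by "standard reprocessing" I mean resetting that job's input offsets to oldest and bouncing it, per [1] - which of course only reaches back as far as Kafka's 7-day retention.

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

public class UserActivityTask implements StreamTask {
    // Hypothetical output stream; in reality this would feed whatever queryable store gets chosen.
    private static final SystemStream OUTPUT = new SystemStream("kafka", "user-activity-processed");

    @Override
    public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
                        TaskCoordinator coordinator) {
        // Placeholder pass-through; the real job would enrich/aggregate each event first.
        collector.send(new OutgoingMessageEnvelope(OUTPUT, envelope.getMessage()));
    }
}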