(continuing from previous email) In addition to not wanting to duplicate code, say that some of the Samza jobs need to build up state, and it's important to build this state from all of the old events that are no longer in Kafka. If that state were built from only the last 7 days of events, some things would be missing and the data would be incomplete.
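To make the concern concrete, here is a minimal sketch in plain Python (not Samza code). It assumes a hypothetical job whose state is a per-user pageview count; the event tuples, user names, and the build_state helper are made up for illustration. Only the 7-day retention figure comes from the scenario above.

```python
from collections import Counter
from datetime import datetime, timedelta

# Kafka's default log retention in the scenario described in this thread.
RETENTION = timedelta(days=7)


def build_state(events):
    """Fold (timestamp, user_id) pageview events into per-user counts.

    Hypothetical stand-in for the state a stream job would accumulate.
    """
    state = Counter()
    for _timestamp, user_id in events:
        state[user_id] += 1
    return state


now = datetime(2015, 5, 29)

# Illustrative history: everything that has ever been written (i.e. what HDFS holds).
all_events = [
    (now - timedelta(days=90), "alice"),
    (now - timedelta(days=30), "alice"),
    (now - timedelta(days=3), "alice"),
    (now - timedelta(days=2), "bob"),
]

# What is still available in Kafka after retention has expired older events.
in_kafka = [event for event in all_events if now - event[0] <= RETENTION]

full_state = build_state(all_events)    # state built from the complete history
partial_state = build_state(in_kafka)   # state built from the last 7 days only

print("full:", dict(full_state))
print("partial:", dict(partial_state))
```

The state rebuilt from Kafka alone undercounts any user whose activity is older than the retention window, which is the incompleteness I'm worried about.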
On Fri, May 29, 2015 at 4:20 PM Zach Cox <zcox...@gmail.com> wrote:

> Let's also add to the story: say the company wants to only write code for
> Samza, and not duplicate the same code in MapReduce jobs (or any other
> framework).
>
> On Fri, May 29, 2015 at 4:16 PM Benjamin Black <b...@b3k.us> wrote:
>
>> Why not run a MapReduce job on the data in HDFS? It's what it was made for.
>> On May 29, 2015 2:13 PM, "Zach Cox" <zcox...@gmail.com> wrote:
>>
>> > Hi -
>> >
>> > Let's say one day a company wants to start doing all of this awesome data
>> > integration/near-real-time stream processing stuff, so they start sending
>> > their user activity events (e.g. pageviews, ad impressions, etc.) to Kafka.
>> > Then they hook up Camus to copy new events from Kafka to HDFS every hour.
>> > They use the default Kafka log retention period of 7 days. So after a few
>> > months, Kafka has the last 7 days of events, and HDFS has all events except
>> > the newest events not yet transferred by Camus.
>> >
>> > Then the company wants to build out a system that uses Samza to process the
>> > user activity events from Kafka and output them to some queryable data store.
>> > If standard Samza reprocessing [1] is used, then only the last 7 days of
>> > events in Kafka get processed and put into the data store. Of course, then
>> > all future events also seamlessly get processed by the Samza jobs and put
>> > into the data store, which is awesome.
>> >
>> > But let's say this company needs all of the historical events to be
>> > processed by Samza and put into the data store (i.e. the events older than
>> > 7 days that are in HDFS but no longer in Kafka). It's a Business Critical
>> > thing and absolutely must happen. How should this company achieve this?
>> >
>> > I'm sure there are many potential solutions to this problem, but has anyone
>> > actually done this? What approach did you take?
>> >
>> > Any experiences or thoughts would be hugely appreciated.
>> >
>> > Thanks,
>> > Zach
>> >
>> > [1] http://samza.apache.org/learn/documentation/0.9/jobs/reprocessing.html