Hi Navina, I did see that JIRA and it would definitely be useful. I was thinking of maybe trying to build a composite stream that would first read old events from HDFS and then switch over to Kafka.
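To make the idea concrete, here's a minimal sketch of the switch-over logic I have in mind, with plain Python lists standing in for the HDFS export and the Kafka log. `composite_read` and `switch_offset` are made-up names for illustration, not anything from the Samza API:

```python
# Hypothetical sketch of the "composite stream" idea: drain historical
# events from the HDFS export first, then hand off to live Kafka
# consumption from a chosen offset. Lists stand in for both systems.

def composite_read(hdfs_batches, kafka_log, switch_offset):
    """Yield all historical events, then Kafka events from switch_offset on.

    hdfs_batches:  iterable of event batches copied to HDFS by Camus
    kafka_log:     the Kafka partition, indexed by offset
    switch_offset: first Kafka offset NOT already covered by the HDFS export
    """
    # Phase 1: replay history that has aged out of Kafka retention.
    for batch in hdfs_batches:
        for event in batch:
            yield event
    # Phase 2: switch to the live stream, skipping events already
    # replayed from HDFS so the seam produces no duplicates.
    for event in kafka_log[switch_offset:]:
        yield event
```

In practice the tricky part is picking `switch_offset` so the seam neither drops nor duplicates events; since Camus already tracks the last Kafka offsets it copied to HDFS, that bookkeeping seems like the natural source for it.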
Do you know if there has been any movement on treating HDFS as a Samza stream?

Thanks,
Zach

On Fri, May 29, 2015 at 4:27 PM Navina Ramesh <[email protected]> wrote:

> Hi Zach,
>
> It sounds like you are asking for a SystemConsumer for HDFS. Does
> SAMZA-263 match your requirements?
>
> Thanks!
> Navina
>
> On 5/29/15, 2:23 PM, "Zach Cox" <[email protected]> wrote:
>
> >(continuing from previous email) In addition to not wanting to duplicate
> >code, say that some of the Samza jobs need to build up state, and it's
> >important to build up this state from all of those old events no longer in
> >Kafka. If that state were built from only the last 7 days of events, some
> >things would be missing and the data would be incomplete.
> >
> >On Fri, May 29, 2015 at 4:20 PM Zach Cox <[email protected]> wrote:
> >
> >> Let's also add to the story: say the company wants to write code only
> >> for Samza, and not duplicate the same code in MapReduce jobs (or any
> >> other framework).
> >>
> >> On Fri, May 29, 2015 at 4:16 PM Benjamin Black <[email protected]> wrote:
> >>
> >>> Why not run a MapReduce job on the data in HDFS? That's what it was
> >>> made for.
> >>> On May 29, 2015 2:13 PM, "Zach Cox" <[email protected]> wrote:
> >>>
> >>> > Hi -
> >>> >
> >>> > Let's say one day a company wants to start doing all of this awesome
> >>> > data integration/near-real-time stream processing stuff, so they
> >>> > start sending their user activity events (e.g. pageviews, ad
> >>> > impressions, etc.) to Kafka. Then they hook up Camus to copy new
> >>> > events from Kafka to HDFS every hour. They use the default Kafka log
> >>> > retention period of 7 days. So after a few months, Kafka has the
> >>> > last 7 days of events, and HDFS has all events except the newest
> >>> > events not yet transferred by Camus.
> >>> >
> >>> > Then the company wants to build out a system that uses Samza to
> >>> > process the user activity events from Kafka and output them to some
> >>> > queryable data store. If standard Samza reprocessing [1] is used,
> >>> > then only the last 7 days of events in Kafka get processed and put
> >>> > into the data store. Of course, all future events also seamlessly
> >>> > get processed by the Samza jobs and put into the data store, which
> >>> > is awesome.
> >>> >
> >>> > But let's say this company needs all of the historical events to be
> >>> > processed by Samza and put into the data store (i.e. the events
> >>> > older than 7 days that are in HDFS but no longer in Kafka). It's a
> >>> > Business Critical thing and absolutely must happen. How should this
> >>> > company achieve this?
> >>> >
> >>> > I'm sure there are many potential solutions to this problem, but has
> >>> > anyone actually done this? What approach did you take?
> >>> >
> >>> > Any experiences or thoughts would be hugely appreciated.
> >>> >
> >>> > Thanks,
> >>> > Zach
> >>> >
> >>> > [1]
> >>> > http://samza.apache.org/learn/documentation/0.9/jobs/reprocessing.html
> >>> >
> >>>
> >>
> >
