That said, since we don't yet support consuming from HDFS, one workaround would be to periodically read from HDFS and pump the data to a Kafka topic (say, topic A) using a Hadoop/YARN-based job. Then, in your Samza job, you can bootstrap from topic A and continue processing the latest messages from the other Kafka topic.
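For reference, bootstrap behavior is configured per stream in the job's properties file. Here's a minimal sketch against the Samza 0.9 config keys; the topic names "topic-a" (the HDFS backfill) and "events" (the live stream) are placeholders for whatever your job actually uses:

    # Consume both the backfill topic and the live topic.
    task.inputs=kafka.topic-a,kafka.events

    # Treat the backfill topic as a bootstrap stream: the job reads it
    # to its head before processing messages from any other stream.
    systems.kafka.streams.topic-a.samza.bootstrap=true

    # Start the backfill topic from its oldest offset, not the upcoming one.
    systems.kafka.streams.topic-a.samza.offset.default=oldest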
Thanks!
Navina

On 5/29/15, 2:26 PM, "Navina Ramesh" <nram...@linkedin.com> wrote:

>Hi Zach,
>
>It sounds like you are asking for a SystemConsumer for HDFS. Does
>SAMZA-263 match your requirements?
>
>Thanks!
>Navina
>
>On 5/29/15, 2:23 PM, "Zach Cox" <zcox...@gmail.com> wrote:
>
>>(continuing from previous email) In addition to not wanting to duplicate
>>code, say that some of the Samza jobs need to build up state, and it's
>>important to build up this state from all of those old events no longer
>>in Kafka. If that state were built only from the last 7 days of events,
>>some things would be missing and the data would be incomplete.
>>
>>On Fri, May 29, 2015 at 4:20 PM Zach Cox <zcox...@gmail.com> wrote:
>>
>>> Let's also add to the story: say the company wants to write code only
>>> for Samza, and not duplicate the same code in MapReduce jobs (or any
>>> other framework).
>>>
>>> On Fri, May 29, 2015 at 4:16 PM Benjamin Black <b...@b3k.us> wrote:
>>>
>>>> Why not run a MapReduce job on the data in HDFS? That's what it was
>>>> made for.
>>>>
>>>> On May 29, 2015 2:13 PM, "Zach Cox" <zcox...@gmail.com> wrote:
>>>>
>>>> > Hi -
>>>> >
>>>> > Let's say one day a company wants to start doing all of this awesome
>>>> > data integration/near-real-time stream processing stuff, so they
>>>> > start sending their user activity events (e.g. pageviews, ad
>>>> > impressions, etc.) to Kafka. Then they hook up Camus to copy new
>>>> > events from Kafka to HDFS every hour. They use the default Kafka
>>>> > log retention period of 7 days. So after a few months, Kafka has
>>>> > the last 7 days of events, and HDFS has all events except the
>>>> > newest events not yet transferred by Camus.
>>>> >
>>>> > Then the company wants to build out a system that uses Samza to
>>>> > process the user activity events from Kafka and output them to
>>>> > some queryable data store. If standard Samza reprocessing [1] is
>>>> > used, then only the last 7 days of events in Kafka get processed
>>>> > and put into the data store. Of course, all future events also
>>>> > seamlessly get processed by the Samza jobs and put into the data
>>>> > store, which is awesome.
>>>> >
>>>> > But let's say this company needs all of the historical events to
>>>> > be processed by Samza and put into the data store (i.e. the events
>>>> > older than 7 days that are in HDFS but no longer in Kafka). It's a
>>>> > Business Critical thing and absolutely must happen. How should
>>>> > this company achieve this?
>>>> >
>>>> > I'm sure there are many potential solutions to this problem, but
>>>> > has anyone actually done this? What approach did you take? Any
>>>> > experiences or thoughts would be hugely appreciated.
>>>> >
>>>> > Thanks,
>>>> > Zach
>>>> >
>>>> > [1]
>>>> > http://samza.apache.org/learn/documentation/0.9/jobs/reprocessing.html
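For what it's worth, the "write the logic once" goal in the thread fits naturally with the bootstrap workaround: a single task sees messages from both the backfill topic and the live topic through the same process() callback. A rough sketch against the Samza 0.9 Java API; the class name, the "pageviews" store, and the counting logic are all hypothetical stand-ins for your actual job:

    import org.apache.samza.config.Config;
    import org.apache.samza.storage.kv.KeyValueStore;
    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.task.InitableTask;
    import org.apache.samza.task.MessageCollector;
    import org.apache.samza.task.StreamTask;
    import org.apache.samza.task.TaskContext;
    import org.apache.samza.task.TaskCoordinator;

    // One task processes the HDFS backfill and the live stream with the
    // same logic. Because topic-a is a bootstrap stream, Samza drains it
    // first, so the store is rebuilt from the oldest events before any
    // live message is handled.
    public class UserActivityTask implements StreamTask, InitableTask {

      private KeyValueStore<String, Long> pageviews; // hypothetical store

      @Override
      @SuppressWarnings("unchecked")
      public void init(Config config, TaskContext context) {
        pageviews = (KeyValueStore<String, Long>) context.getStore("pageviews");
      }

      @Override
      public void process(IncomingMessageEnvelope envelope,
                          MessageCollector collector,
                          TaskCoordinator coordinator) {
        // Identical handling whether the envelope came from topic-a
        // (historical events) or from the live events topic.
        String userId = (String) envelope.getKey();
        Long count = pageviews.get(userId);
        pageviews.put(userId, count == null ? 1L : count + 1);
      }
    }

The store itself would need the usual stores.pageviews.* configuration (factory, serdes, changelog) alongside the input settings sketched earlier.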