Hi Thomas,

That definitely seems like a good approach - we just need to figure out the
details of consuming old events from HDFS and then seamlessly switching
over to Kafka for newer events. It seems like some new Samza components
would need to be built to do this.
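
To make that concrete, here's a rough sketch of the kind of component I
mean: a custom SystemConsumer that replays line-oriented files from the
Camus output directory in HDFS. Everything here is hypothetical (the
class name, the single-partition-per-directory assumption, the file:line
offsets) - it's just to show the shape, not a finished implementation.
It builds on Samza's BlockingEnvelopeMap helper and the Hadoop
FileSystem API:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.SystemStreamPartition;
import org.apache.samza.util.BlockingEnvelopeMap;

// Hypothetical sketch: replays historical events from HDFS into Samza.
// Assumes newline-delimited events and one partition per directory.
public class HdfsSystemConsumer extends BlockingEnvelopeMap {
  private final Path inputDir; // e.g. the Camus output dir for one topic
  private final Set<SystemStreamPartition> ssps =
      ConcurrentHashMap.newKeySet();
  private volatile boolean running = false;

  public HdfsSystemConsumer(String inputDir) {
    this.inputDir = new Path(inputDir);
  }

  @Override
  public void register(SystemStreamPartition ssp, String offset) {
    super.register(ssp, offset); // sets up the per-partition queue
    ssps.add(ssp);
    // A real version would map 'offset' back to a file + position
    // so a restarted job can resume from its checkpoint.
  }

  @Override
  public void start() {
    running = true;
    new Thread(this::readFiles, "hdfs-reader").start();
  }

  @Override
  public void stop() {
    running = false;
  }

  // Reads each file under inputDir line by line and feeds the lines
  // into the per-partition queues that poll() drains.
  private void readFiles() {
    try {
      FileSystem fs = FileSystem.get(new Configuration());
      for (SystemStreamPartition ssp : ssps) {
        for (FileStatus file : fs.listStatus(inputDir)) {
          long lineNo = 0;
          try (BufferedReader r = new BufferedReader(
              new InputStreamReader(fs.open(file.getPath())))) {
            String line;
            while (running && (line = r.readLine()) != null) {
              // "file:line" as the offset keeps checkpoints meaningful.
              String offset = file.getPath().getName() + ":" + lineNo++;
              put(ssp, new IncomingMessageEnvelope(ssp, offset, null, line));
            }
          }
        }
        setIsAtHead(ssp, true); // caught up - time to switch over to Kafka
      }
    } catch (Exception e) {
      throw new RuntimeException("Failed to read events from HDFS", e);
    }
  }
}

Once setIsAtHead fires for every partition, the job could be redeployed
with task.inputs pointed at the Kafka system to pick up the newer
events - that switchover is the part that still feels fuzzy to me.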

Thanks,
Zach


On Fri, May 29, 2015 at 4:32 PM Thomas Bernhardt
<bernhardt...@yahoo.com.invalid> wrote:

> I think the application would want to replay historical events into Samza.
> I.e., the application can replay any events older than X days from HDFS into
> Samza. Once Samza has processed the historical events, the application can
> switch input to the Kafka queue to process the more recent and finally the
> currently-arriving events. This way the Samza code can stay the same.
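>
> A sketch of what the two phases might look like in the job config (the
> hdfs system name, stream name, and HdfsSystemFactory class are
> hypothetical; only the Kafka factory class is the real one):
>
> # Phase 1: replay history from the hypothetical hdfs system
> task.inputs=hdfs.user-activity
> systems.hdfs.samza.factory=com.example.samza.HdfsSystemFactory
>
> # Phase 2: redeploy with input switched to Kafka
> task.inputs=kafka.user-activity
> systems.kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory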
>
> Best regards,
> Tom
>
> From: Zach Cox <zcox...@gmail.com>
>  To: dev@samza.apache.org
>  Sent: Friday, May 29, 2015 5:20 PM
>  Subject: Re: Reprocessing old events no longer in Kafka
>
> Let's also add to the story: say the company wants to only write code for
> Samza, and not duplicate the same code in MapReduce jobs (or any other
> framework).
>
>
>
> On Fri, May 29, 2015 at 4:16 PM Benjamin Black <b...@b3k.us> wrote:
>
> > Why not run a MapReduce job on the data in HDFS? That's what it was made
> > for.
> > On May 29, 2015 2:13 PM, "Zach Cox" <zcox...@gmail.com> wrote:
> >
> > > Hi -
> > >
> > > Let's say one day a company wants to start doing all of this awesome
> > > data integration/near-real-time stream processing stuff, so they start
> > > sending their user activity events (e.g. pageviews, ad impressions,
> > > etc.) to Kafka. Then they hook up Camus to copy new events from Kafka
> > > to HDFS every hour. They use the default Kafka log retention period of
> > > 7 days. So after a few months, Kafka has the last 7 days of events, and
> > > HDFS has all events except the newest events not yet transferred by
> > > Camus.
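> > >
> > > (For reference, that 7-day default corresponds to this Kafka broker
> > > setting:
> > >
> > > log.retention.hours=168
> > >
> > > so anything older has already been deleted from the brokers.)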
> > >
> > > Then the company wants to build out a system that uses Samza to process
> > > the user activity events from Kafka and output them to some queryable
> > > data store. If standard Samza reprocessing [1] is used, then only the
> > > last 7 days of events in Kafka get processed and put into the data
> > > store. Of course, then all future events also seamlessly get processed
> > > by the Samza jobs and put into the data store, which is awesome.
> > >
> > > But let's say this company needs all of the historical events to be
> > > processed by Samza and put into the data store (i.e. the events older
> > > than 7 days that are in HDFS but no longer in Kafka). It's a Business
> > > Critical thing and absolutely must happen. How should this company
> > > achieve this?
> > >
> > > I'm sure there are many potential solutions to this problem, but has
> > > anyone actually done this? What approach did you take?
> > >
> > > Any experiences or thoughts would be hugely appreciated.
> > >
> > > Thanks,
> > > Zach
> > >
> > > [1]
> > > http://samza.apache.org/learn/documentation/0.9/jobs/reprocessing.html
> > >
> >
>
>
>
