Hi Navina,

I did see that JIRA and it would definitely be useful. I was thinking of
maybe trying to build a composite stream that would first read old events
from HDFS and then switch over to Kafka.
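
For what it's worth, here is a very rough sketch of the shape I have in
mind, assuming Samza 0.9's SystemConsumer interface (start/stop/register/
poll). The HdfsSystemConsumer it wraps is made up - it's essentially what
SAMZA-263 would provide - and the switch-over logic is deliberately naive:

import java.util.List;
import java.util.Map;
import java.util.Set;

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.SystemConsumer;
import org.apache.samza.system.SystemStreamPartition;

// Rough sketch only. The hdfsConsumer is hypothetical (roughly what
// SAMZA-263 would provide); the kafkaConsumer is whatever the kafka
// system factory builds.
public class CompositeSystemConsumer implements SystemConsumer {
  private final SystemConsumer hdfsConsumer;   // replays old events Camus wrote to HDFS
  private final SystemConsumer kafkaConsumer;  // takes over for recent + new events
  private boolean hdfsExhausted = false;

  public CompositeSystemConsumer(SystemConsumer hdfsConsumer,
                                 SystemConsumer kafkaConsumer) {
    this.hdfsConsumer = hdfsConsumer;
    this.kafkaConsumer = kafkaConsumer;
  }

  public void start() {
    hdfsConsumer.start();
    kafkaConsumer.start();
  }

  public void stop() {
    hdfsConsumer.stop();
    kafkaConsumer.stop();
  }

  public void register(SystemStreamPartition ssp, String offset) {
    // Register with both sides; the Kafka offset would somehow have to
    // line up with where the HDFS data ends (the hard part of this idea).
    hdfsConsumer.register(ssp, offset);
    kafkaConsumer.register(ssp, offset);
  }

  public Map<SystemStreamPartition, List<IncomingMessageEnvelope>> poll(
      Set<SystemStreamPartition> ssps, long timeout) throws InterruptedException {
    if (!hdfsExhausted) {
      Map<SystemStreamPartition, List<IncomingMessageEnvelope>> msgs =
          hdfsConsumer.poll(ssps, timeout);
      if (!msgs.isEmpty()) {
        return msgs;
      }
      // Naive switch-over: treat an empty poll as "HDFS fully replayed".
      hdfsExhausted = true;
    }
    return kafkaConsumer.poll(ssps, timeout);
  }
}

The part I haven't figured out is the hand-off itself: making sure the
HDFS data ends exactly where the Kafka retention window begins, so nothing
gets skipped or double-processed. That's why a real HDFS SystemConsumer
(SAMZA-263) would help.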

Do you know if there has been any movement on treating HDFS as a Samza
stream?

Thanks,
Zach

On Fri, May 29, 2015 at 4:27 PM Navina Ramesh <[email protected]>
wrote:

> Hi Zach,
>
> It sounds like you are asking for a SystemConsumer for hdfs. Does
> SAMZA-263 match your requirements?
>
> Thanks!
> Navina
>
> On 5/29/15, 2:23 PM, "Zach Cox" <[email protected]> wrote:
>
> >(continuing from previous email) in addition to not wanting to duplicate
> >code, say that some of the Samza jobs need to build up state, and it's
> >important to build up this state from all of those old events no longer in
> >Kafka. If that state was only built from the last 7 days of events, some
> >things would be missing and the data would be incomplete.
> >
> >On Fri, May 29, 2015 at 4:20 PM Zach Cox <[email protected]> wrote:
> >
> >> Let's also add to the story: say the company wants to only write code
> >> for Samza, and not duplicate the same code in MapReduce jobs (or any
> >> other framework).
> >>
> >> On Fri, May 29, 2015 at 4:16 PM Benjamin Black <[email protected]> wrote:
> >>
> >>> Why not run a MapReduce job on the data in HDFS? That's what it was made for.
> >>> On May 29, 2015 2:13 PM, "Zach Cox" <[email protected]> wrote:
> >>>
> >>> > Hi -
> >>> >
> >>> > Let's say one day a company wants to start doing all of this
> >>> > awesome data integration/near-real-time stream processing stuff,
> >>> > so they start sending their user activity events (e.g. pageviews,
> >>> > ad impressions, etc) to Kafka. Then they hook up Camus to copy new
> >>> > events from Kafka to HDFS every hour. They use the default Kafka
> >>> > log retention period of 7 days. So after a few months, Kafka has
> >>> > the last 7 days of events, and HDFS has all events except the
> >>> > newest events not yet transferred by Camus.
> >>> >
> >>> > Then the company wants to build out a system that uses Samza to
> >>> > process the user activity events from Kafka and output them to
> >>> > some queryable data store. If standard Samza reprocessing [1] is
> >>> > used, then only the last 7 days of events in Kafka get processed
> >>> > and put into the data store. Of course, then all future events
> >>> > also seamlessly get processed by the Samza jobs and put into the
> >>> > data store, which is awesome.
> >>> >
> >>> > But let's say this company needs all of the historical events to
> >>> > be processed by Samza and put into the data store (i.e. the events
> >>> > older than 7 days that are in HDFS but no longer in Kafka). It's a
> >>> > Business Critical thing and absolutely must happen. How should
> >>> > this company achieve this?
> >>> >
> >>> > I'm sure there are many potential solutions to this problem, but
> >>> > has anyone actually done this? What approach did you take?
> >>> >
> >>> > Any experiences or thoughts would be hugely appreciated.
> >>> >
> >>> > Thanks,
> >>> > Zach
> >>> >
> >>> > [1] http://samza.apache.org/learn/documentation/0.9/jobs/reprocessing.html
> >>> >
> >>>
> >>
>
>
