That said, since we don't yet support consuming from HDFS, one workaround
would be to periodically read from HDFS and pump the data into a Kafka
topic (say, topic A) using a Hadoop/YARN-based job. Then, in your Samza
job, you can bootstrap from topic A and continue processing the latest
messages from the other Kafka topic.
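
For example, the bootstrap half of that workaround is just stream
configuration. A rough sketch (the topic names here are made up):

  # Consume the backfill topic alongside the live topic
  task.inputs=kafka.topic-A,kafka.user-activity

  # Samza reads a bootstrap stream up to its current head before it
  # processes messages from any other input stream
  systems.kafka.streams.topic-A.samza.bootstrap=true

  # Start from the earliest available offset so the entire backfill
  # topic is consumed
  systems.kafka.streams.topic-A.samza.offset.default=oldest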

Thanks!
Navina

On 5/29/15, 2:26 PM, "Navina Ramesh" <nram...@linkedin.com> wrote:

>Hi Zach,
>
>It sounds like you are asking for a SystemConsumer for HDFS. Does
>SAMZA-263 match your requirements?
>
>Thanks!
>Navina
>
>On 5/29/15, 2:23 PM, "Zach Cox" <zcox...@gmail.com> wrote:
>
>>(continuing from previous email) In addition to not wanting to
>>duplicate code, say that some of the Samza jobs need to build up state,
>>and it's important to build that state from all of the old events no
>>longer in Kafka. If that state were built from only the last 7 days of
>>events, some things would be missing and the data would be incomplete.
>>
>>On Fri, May 29, 2015 at 4:20 PM Zach Cox <zcox...@gmail.com> wrote:
>>
>>> Let's also add to the story: say the company wants to write code only
>>> for Samza, and not duplicate the same code in MapReduce jobs (or any
>>> other framework).
>>>
>>> On Fri, May 29, 2015 at 4:16 PM Benjamin Black <b...@b3k.us> wrote:
>>>
>>>> Why not run a MapReduce job on the data in HDFS? That's what it was
>>>> made for.
>>>> On May 29, 2015 2:13 PM, "Zach Cox" <zcox...@gmail.com> wrote:
>>>>
>>>> > Hi -
>>>> >
>>>> > Let's say one day a company wants to start doing all of this
>>>> > awesome data integration/near-real-time stream processing stuff,
>>>> > so they start sending their user activity events (e.g. pageviews,
>>>> > ad impressions, etc.) to Kafka.
>>>> > Then they hook up Camus to copy new events from Kafka to HDFS
>>>> > every hour. They use the default Kafka log retention period of 7
>>>> > days. So after a few months, Kafka has the last 7 days of events,
>>>> > and HDFS has all events except the newest events not yet
>>>> > transferred by Camus.
>>>> >
>>>> > Then the company wants to build out a system that uses Samza to
>>>> > process the user activity events from Kafka and output them to
>>>> > some queryable data store. If standard Samza reprocessing [1] is
>>>> > used, then only the last 7 days of events in Kafka get processed
>>>> > and put into the data store. Of course, all future events also
>>>> > seamlessly get processed by the Samza jobs and put into the data
>>>> > store, which is awesome.
>>>> >
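>>>> > For concreteness, that reprocessing approach amounts to restarting
>>>> > the job with its input offsets reset to the oldest available data,
>>>> > roughly like this (a sketch; the topic name is made up):
>>>> >
>>>> >   systems.kafka.streams.user-activity.samza.reset.offset=true
>>>> >   systems.kafka.streams.user-activity.samza.offset.default=oldest
>>>> >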
>>>> > But let's say this company needs all of the historical events to
>>>> > be processed by Samza and put into the data store (i.e. the events
>>>> > older than 7 days that are in HDFS but no longer in Kafka). It's a
>>>> > Business Critical thing and absolutely must happen. How should
>>>> > this company achieve this?
>>>> >
>>>> > I'm sure there are many potential solutions to this problem, but
>>>> > has anyone actually done this? What approach did you take?
>>>> >
>>>> > Any experiences or thoughts would be hugely appreciated.
>>>> >
>>>> > Thanks,
>>>> > Zach
>>>> >
>>>> > [1]
>>>> > http://samza.apache.org/learn/documentation/0.9/jobs/reprocessing.html
>>>> >
>>>>
>>>
>
