Hi Zach,

It sounds like you are asking for a SystemConsumer for HDFS. Does
SAMZA-263 match your requirements?

Thanks!
Navina
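
For context, a SystemConsumer pulls messages from an external system into
Samza by implementing the org.apache.samza.system.SystemConsumer interface.
Below is a minimal sketch of what an HDFS-backed consumer could look like;
the HdfsSystemConsumer class name and the file-reading details are
illustrative assumptions, not the actual SAMZA-263 implementation:

    import java.util.HashMap;
    import java.util.LinkedList;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.system.SystemConsumer;
    import org.apache.samza.system.SystemStreamPartition;

    public class HdfsSystemConsumer implements SystemConsumer {
      // Offset each registered partition should resume reading from.
      private final Map<SystemStreamPartition, String> offsets =
          new HashMap<SystemStreamPartition, String>();

      public void register(SystemStreamPartition ssp, String startingOffset) {
        // Samza calls this before start() with the last checkpointed offset.
        offsets.put(ssp, startingOffset);
      }

      public void start() {
        // Open HDFS readers here, e.g. via org.apache.hadoop.fs.FileSystem,
        // seeking each one to its registered offset.
      }

      public void stop() {
        // Close HDFS readers.
      }

      public Map<SystemStreamPartition, List<IncomingMessageEnvelope>> poll(
          Set<SystemStreamPartition> ssps, long timeout)
          throws InterruptedException {
        Map<SystemStreamPartition, List<IncomingMessageEnvelope>> result =
            new HashMap<SystemStreamPartition, List<IncomingMessageEnvelope>>();
        for (SystemStreamPartition ssp : ssps) {
          List<IncomingMessageEnvelope> batch =
              new LinkedList<IncomingMessageEnvelope>();
          // Read the next records from this partition's HDFS file(s) and wrap
          // each as new IncomingMessageEnvelope(ssp, offset, key, message).
          result.put(ssp, batch);
        }
        return result;
      }
    }

A matching SystemFactory and SystemAdmin would also be needed so the job
can discover partitions (e.g. one per HDFS file or directory).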

On 5/29/15, 2:23 PM, "Zach Cox" <zcox...@gmail.com> wrote:

>(Continuing from the previous email.) In addition to not wanting to
>duplicate code, say that some of the Samza jobs need to build up state,
>and it's important to build that state from all of the old events no
>longer in Kafka. If the state were built from only the last 7 days of
>events, it would be incomplete.
>
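
As an aside, state like this is typically kept in a Samza key-value store
backed by a Kafka changelog, so it survives restarts. A minimal sketch of
such a task, assuming a store named "user-state" is configured in the job
properties (the task and store names here are hypothetical):

    import org.apache.samza.config.Config;
    import org.apache.samza.storage.kv.KeyValueStore;
    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.task.InitableTask;
    import org.apache.samza.task.MessageCollector;
    import org.apache.samza.task.StreamTask;
    import org.apache.samza.task.TaskContext;
    import org.apache.samza.task.TaskCoordinator;

    public class UserActivityStateTask implements StreamTask, InitableTask {
      private KeyValueStore<String, Integer> store;

      @SuppressWarnings("unchecked")
      public void init(Config config, TaskContext context) {
        // "user-state" must match a stores.user-state.* entry in the config.
        store = (KeyValueStore<String, Integer>) context.getStore("user-state");
      }

      public void process(IncomingMessageEnvelope envelope,
          MessageCollector collector, TaskCoordinator coordinator) {
        // Count events per user; any events already expired out of Kafka
        // are permanently missing from these counts.
        String userId = (String) envelope.getKey();
        Integer count = store.get(userId);
        store.put(userId, count == null ? 1 : count + 1);
      }
    }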
>On Fri, May 29, 2015 at 4:20 PM Zach Cox <zcox...@gmail.com> wrote:
>
>> Let's also add to the story: say the company wants to only write code
>> for Samza, and not duplicate the same code in MapReduce jobs (or any
>> other framework).
>>
>> On Fri, May 29, 2015 at 4:16 PM Benjamin Black <b...@b3k.us> wrote:
>>
>>> Why not run a MapReduce job on the data in HDFS? That's what it was
>>> made for.
>>> On May 29, 2015 2:13 PM, "Zach Cox" <zcox...@gmail.com> wrote:
>>>
>>> > Hi -
>>> >
>>> > Let's say one day a company wants to start doing all of this awesome
>>> > data integration/near-real-time stream processing stuff, so they
>>> > start sending their user activity events (e.g. pageviews, ad
>>> > impressions, etc.) to Kafka. Then they hook up Camus to copy new
>>> > events from Kafka to HDFS every hour. They use the default Kafka log
>>> > retention period of 7 days. So after a few months, Kafka has the last
>>> > 7 days of events, and HDFS has all events except the newest ones not
>>> > yet transferred by Camus.
>>> >
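
For reference, that 7-day window is Kafka's default broker retention, set
in the broker's server.properties as:

    # Delete log segments older than 7 days (the Kafka default).
    log.retention.hours=168

Raising this (or log.retention.bytes) only stretches the window; it does
not make the full history replayable.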
>>> > Then the company wants to build out a system that uses Samza to
>>> > process the user activity events from Kafka and output them to some
>>> > queryable data store. If standard Samza reprocessing [1] is used,
>>> > then only the last 7 days of events in Kafka get processed and put
>>> > into the data store. Of course, all future events also seamlessly
>>> > get processed by the Samza jobs and put into the data store, which
>>> > is awesome.
>>> >
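
For concreteness, the reprocessing described in [1] amounts to telling each
input stream to ignore its checkpoints and start from the oldest offset
Kafka still retains. A sketch of the relevant job config, assuming a Kafka
system named "kafka" and a hypothetical stream name:

    # Ignore the existing checkpoint for this stream on the next deploy...
    systems.kafka.streams.user-activity.samza.reset.offset=true
    # ...and start from the oldest offset still in Kafka (~7 days back here).
    systems.kafka.streams.user-activity.samza.offset.default=oldest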
>>> > But let's say this company needs all of the historical events to be
>>> > processed by Samza and put into the data store (i.e. the events
>>> > older than 7 days that are in HDFS but no longer in Kafka). It's a
>>> > Business Critical thing and absolutely must happen. How should this
>>> > company achieve this?
>>> >
>>> > I'm sure there are many potential solutions to this problem, but has
>>> > anyone actually done this? What approach did you take?
>>> >
>>> > Any experiences or thoughts would be hugely appreciated.
>>> >
>>> > Thanks,
>>> > Zach
>>> >
>>> > [1]
>>> > http://samza.apache.org/learn/documentation/0.9/jobs/reprocessing.html
>>> >
>>>
>>
