Hi Zach,

In my humble opinion (though I don't have much experience with Samza), a possible solution could be a hybrid one: use an HBase Coprocessor together with a Samza application that processes the events and stores the results in HBase. That way you have the history available to query.
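Below is a minimal sketch of the Samza-to-HBase half of that idea, assuming Samza 0.9-era task APIs and the HBase 1.x client. The table name "events", column family "e", and the class itself are made-up placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.samza.config.Config;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

// Hypothetical sketch: a Samza task that writes each processed event into HBase.
public class EventToHBaseTask implements StreamTask, InitableTask {
  private Table table;

  @Override
  public void init(Config config, TaskContext context) throws Exception {
    // Connect to HBase once per task instance; the quorum comes from hbase-site.xml on the classpath.
    Configuration hbaseConf = HBaseConfiguration.create();
    Connection connection = ConnectionFactory.createConnection(hbaseConf);
    table = connection.getTable(TableName.valueOf("events"));
  }

  @Override
  public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
                      TaskCoordinator coordinator) throws Exception {
    // Assumes keyed messages: row key = Kafka message key, raw event stored so history stays queryable.
    Put put = new Put(Bytes.toBytes(envelope.getKey().toString()));
    put.addColumn(Bytes.toBytes("e"), Bytes.toBytes("payload"),
        Bytes.toBytes(envelope.getMessage().toString()));
    table.put(put);
  }
}

A coprocessor on that same table could then maintain derived views as results land, but that part depends heavily on your schema.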
Thanks.

2015-05-29 19:36 GMT-03:00 Navina Ramesh <nram...@linkedin.com.invalid>:

> Hi Zach,
>
> Yes. Bootstrapping from hdfs directly is possible, provided you have an
> hdfs consumer defined (SAMZA-263 is a pre-requisite). :)
>
> If you have a way of getting hdfs data into Kafka, then you can use it as
> a bootstrap stream. However, a bootstrap stream is a temporary privilege
> and always consumes all messages in the stream.
>
> If you want to use the offsets stored by Camus to determine the right
> place to transition to realtime Kafka, then you might consider
> implementing a custom MessageChooser [1]. This will give you more
> fine-grained control over message selection from your input streams.
>
> Cheers!
> Navina
>
> [1]
> https://github.com/apache/samza/blob/master/samza-api/src/main/java/org/apache/samza/system/chooser/MessageChooser.java#L26
>
>
> On 5/29/15, 3:07 PM, "Zach Cox" <zcox...@gmail.com> wrote:
>
> >Hi Navina,
> >
> >Do you mean bootstrapping from hdfs as in [1]? That is an interesting idea
> >I hadn't thought of. Maybe that could be combined with the offsets stored
> >by Camus to determine the right place to transition to the real-time kafka
> >stream?
> >
> >Thanks,
> >Zach
> >
> >[1]
> >http://samza.apache.org/learn/documentation/0.9/container/streams.html#bootstrapping
> >
> >
> >On Fri, May 29, 2015 at 4:53 PM Navina Ramesh
> ><nram...@linkedin.com.invalid> wrote:
> >
> >> Hi Zach,
> >>
> >> I agree. It is not a good idea to keep the entire set of historical data
> >> in Kafka.
> >> The retention period in Kafka does make it trickier to synchronize with
> >> your hadoop data pump. I am not very familiar with the Camus2Kafka
> >> project, but that sounds like a workable solution.
> >>
> >> The ideal solution would be to consume/bootstrap directly from HDFS :)
> >>
> >> Cheers!
> >> Navina
> >>
> >> On 5/29/15, 2:44 PM, "Zach Cox" <zcox...@gmail.com> wrote:
> >>
> >> >Hi Navina,
> >> >
> >> >A similar approach I considered was using an infinite/very large
> >> >retention period on those event kafka topics, so they would always
> >> >contain all historical events. Then standard Samza reprocessing goes
> >> >through all old events.
> >> >
> >> >I'm hesitant to pursue that though, as those topic partitions then grow
> >> >unbounded over time, which seems problematic.
> >> >
> >> >Thanks,
> >> >Zach
> >> >
> >> >On Fri, May 29, 2015 at 4:34 PM Navina Ramesh
> >> ><nram...@linkedin.com.invalid> wrote:
> >> >
> >> >> That said, since we don't yet support consuming from hdfs, one
> >> >> workaround would be to periodically read from hdfs and pump the data
> >> >> to a kafka topic (say topic A) using a hadoop / yarn based job. Then,
> >> >> in your Samza job, you can bootstrap from topic A and then continue
> >> >> processing the latest messages from the other Kafka topic.
> >> >>
> >> >> Thanks!
> >> >> Navina
> >> >>
> >> >> On 5/29/15, 2:26 PM, "Navina Ramesh" <nram...@linkedin.com> wrote:
> >> >>
> >> >> >Hi Zach,
> >> >> >
> >> >> >It sounds like you are asking for a SystemConsumer for hdfs. Does
> >> >> >SAMZA-263 match your requirements?
> >> >> >
> >> >> >Thanks!
> >> >> >Navina
> >> >> >
> >> >> >On 5/29/15, 2:23 PM, "Zach Cox" <zcox...@gmail.com> wrote:
> >> >> >
> >> >> >>(continuing from previous email) in addition to not wanting to
> >> >> >>duplicate code, say that some of the Samza jobs need to build up
> >> >> >>state, and it's important to build up this state from all of those
> >> >> >>old events no longer in Kafka. If that state was only built from the
> >> >> >>last 7 days of events, some things would be missing and the data
> >> >> >>would be incomplete.
> >> >> >>
> >> >> >>On Fri, May 29, 2015 at 4:20 PM Zach Cox <zcox...@gmail.com> wrote:
> >> >> >>
> >> >> >>> Let's also add to the story: say the company wants to only write
> >> >> >>> code for Samza, and not duplicate the same code in MapReduce jobs
> >> >> >>> (or any other framework).
> >> >> >>>
> >> >> >>> On Fri, May 29, 2015 at 4:16 PM Benjamin Black <b...@b3k.us> wrote:
> >> >> >>>
> >> >> >>>> Why not run a map reduce job on the data in hdfs? That's what it
> >> >> >>>> was made for.
> >> >> >>>> On May 29, 2015 2:13 PM, "Zach Cox" <zcox...@gmail.com> wrote:
> >> >> >>>>
> >> >> >>>> > Hi -
> >> >> >>>> >
> >> >> >>>> > Let's say one day a company wants to start doing all of this
> >> >> >>>> > awesome data integration/near-real-time stream processing stuff,
> >> >> >>>> > so they start sending their user activity events (e.g. pageviews,
> >> >> >>>> > ad impressions, etc) to Kafka. Then they hook up Camus to copy
> >> >> >>>> > new events from Kafka to HDFS every hour. They use the default
> >> >> >>>> > Kafka log retention period of 7 days. So after a few months,
> >> >> >>>> > Kafka has the last 7 days of events, and HDFS has all events
> >> >> >>>> > except the newest events not yet transferred by Camus.
> >> >> >>>> >
> >> >> >>>> > Then the company wants to build out a system that uses Samza to
> >> >> >>>> > process the user activity events from Kafka and output them to
> >> >> >>>> > some queryable data store. If standard Samza reprocessing [1] is
> >> >> >>>> > used, then only the last 7 days of events in Kafka get processed
> >> >> >>>> > and put into the data store. Of course, then all future events
> >> >> >>>> > also seamlessly get processed by the Samza jobs and put into the
> >> >> >>>> > data store, which is awesome.
> >> >> >>>> >
> >> >> >>>> > But let's say this company needs all of the historical events to
> >> >> >>>> > be processed by Samza and put into the data store (i.e. the
> >> >> >>>> > events older than 7 days that are in HDFS but no longer in
> >> >> >>>> > Kafka). It's a Business Critical thing and absolutely must
> >> >> >>>> > happen. How should this company achieve this?
> >> >> >>>> >
> >> >> >>>> > I'm sure there are many potential solutions to this problem, but
> >> >> >>>> > has anyone actually done this? What approach did you take?
> >> >> >>>> >
> >> >> >>>> > Any experiences or thoughts would be hugely appreciated.
> >> >> >>>> >
> >> >> >>>> > Thanks,
> >> >> >>>> > Zach
> >> >> >>>> >
> >> >> >>>> > [1]
> >> >> >>>> > http://samza.apache.org/learn/documentation/0.9/jobs/reprocessing.html
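For reference, a minimal sketch of the custom MessageChooser approach Navina describes above: buffer incoming envelopes and serve the "historical" topic (e.g. one filled from HDFS by a periodic job) before the real-time topic. The topic name events-historical and the class name are hypothetical, and a MessageChooserFactory would still need to be registered in the job configuration to activate it.

import java.util.ArrayDeque;
import java.util.Deque;

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.SystemStreamPartition;
import org.apache.samza.system.chooser.MessageChooser;

// Hypothetical sketch: prefer the HDFS-derived "historical" topic until it is drained.
public class HistoryFirstChooser implements MessageChooser {
  private final Deque<IncomingMessageEnvelope> historical = new ArrayDeque<>();
  private final Deque<IncomingMessageEnvelope> realtime = new ArrayDeque<>();

  @Override
  public void update(IncomingMessageEnvelope envelope) {
    // Buffer envelopes by source topic; "events-historical" is an assumed name.
    if ("events-historical".equals(envelope.getSystemStreamPartition().getStream())) {
      historical.add(envelope);
    } else {
      realtime.add(envelope);
    }
  }

  @Override
  public IncomingMessageEnvelope choose() {
    // Serve buffered historical messages first; poll() returns null when both
    // queues are empty, which tells the container there is nothing to process yet.
    IncomingMessageEnvelope next = historical.poll();
    return next != null ? next : realtime.poll();
  }

  @Override
  public void register(SystemStreamPartition systemStreamPartition, String offset) {}

  @Override
  public void start() {}

  @Override
  public void stop() {}
}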