Michael,
Why not have a pool of workers outside of Samza push the raw crawler
input, or a subset of it, into a Kafka topic, and then have Samza do the
compute/stream work? In my opinion, Samza is not the right tool for what
you're suggesting, but it could be used for the downstream work.
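
A rough sketch of that worker-pool idea, in Python purely for illustration.
Everything named here is an assumption: `fetch` is a placeholder that only
simulates latency (no real HTTP), `publish` stands in for whatever Kafka
producer call you would use, and `max_in_flight` is the "not more than X
requests" cap mentioned further down the thread.

```python
import asyncio


async def fetch(url):
    # Hypothetical placeholder for a real HTTP fetch; sleeps to simulate latency.
    await asyncio.sleep(0.01)
    return {"url": url, "title": "title of " + url}


async def crawl(urls, publish, max_in_flight=3):
    # The semaphore caps simultaneous fetches, a simple form of back-pressure.
    limit = asyncio.Semaphore(max_in_flight)
    stats = {"in_flight": 0, "peak": 0}

    async def worker(url):
        async with limit:
            stats["in_flight"] += 1
            stats["peak"] = max(stats["peak"], stats["in_flight"])
            try:
                record = await fetch(url)
            finally:
                stats["in_flight"] -= 1
        # Stand-in for a Kafka producer send (assumption, not a real client call).
        publish(record)

    await asyncio.gather(*(worker(u) for u in urls))
    return stats["peak"]


def run(urls, max_in_flight=3):
    # Collects "published" records in a list instead of a real topic.
    published = []
    peak = asyncio.run(crawl(urls, published.append, max_in_flight))
    return published, peak
```

The Samza job would then consume the topic these workers write to, so the
slow, blocking part stays outside the single-threaded task loop.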
-Jordan

On Mon, Sep 21, 2015 at 9:08 AM, Ken Krugler <[email protected]>
wrote:

> Hi Michael (& Navina),
>
> I don't think you need to create a separate background process, at least
> for the case of web crawling.
>
> The challenge is to efficiently use one Samza process to simultaneously
> fetch many URLs.
>
> Which does increase the complexity of that process's code, as you wind up
> having to manage either a multi-threaded or async fetch state.
>
> But that's the same as for Hadoop-based crawlers, where you have a limited
> number of parallel reduce tasks that are doing the fetching - see Nutch and
> Bixo for examples, e.g. FetchBuffer.
>
> And it's the same for storm-crawler, another project I've been involved
> with in the past.
>
> -- Ken
>
> > From: Michael Sklyar
> > Sent: September 21, 2015 5:19:52am PDT
> > To: [email protected]
> > Subject: Re: Asynchronous approach and samza
> >
> > Thanks Navina,
> > it is much clearer now.
> >
> > Unfortunately, in our case, we cannot bootstrap the data in advance (we
> > can't pre-fetch the titles and headers of all existing URLs).
> > It sounds to me that, if we want to use Samza, we will need a background
> > process that is synchronized with the main event loop of the task
> > (and handles back-pressure so that no more than X requests are made
> > simultaneously).
> >
> >
> > Regards,
> > Michael
> >
> > On Mon, Sep 21, 2015 at 12:24 PM, Navina Ramesh <
> > [email protected]> wrote:
> >
> >> Hi Michael,
> >> {quote}
> >> Do you mean that in such a case Samza should be combined with another
> >> Stream processing framework (such as Storm)?
> >> {quote}
> >> No. I didn't mean combining it with any other framework.
> >>
> >> {quote}
> >> "the job bootstraps the data from the source" - do you mean that
> >> you have a background process for this purpose or just listen to an
> >> additional stream of change log from some other framework?
> >> {quote}
> >> I didn't mean a background process. I meant just listening to a
> >> change log stream from a data source.
> >>
> >> At LinkedIn, we use databus. A job configures databus (for a given
> >> data source) as one of its input streams. Databus is a source-agnostic
> >> distributed change data capture system. You can find more
> >> information here <https://github.com/linkedin/databus>. The advantage is
> >> that the databus client is capable of "bootstrapping" from the source
> >> automatically and then switching to simply capturing changes from the
> >> data source. In this scenario, Samza doesn't do anything special, except
> >> that it will continue consuming from the databus stream while
> >> bootstrapping. Once bootstrap is complete, the job can start processing
> >> events from other input streams as well.
> >>
> >> I hope my explanation clarifies your question. :)
> >>
> >> Thanks!
> >> Navina
> >>
> >>
> >> On Mon, Sep 21, 2015 at 1:56 AM, Michael Sklyar <[email protected]>
> >> wrote:
> >>
> >>> Thank you for your replies,
> >>>
> >>> I understand that making an external blocking request in a single event
> >>> thread will result in extremely low throughput. However, this can be
> >>> solved by a multi-threaded and/or asynchronous approach. It is clear
> >>> that using external services can never achieve the throughput of simple
> >>> transformations. However, most stream processing jobs need, from time to
> >>> time, to query some external storage, web service, etc.
> >>>
> >>> Do you mean that in such a case Samza should be combined with another
> >>> Stream processing framework (such as Storm)?
> >>>
> >>> Navina, "the job bootstraps the data from the source" - do you mean that
> >>> you have a background process for this purpose or just listen to an
> >>> additional stream of change log from some other framework?
> >>>
> >>> Thanks,
> >>> Michael
> >>>
> >>> On Mon, Sep 21, 2015 at 6:52 AM, Navina Ramesh
> >>> <[email protected]> wrote:
> >>>
> >>>> Hi Michael,
> >>>> I agree with what Yan said. While nothing stops you from doing it, it
> >>>> is not encouraged, as it affects throughput and realtime processing.
> >>>>
> >>>> {quote}
> >>>> It seems that Samza design suits very well "data transformation"
> >>>> scenarios, what is not clear is how well can it support external services?
> >>>> {quote}
> >>>> We have some similar use cases at LinkedIn where the Samza jobs need to
> >>>> query external data sources. We use a pattern where the job
> >>>> bootstraps the data from the source using a change-capture system like
> >>>> databus and buffers it locally before processing from input streams.
> >>>> Depending on the scale of your data, this model may or may not work for
> >>>> you. However, there is no built-in support for this in Samza.
> >>>>
> >>>> Thanks!
> >>>> Navina
> >>>>
> >>>> On Sun, Sep 20, 2015 at 7:55 PM, Yan Fang <[email protected]> wrote:
> >>>>
> >>>>> Hi Michael,
> >>>>>
> >>>>> Samza is designed for high-throughput, realtime processing. If you
> >>>>> use HTTP requests or an external service, you may not achieve the same
> >>>>> performance as without them. However, technically speaking, there is
> >>>>> nothing stopping you from doing this (well, it is discouraged anyway :).
> >>>>> Samza does not provide this feature by default, so you may want to be a
> >>>>> little cautious when implementing it.
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> Fang, Yan
> >>>>> [email protected]
> >>>>>
> >>>>> On Sun, Sep 20, 2015 at 4:28 PM, Michael Sklyar <[email protected]>
> >>>>> wrote:
> >>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> What would be the best approach for doing "blocking" operations in
> >>>>>> Samza?
> >>>>>>
> >>>>>> For example, we have a Kafka stream of URLs for which we need to
> >>>>>> gather external data via HTTP (such as the Alexa rank, the page title
> >>>>>> and headers). Other scenarios include database access and decision
> >>>>>> making via a rule engine.
> >>>>>>
> >>>>>> Samza processes messages in a single thread, and HTTP requests might
> >>>>>> take hundreds of milliseconds. With the single-threaded design the
> >>>>>> throughput would be very limited, which can be solved with an
> >>>>>> asynchronous approach. However, the Samza documentation explicitly
> >>>>>> states "*You are strongly discouraged from using threads in your job’s
> >>>>>> code*".
> >>>>>>
> >>>>>> It seems that Samza design suits very well "data transformation"
> >>>>>> scenarios, what is not clear is how well can it support external services?
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Michael Sklyar
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
>


-- 
Jordan Shaw
Full Stack Software Engineer
PubNub Inc
1045 17th St
San Francisco, CA 94107
