Michael,

Why not just have a pool of workers outside of Samza that push the raw crawler input, or a subset of it, into a Kafka topic, and then have Samza do the compute/stream work? Basically, in my opinion, Samza is not the right tool for what you're suggesting, but it could be used for the downstream work.

-Jordan

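
A minimal sketch of the worker-pool idea above, not a definitive implementation. It assumes the blocking fetch runs on a thread pool outside the stream job and that each result is handed to a producer. The names `run_fetch_pool`, `fake_fetch`, and the in-memory `topic` list are hypothetical stand-ins; a real setup would replace `publish` with a Kafka producer's send call and `fetch` with an actual HTTP request.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List


def run_fetch_pool(
    urls: Iterable[str],
    fetch: Callable[[str], Dict[str, str]],
    publish: Callable[[Dict[str, str]], None],
    workers: int = 8,
) -> int:
    """Fetch each URL on a thread pool and publish every result.

    `fetch` stands in for the blocking HTTP call and `publish` for a
    producer's send; both are hypothetical hooks, not real library APIs.
    Returns the number of records published.
    """
    published = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map runs fetch() concurrently and yields results in input order.
        for record in pool.map(fetch, urls):
            publish(record)  # called from this thread only
            published += 1
    return published


def fake_fetch(url: str) -> Dict[str, str]:
    # Placeholder for a real HTTP request; returns a made-up title.
    return {"url": url, "title": "title of " + url}


topic: List[Dict[str, str]] = []  # in-memory stand-in for the Kafka topic
count = run_fetch_pool(
    ["http://a.example", "http://b.example"], fake_fetch, topic.append, workers=2
)
```

With this split, the slow I/O stays in the pool and the stream job only consumes already-fetched records from the topic.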
On Mon, Sep 21, 2015 at 9:08 AM, Ken Krugler <[email protected]> wrote:

> Hi Michael (& Navina),
>
> I don't think you need to create a separate background process, at least
> for the case of web crawling.
>
> The challenge is to efficiently use one Samza process to simultaneously
> fetch many URLs.
>
> Which does increase the complexity of that process's code, as you wind up
> having to manage either a multi-threaded or async fetch state.
>
> But that's the same as for Hadoop-based crawlers, where you have a limited
> number of parallel reduce tasks that are doing the fetching - see Nutch and
> Bixo for examples, e.g. FetchBuffer.
>
> And it's the same for storm-crawler, another project I've been involved
> with in the past.
>
> -- Ken
>
> > From: Michael Sklyar
> > Sent: September 21, 2015 5:19:52am PDT
> > To: [email protected]
> > Subject: Re: Asynchronous approach and samza
> >
> > Thanks Navina,
> > it is much more clear now.
> >
> > Unfortunately, in our case, we cannot bootstrap the data in advance (we
> > can't pre-fetch all existing URLs' titles and headers in advance).
> > It sounds to me that, if we want to use Samza, we will need a background
> > process that will be synchronized with the main event loop of the task
> > (+ handle back-pressure so that no more than X requests can be made
> > simultaneously).
> >
> > Regards,
> > Michael
> >
> > On Mon, Sep 21, 2015 at 12:24 PM, Navina Ramesh <[email protected]> wrote:
> >
> >> Hi Michael,
> >> {quote}
> >> Do you mean that in such a case Samza should be combined with another
> >> Stream processing framework (such as Storm)?
> >> {quote}
> >> No. I didn't mean combining it with any other framework.
> >>
> >> {quote}
> >> "the job bootstraps the data from the source" - do you mean that
> >> you have a background process for this purpose or just listen to an
> >> additional stream of change log from some other framework?
> >> {quote}
> >> I didn't mean a background process. I meant just listening to a stream
> >> of change log from a data source.
> >>
> >> At LinkedIn, we use databus. The jobs will configure databus (for a given
> >> data source) as one of the input streams for the job. Databus is a
> >> source-agnostic distributed change data capture system. You can find more
> >> information here <https://github.com/linkedin/databus>. The advantage is
> >> that the databus client is capable of "bootstrapping" from the source
> >> automatically and then switching to simply capturing changes from the
> >> data source. In this scenario, Samza doesn't do anything special, except
> >> that it will continue consuming from the databus stream when
> >> bootstrapping. Once bootstrap is complete, the job can start processing
> >> events from other input streams as well.
> >>
> >> I hope my explanation clarifies your question. :)
> >>
> >> Thanks!
> >> Navina
> >>
> >> On Mon, Sep 21, 2015 at 1:56 AM, Michael Sklyar <[email protected]> wrote:
> >>
> >>> Thank you for your replies,
> >>>
> >>> I understand that making an external blocking request in a single event
> >>> thread will result in extremely low throughput. However, this can be
> >>> solved by a multi-threaded and/or asynchronous approach. It is clear
> >>> that in any case using external services can never achieve the
> >>> throughput of simple transformations. However, most stream processing
> >>> needs, from time to time, to query some external storage, web service
> >>> etc...
> >>>
> >>> Do you mean that in such a case Samza should be combined with another
> >>> Stream processing framework (such as Storm)?
> >>>
> >>> Navina, "the job bootstraps the data from the source" - do you mean that
> >>> you have a background process for this purpose or just listen to an
> >>> additional stream of change log from some other framework?
> >>>
> >>> Thanks,
> >>> Michael
> >>>
> >>> On Mon, Sep 21, 2015 at 6:52 AM, Navina Ramesh <[email protected]> wrote:
> >>>
> >>>> Hi Michael,
> >>>> I agree with what Yan said. While nothing stops you from doing it, it
> >>>> is not encouraged as it affects throughput and realtime processing.
> >>>>
> >>>> {quote}
> >>>> It seems that Samza design suits very well "data transformation"
> >>>> scenarios, what is not clear is how well can it support external
> >>>> services?
> >>>> {quote}
> >>>> We have some similar use-cases at LinkedIn where the Samza jobs need to
> >>>> query external data sources. We do use a pattern where the job
> >>>> bootstraps the data from the source using a change-capture system like
> >>>> databus and buffers it locally, before processing from input streams.
> >>>> Depending on the scale of your data, this model may or may not work for
> >>>> you. However, there is no in-built support for this in Samza.
> >>>>
> >>>> Thanks!
> >>>> Navina
> >>>>
> >>>> On Sun, Sep 20, 2015 at 7:55 PM, Yan Fang <[email protected]> wrote:
> >>>>
> >>>>> Hi Michael,
> >>>>>
> >>>>> Samza is designed for high-throughput and realtime processing. If you
> >>>>> are using HTTP request/external service, you may not achieve the same
> >>>>> performance as not using it. However, technically speaking, there is
> >>>>> nothing blocking you from doing this (well, discouraged anyway :).
> >>>>> Samza by default does not provide this feature. So you may want to be
> >>>>> a little cautious when implementing this.
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> Fang, Yan
> >>>>> [email protected]
> >>>>>
> >>>>> On Sun, Sep 20, 2015 at 4:28 PM, Michael Sklyar <[email protected]> wrote:
> >>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> What would be the best approach for doing "blocking" operations in
> >>>>>> Samza?
> >>>>>>
> >>>>>> For example, we have a kafka stream of urls for which we need to
> >>>>>> gather external data via HTTP (such as alexa rank, get the page title
> >>>>>> and headers..). Other scenarios include database access and decision
> >>>>>> making via a rule engine.
> >>>>>>
> >>>>>> Samza processes messages in a single thread, HTTP requests might take
> >>>>>> hundreds of milliseconds. With the single threaded design the
> >>>>>> throughput would be very limited, which can be solved with an
> >>>>>> asynchronous approach.
> >>>>>> However Samza documentation explicitly states
> >>>>>> "*You are strongly discouraged from using threads in your job’s code*".
> >>>>>>
> >>>>>> It seems that Samza design suits very well "data transformation"
> >>>>>> scenarios, what is not clear is how well can it support external
> >>>>>> services?
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Michael Sklyar
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr

--
Jordan Shaw
Full Stack Software Engineer
PubNub Inc
1045 17th St
San Francisco, CA 94107
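
For reference, the in-process alternative discussed in the quoted thread (one process fetching many URLs at once, with back-pressure so that no more than X requests are in flight) might be sketched as follows. This is only an illustration under stated assumptions: it uses Python's `asyncio` rather than any Samza API, and `fetch_all`, `fake_fetch`, and `max_in_flight` are hypothetical names introduced here.

```python
import asyncio
from typing import Awaitable, Callable, List


async def fetch_all(
    urls: List[str],
    fetch: Callable[[str], Awaitable[str]],
    max_in_flight: int,
) -> List[str]:
    """Run fetch() for every URL with at most max_in_flight calls active."""
    gate = asyncio.Semaphore(max_in_flight)

    async def bounded(url: str) -> str:
        async with gate:  # waits here once max_in_flight fetches are running
            return await fetch(url)

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(u) for u in urls))


active = 0  # fetches currently running
peak = 0    # highest concurrency observed


async def fake_fetch(url: str) -> str:
    # Placeholder for a non-blocking HTTP request; also tracks concurrency.
    global active, peak
    active += 1
    peak = max(peak, active)
    await asyncio.sleep(0.01)  # simulated network latency
    active -= 1
    return "title of " + url


urls = ["http://example.com/%d" % i for i in range(10)]
titles = asyncio.run(fetch_all(urls, fake_fetch, max_in_flight=3))
```

The semaphore is what provides the back-pressure: callers queue up at the gate instead of opening an unbounded number of connections.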
