Thanks Navina, it is much clearer now. Unfortunately, in our case we cannot bootstrap the data in advance (we can't pre-fetch the titles and headers of all existing URLs). It sounds to me that, if we want to use Samza, we will need a background process that is synchronized with the main event loop of the task (and handles back-pressure, so that no more than X requests are in flight simultaneously).
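A minimal sketch of such a background process, in plain Java with no Samza dependency. The class name `BoundedAsyncFetcher`, its methods, and the `maxInFlight` parameter are hypothetical and are not part of Samza; the idea is only that a semaphore caps the number of outstanding requests (the "X" above) while finished results are handed back to the task thread through a queue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

/** Hypothetical helper (not a Samza API): runs blocking lookups off the task thread. */
class BoundedAsyncFetcher<K, V> {
    private final Semaphore permits;            // caps in-flight requests at maxInFlight
    private final ExecutorService pool;         // background worker threads
    private final Function<K, V> blockingCall;  // e.g. an HTTP lookup, supplied by the caller
    private final Queue<V> completed = new ConcurrentLinkedQueue<>();

    BoundedAsyncFetcher(int maxInFlight, Function<K, V> blockingCall) {
        this.permits = new Semaphore(maxInFlight);
        this.pool = Executors.newFixedThreadPool(maxInFlight);
        this.blockingCall = blockingCall;
    }

    /** Called from the task thread; blocks while maxInFlight requests are outstanding. */
    void submit(K key) {
        permits.acquireUninterruptibly();       // back-pressure on the caller
        try {
            pool.execute(() -> {
                try {
                    // Assumes a non-null result; a failed call is simply dropped in this sketch.
                    completed.add(blockingCall.apply(key));
                } finally {
                    permits.release();
                }
            });
        } catch (RuntimeException e) {
            permits.release();                  // the pool rejected the work, so free the permit
            throw e;
        }
    }

    /** Called from the task thread; returns the results finished so far without blocking. */
    List<V> drain() {
        List<V> out = new ArrayList<>();
        for (V v; (v = completed.poll()) != null; ) {
            out.add(v);
        }
        return out;
    }

    /** Waits (up to 30 seconds) for outstanding requests, then stops the pool. */
    void close() {
        pool.shutdown();
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

One possible wiring, again an assumption rather than a recommended Samza pattern: call `submit()` from the task's `process()` and `drain()` from `process()` or `window()`, so that all output is still sent from the task thread. Offsets could be checkpointed before the corresponding lookups finish, so a restart may lose in-flight requests unless that is handled separately.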
Regards,
Michael

On Mon, Sep 21, 2015 at 12:24 PM, Navina Ramesh <nram...@linkedin.com.invalid> wrote:

> Hi Michael,
>
> {quote}
> Do you mean that in such a case Samza should be combined with another
> Stream processing framework (such as Storm)?
> {quote}
> No. I didn't mean combining it with any other framework.
>
> {quote}
> "the job bootstraps the data from the source" - do you mean that
> you have a background process for this purpose or just listen to an
> additional stream of change log from some other framework?
> {quote}
> I didn't mean a background process. I meant just listening to a
> change log stream from a data source.
>
> At LinkedIn, we use Databus. The jobs configure Databus (for a given
> data source) as one of the input streams for the job. Databus is a
> source-agnostic distributed change data capture system. You can find more
> information here <https://github.com/linkedin/databus>. The advantage is
> that the Databus client is capable of "bootstrapping" from the source
> automatically and then switching to simply capturing changes from the data
> source. In this scenario, Samza doesn't do anything special, except that it
> will continue consuming from the Databus stream while bootstrapping. Once
> bootstrap is complete, the job can start processing events from other input
> streams as well.
>
> I hope my explanation clarifies your question. :)
>
> Thanks!
> Navina
>
>
> On Mon, Sep 21, 2015 at 1:56 AM, Michael Sklyar <mikesk...@gmail.com>
> wrote:
>
> > Thank you for your replies,
> >
> > I understand that making an external blocking request in a single event
> > thread will result in extremely low throughput. However, this can be solved
> > by multithreading and/or an asynchronous approach. It is clear that in any
> > case using external services can never achieve the throughput of simple
> > transformations. However, most stream processing jobs need, from time to time,
> > to query some external storage, web service, etc.
> >
> > Do you mean that in such a case Samza should be combined with another
> > Stream processing framework (such as Storm)?
> >
> > Navina, "the job bootstraps the data from the source" - do you mean that
> > you have a background process for this purpose or just listen to an
> > additional stream of change log from some other framework?
> >
> > Thanks,
> > Michael
> >
> > On Mon, Sep 21, 2015 at 6:52 AM, Navina Ramesh
> > <nram...@linkedin.com.invalid> wrote:
> >
> > > Hi Michael,
> > > I agree with what Yan said. While nothing stops you from doing it, it is
> > > not encouraged, as it affects throughput and realtime processing.
> > >
> > > {quote}
> > > It seems that Samza design suits very well "data transformation" scenarios,
> > > what is not clear is how well can it support external services?
> > > {quote}
> > > We have some similar use cases at LinkedIn where the Samza jobs need to
> > > query external data sources. We do use a pattern where the job
> > > bootstraps the data from the source using a change-capture system like
> > > Databus and buffers it locally, before processing from input streams.
> > > Depending on the scale of your data, this model may or may not work for
> > > you. However, there is no built-in support for this in Samza.
> > >
> > > Thanks!
> > > Navina
> > >
> > > On Sun, Sep 20, 2015 at 7:55 PM, Yan Fang <yanfang...@gmail.com> wrote:
> > >
> > > > Hi Michael,
> > > >
> > > > Samza is designed for high-throughput and realtime processing. If you are
> > > > using an HTTP request/external service, you may not get the same
> > > > performance as without it. However, technically speaking, there is
> > > > nothing preventing you from doing this (well, discouraged anyway :). Samza
> > > > does not provide this feature by default, so you may want to be a little
> > > > cautious when implementing this.
> > > >
> > > > Thanks,
> > > >
> > > > Fang, Yan
> > > > yanfang...@gmail.com
> > > >
> > > > On Sun, Sep 20, 2015 at 4:28 PM, Michael Sklyar <mikesk...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > What would be the best approach for doing "blocking" operations in Samza?
> > > > >
> > > > > For example, we have a Kafka stream of URLs for which we need to gather
> > > > > external data via HTTP (such as the Alexa rank, the page title and
> > > > > headers). Other scenarios include database access and decision making via
> > > > > a rule engine.
> > > > >
> > > > > Samza processes messages in a single thread, and HTTP requests might take
> > > > > hundreds of milliseconds. With the single-threaded design the throughput
> > > > > would be very limited, which can be solved with an asynchronous approach.
> > > > > However, the Samza documentation explicitly states
> > > > > "*You are strongly discouraged from using threads in your job’s code*".
> > > > >
> > > > > It seems that Samza design suits very well "data transformation" scenarios,
> > > > > what is not clear is how well can it support external services?
> > > > >
> > > > > Thanks,
> > > > > Michael Sklyar
> > > >
> > >
> > > --
> > > Navina R.
> >
>
> --
> Navina R.