Thanks Navina,
it is much clearer now.

Unfortunately, in our case we cannot bootstrap the data in advance (we
can't pre-fetch the titles and headers of all existing URLs).
It sounds to me that, if we want to use Samza, we will need a background
process that is synchronized with the main event loop of the task
and handles back-pressure, so that no more than X requests are in flight
simultaneously.
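
A minimal sketch of the back-pressure part, using only java.util.concurrent.
BoundedLookup and its methods are hypothetical names, not part of Samza; the
lookup function stands in for the HTTP call. The event loop calls submit(),
which blocks once maxInFlight requests are outstanding. A real task would
also have to wait for outstanding futures before its offsets are committed,
which is not shown here.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.function.Function;

/** Hypothetical helper (not a Samza API): caps the number of in-flight lookups. */
public class BoundedLookup<K, V> {
    private final Semaphore permits;      // one permit per allowed in-flight request
    private final ExecutorService pool;   // background threads doing the blocking calls
    private final Function<K, V> lookup;  // assumption: the blocking call, e.g. an HTTP fetch

    public BoundedLookup(int maxInFlight, Function<K, V> lookup) {
        this.permits = new Semaphore(maxInFlight);
        this.pool = Executors.newFixedThreadPool(maxInFlight);
        this.lookup = lookup;
    }

    /** Blocks the caller while maxInFlight lookups are already running. */
    public CompletableFuture<V> submit(K key) {
        try {
            permits.acquire(); // back-pressure: the event loop waits here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a permit", e);
        }
        try {
            return CompletableFuture
                    .supplyAsync(() -> lookup.apply(key), pool)
                    // release the permit whether the lookup succeeded or failed
                    .whenComplete((value, error) -> permits.release());
        } catch (RuntimeException e) {
            permits.release(); // the task was never scheduled
            throw e;
        }
    }

    public void shutdown() {
        pool.shutdown();
    }
}
```

The semaphore is what limits the event loop; the pool size merely matches it
so that every admitted request gets a thread.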


Regards,
Michael

On Mon, Sep 21, 2015 at 12:24 PM, Navina Ramesh <
nram...@linkedin.com.invalid> wrote:

> Hi Michael,
> {quote}
> Do you mean that in such a case Samza should be combined with another
> Stream processing framework (such as Storm)?
> {quote}
> No. I didn't mean combining it with any other framework.
>
> {quote}
> "the job bootstraps the data from the source" - do you mean that
> you have a background process for this purpose or just listen to an
> additional stream of change log from some other framework?
> {quote}
> I didn't mean a background process. I meant just listening to a change-log
> stream from a data source.
>
> At LinkedIn, we use databus. The jobs will configure databus (for a given
> data source) as one of the input streams for the job. Databus is a source
> agnostic distributed change data capture system. You can find more
> information here <https://github.com/linkedin/databus>. The advantage is
> that the databus client is capable of "bootstrapping" from the source
> automatically and then, switching to simply capture changes from the data
> source. In this scenario, Samza doesn't do anything special, except that it
> will continue consuming from the databus stream while bootstrapping. Once
> bootstrap is complete, the job can start processing events from other input
> streams as well.
>
> I hope my explanation clarifies your question. :)
>
> Thanks!
> Navina
>
>
> On Mon, Sep 21, 2015 at 1:56 AM, Michael Sklyar <mikesk...@gmail.com>
> wrote:
>
> > Thank you for your replies,
> >
> > I understand that making an external blocking request in a single event
> > thread will result in extremely low throughput. However, this can be
> > solved by multithreading and/or an asynchronous approach. It is clear
> > that, in any case, using external services can never achieve the
> > throughput of simple transformations. However, most stream processing
> > jobs need, from time to time, to query some external storage, web
> > service, etc.
> >
> > Do you mean that in such a case Samza should be combined with another
> > Stream processing framework (such as Storm)?
> >
> > Navina, "the job bootstraps the data from the source" - do you mean that
> > you have a background process for this purpose or just listen to an
> > additional stream of change log from some other framework?
> >
> > Thanks,
> > Michael
> >
> > On Mon, Sep 21, 2015 at 6:52 AM, Navina Ramesh
> > <nram...@linkedin.com.invalid> wrote:
> >
> > > Hi Michael,
> > > I agree with what Yan said. While nothing stops you from doing it, it
> > > is not encouraged, as it affects throughput and realtime processing.
> > >
> > > {quote}
> > > It seems that the Samza design suits "data transformation" scenarios
> > > very well. What is not clear is how well it can support external
> > > services.
> > > {quote}
> > > We have some similar use-cases at LinkedIn where the Samza jobs need to
> > > query external data sources. We do use a pattern where the job
> > > bootstraps the data from the source using a change-capture system like
> > > databus and buffers it locally, before processing from input streams.
> > > Depending on the scale of your data, this model may or may not work for
> > > you. However, there is no in-built support for this in Samza.
> > >
> > > Thanks!
> > > Navina
> > >
> > > On Sun, Sep 20, 2015 at 7:55 PM, Yan Fang <yanfang...@gmail.com> wrote:
> > >
> > > > Hi Michael,
> > > >
> > > > Samza is designed for high-throughput and realtime processing. If you
> > > > are using an HTTP request/external service, you may not get the same
> > > > performance as without it. However, technically speaking, there is
> > > > nothing stopping you from doing this (well, it is discouraged
> > > > anyway :). Samza by default does not provide this feature, so you may
> > > > want to be a little cautious when implementing this.
> > > >
> > > > Thanks,
> > > >
> > > > Fang, Yan
> > > > yanfang...@gmail.com
> > > >
> > > > On Sun, Sep 20, 2015 at 4:28 PM, Michael Sklyar <mikesk...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > What would be the best approach for doing "blocking" operations in
> > > > > Samza?
> > > > >
> > > > > For example, we have a Kafka stream of URLs for which we need to
> > > > > gather external data via HTTP (such as the Alexa rank, the page
> > > > > title and headers). Other scenarios include database access and
> > > > > decision making via a rule engine.
> > > > >
> > > > > Samza processes messages in a single thread, and HTTP requests might
> > > > > take hundreds of milliseconds. With the single-threaded design the
> > > > > throughput would be very limited, which can be solved with an
> > > > > asynchronous approach. However, the Samza documentation explicitly
> > > > > states "*You are strongly discouraged from using threads in your
> > > > > job’s code*".
> > > > >
> > > > > It seems that the Samza design suits "data transformation" scenarios
> > > > > very well. What is not clear is how well it can support external
> > > > > services.
> > > > >
> > > > > Thanks,
> > > > > Michael Sklyar
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Navina R.
> > >
> >
>
>
>
> --
> Navina R.
>
