Hi,

What would be the best approach for doing "blocking" operations in Samza?

For example, we have a Kafka stream of URLs for which we need to gather
external data via HTTP (such as the Alexa rank, the page title, and
headers). Other scenarios include database access and decision making via
a rule engine.

Samza processes messages in a single thread, and HTTP requests might take
hundreds of milliseconds. With the single-threaded design the throughput
would be very limited (at 200 ms per request, about 5 messages per second
per task), which could be solved with an asynchronous approach.
However, the Samza documentation explicitly states
"*You are strongly discouraged from using threads in your job’s code*".

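To make the concern concrete, here is a rough sketch in plain Java. It does
not use any Samza API; the class name, the method names, and the slowLookup
helper (which just sleeps to stand in for an HTTP call) are all made up for
illustration. It only compares doing the lookups one at a time with
overlapping them on a thread pool, which is the kind of thing the
documentation seems to discourage.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class BlockingThroughputSketch {

    // Hypothetical stand-in for a slow HTTP lookup: sleeps, then returns a fake title.
    public static String slowLookup(String url, long latencyMs) {
        try {
            Thread.sleep(latencyMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
        return "title-of-" + url;
    }

    // One lookup at a time, as a single-threaded processing loop would do it.
    public static List<String> lookupSequentially(List<String> urls, long latencyMs) {
        List<String> results = new ArrayList<>();
        for (String url : urls) {
            results.add(slowLookup(url, latencyMs));
        }
        return results;
    }

    // The same lookups overlapped on a fixed-size pool; results keep input order.
    public static List<String> lookupConcurrently(List<String> urls, int threads, long latencyMs) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<CompletableFuture<String>> futures = new ArrayList<>();
            for (String url : urls) {
                futures.add(CompletableFuture.supplyAsync(() -> slowLookup(url, latencyMs), pool));
            }
            List<String> results = new ArrayList<>();
            for (CompletableFuture<String> future : futures) {
                results.add(future.join()); // wait for each lookup to finish
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            urls.add("http://example.com/" + i);
        }
        long t0 = System.nanoTime();
        lookupSequentially(urls, 200);
        long t1 = System.nanoTime();
        lookupConcurrently(urls, 10, 200);
        long t2 = System.nanoTime();
        System.out.println("sequential ms: " + (t1 - t0) / 1_000_000);
        System.out.println("concurrent ms: " + (t2 - t1) / 1_000_000);
    }
}
```

With 10 URLs at 200 ms each, I would expect the sequential version to take
around 10 x 200 ms and the pooled version to take roughly one request's
latency, but exact timings will vary by machine.
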
It seems that Samza's design suits "data transformation" scenarios very well.
What is not clear is how well it can support calls to external services.

Thanks,
Michael Sklyar
