Hi

I'm evaluating Spark Streaming to see if it's a fit for scaling our current
architecture.

We are currently downloading and processing 6M documents per day from
online and social media. We have a different workflow for each type of
document, but some of the steps are keyword extraction, language detection,
clustering, classification, indexing, etc. We are using Gearman to
dispatch the jobs to workers, and we have some queues in a database.
Everything is in near real time.
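
To make the question concrete, here is roughly how I imagine that kind of
per-document chain could look as Spark Streaming transformations. Everything
in it is illustrative: the Doc case class, the stub step functions and the
socket source are placeholders, not our actual pipeline.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DocPipelineSketch {
  case class Doc(id: String, body: String,
                 lang: Option[String] = None, keywords: Seq[String] = Nil)

  // Hypothetical stand-ins for the existing worker steps.
  def fetchBody(id: String): Doc   = Doc(id, body = s"body of $id")
  def detectLanguage(d: Doc): Doc  = d.copy(lang = Some("en"))
  def extractKeywords(d: Doc): Doc = d.copy(keywords = d.body.split("\\s+").take(10))
  def index(d: Doc): Unit          = println(s"indexed ${d.id}")

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("doc-pipeline-sketch")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // Illustrative source: one document id per line on a socket.
    val ids = ssc.socketTextStream("localhost", 9999)

    // The chain of Gearman jobs becomes ordinary transformations on the stream.
    ids.map(fetchBody)
       .map(detectLanguage)
       .map(extractKeywords)
       .foreachRDD(rdd => rdd.foreach(index))

    ssc.start()
    ssc.awaitTermination()
  }
}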

I'm wondering whether we could integrate Spark Streaming into the current
workflow and if it's feasible. One of our main discussions is whether we
have to go to a fully distributed architecture or a semi-distributed one,
i.e. distribute everything or process some steps on the same machine
(crawling, keyword extraction, language detection, indexing). We don't know
which one scales better; each has its pros and cons.

Right now we have a semi-distributed setup, because we ran into network
problems given the amount of data we were moving around. So currently, all
documents crawled on server X are later dispatched through Gearman back to
the same server. What we dispatch through Gearman is only the document id;
the document data stays on the crawling server in Memcached, so network
traffic is kept to a minimum.
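
To make that pattern concrete, this is roughly what I have in mind in Spark
Streaming terms: only document ids travel through the stream, and the body is
fetched from Memcached inside mapPartitions, one client per partition. The
spymemcached client and the "doc:" key scheme are just assumptions, and as
far as I understand Spark does not by itself guarantee that a partition ends
up on the server that crawled those documents.

import java.net.InetSocketAddress
import net.spy.memcached.MemcachedClient
import org.apache.spark.streaming.dstream.DStream

def hydrate(ids: DStream[String]): DStream[(String, String)] =
  ids.mapPartitions { part =>
    // One Memcached connection per partition, closed once the ids are processed.
    val client = new MemcachedClient(new InetSocketAddress("memcached-host", 11211))
    val docs = part.map { id =>
      val body = Option(client.get(s"doc:$id")).map(_.toString).getOrElse("")
      (id, body)
    }.toList            // materialise before shutting the connection down
    client.shutdown()
    docs.iterator
  }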

Is it feasible to remove all the database queues and Gearman and move to
Spark Streaming? We are also evaluating adding Kafka to the system.
Is anyone using Spark Streaming for a system like ours?
Should we worry about network traffic, or is that something Spark can
manage without problems? Every document is around 50 KB (roughly 300 GB a
day).
If we wanted to isolate some steps so they are processed on the same
machine(s) (or give them priority), is that something we could do with
Spark?
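
In case it helps to see what we have in mind for the Kafka side, here is a
minimal sketch of ingesting document ids from a Kafka topic with the direct
stream API; the topic name, broker list and id-only messages are assumptions
on my part:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("kafka-ingest-sketch")
val ssc  = new StreamingContext(conf, Seconds(10))

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val topics      = Set("documents")

// Each message is assumed to carry just a document id (tens of bytes),
// so Kafka and Spark only move ids around, not the ~50 KB bodies.
val ids = KafkaUtils
  .createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
  .map(_._2)

ids.print()   // placeholder output operation; the real workflow would go here

ssc.start()
ssc.awaitTermination()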

Any help or comments would be appreciated. And if someone has faced a
similar problem and has insight into the architecture approach, that would
be more than welcome.

Thanks
