Apart from your data locality problem, it sounds like what you want is a work queue. Kafka's consumer structure doesn't lend itself well to that use case: a single partition of a topic should only have one consumer instance per logical subscriber of the topic, and that consumer can only mark jobs as completed in strict offset order (while maintaining an at-least-once processing guarantee). This is not to say it cannot be done, but I believe your work queue would end up working a bit strangely if built on Kafka.
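To make the ordering constraint concrete, here is a minimal sketch using the Java consumer API (broker address, topic name and group id are placeholders, not anything from your setup). Committing acknowledges every offset up to the last record polled, so a single slow or failed job blocks its whole partition rather than being skipped and retried independently:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class JobConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "document-workers");   // one logical subscriber
            props.put("enable.auto.commit", "false");    // commit manually, below
            props.put("key.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("documents"));
                while (true) {
                    ConsumerRecords<String, String> records =
                            consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        process(record.value());  // must succeed before moving on;
                                                  // a failure here blocks the partition
                    }
                    // Committing acknowledges everything up to the last polled offset,
                    // so jobs cannot be marked done individually or out of order.
                    consumer.commitSync();
                }
            }
        }

        private static void process(String documentId) {
            // keyword extraction, language detection, ... would go here
        }
    }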
Christian

On 10/09/2014 06:13 AM, William Briggs wrote:
> Manually managing data locality will become difficult to scale. Kafka is
> one potential tool you can use to help scale, but by itself, it will not
> solve your problem. If you need the data in near-real time, you could use
> a technology like Spark or Storm to stream data from Kafka and perform
> your processing. If you can batch the data, you might be better off
> pulling it into a distributed filesystem like HDFS, and using MapReduce,
> Spark or another scalable processing framework to handle your
> transformations. Once you've paid the initial price for moving the
> document into HDFS, your network traffic should be fairly manageable;
> most clusters built for this purpose will schedule work to run local to
> the data, and typically have separate, high-speed network interfaces and
> a dedicated switch in order to optimize intra-cluster communications when
> moving data is unavoidable.
>
> -Will
>
> On Thu, Oct 9, 2014 at 7:57 AM, Albert Vila <albert.v...@augure.com> wrote:
>
>> Hi
>>
>> I just came across Kafka when I was trying to find solutions to scale
>> our current architecture.
>>
>> We are currently downloading and processing 6M documents per day from
>> online and social media. We have a different workflow for each type of
>> document, but some of the steps are keyword extraction, language
>> detection, clustering, classification, indexation, ... We are using
>> Gearman to dispatch the jobs to workers, and we have some queues on a
>> database.
>>
>> I'm wondering whether we could integrate Kafka into the current workflow
>> and whether it's feasible. One of our main discussions is whether to go
>> to a fully distributed architecture or a semi-distributed one. I mean,
>> distribute everything, or process some steps on the same machine
>> (crawling, keyword extraction, language detection, indexation). We don't
>> know which one scales better; each has pros and cons.
>>
>> Right now we have a semi-distributed one, as we had network problems
>> given the amount of data we were moving around. So now, all documents
>> crawled on server X are later dispatched through Gearman to that same
>> server. What we dispatch through Gearman is only the document id, and
>> the document data remains on the crawling server in Memcached, so the
>> network traffic is kept to a minimum.
>>
>> What do you think?
>> Is it feasible to remove all the database queues and Gearman and move to
>> Kafka? As Kafka is mainly message-based, I think we would have to move
>> the messages around; should we take the network into account? Would we
>> face the same problems?
>> If so, is there a way to isolate some steps so they are processed on the
>> same machine, to avoid network traffic?
>>
>> Any help or comments would be appreciated. And if someone has faced a
>> similar problem and has knowledge of the architecture approach, that
>> would be more than welcome.
>>
>> Thanks
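If you do experiment with replacing Gearman, the closest Kafka equivalent of the id-only dispatch Albert describes would be to publish just the document id, keyed by the crawling host so that all ids from one server hash to the same partition, while the document body stays in Memcached on that server. A rough sketch with the Java producer API (broker address, topic name, host and id below are placeholders, not anything from the existing setup):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class CrawlDispatcher {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Key by the crawling host so all ids from one server go to the
                // same partition; only the small id crosses the network, the
                // document body stays in Memcached on the crawling server.
                String crawlHost = "crawler-03.example.com";  // hypothetical host
                String documentId = "doc-123456";             // hypothetical id
                producer.send(new ProducerRecord<>("documents", crawlHost, documentId));
            }
        }
    }

Note that which physical partition a key maps to is decided by hashing, so consuming "close to the data" would still need either manual partition assignment or a custom partitioner on the producer side.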