Hi Hangjun,

I've explored deploying Kafka on YARN, and YARN currently does not support long-running services with locality constraints. Deploying Kafka producers/consumers (not brokers) is supported in the Apache Incubator Samza project. Background on the YARN limitations can be found in YARN-371, YARN-1040, YARN-1404, YARN-1412 and YARN-2027. Support for long-running services within YARN will likely change with the work that Carlo Curino and team are doing (Rayon), which is described in YARN-1051. Background and technical details are in that JIRA.
Thanks,
Kam

On Tuesday, May 20, 2014 10:40 PM, Hangjun Ye <yehang...@gmail.com> wrote:

Hi Steve,

Yes, what I want is that Kafka doesn't have to care about machines physically (as an option).

Best,
Hangjun

2014-05-21 11:46 GMT+08:00 Steve Morin <st...@stevemorin.com>:

> Hangjun,
> Would having Kafka in YARN be a big architectural change from where
> it is now? From what I have seen, in most typical setups you want
> machines optimized for Kafka, not just Kafka on top of HDFS.
> -Steve
>
> On Tue, May 20, 2014 at 8:37 PM, Hangjun Ye <yehang...@gmail.com> wrote:
>
> > Thanks Jun and Francois.
> >
> > We used Kafka 0.8.0 previously. We got some weird errors when
> > expanding the cluster and the expansion couldn't finish.
> > Now we use 0.8.1.1; I will try cluster expansion again sometime.
> >
> > I read the discussion on that JIRA issue and I agree with the points
> > raised there.
> > HDFS has also improved a lot since then and many issues have been
> > resolved (e.g. SPOF).
> >
> > We have a team that builds and provides the storage/computing
> > platform for our company, and we already provide a Hadoop cluster.
> > If Kafka had an option to store data on HDFS, we would just need to
> > allocate some space quota for it on our cluster (and increase it on
> > demand), and it might reduce our operational cost a lot.
> >
> > Another (and maybe more aggressive) thought is about deployment. Jun
> > has a good point: "HDFS only provides data redundancy, but not
> > computational redundancy". If Kafka could be deployed on YARN, it
> > could offload some computational resource management to YARN and we
> > wouldn't have to allocate machines physically. Kafka would still need
> > to take care of load balancing and partition assignment among brokers
> > by itself.
> > Many computational frameworks like Spark/Samza have such an option
> > and it's a big attraction for us.
> >
> > Best,
> > Hangjun
> >
> > 2014-05-20 21:00 GMT+08:00 François Langelier <f.langel...@gmail.com>:
> >
> > > Take a look at Camus <https://github.com/linkedin/camus/>
> > >
> > > François Langelier
> > > Software Engineering Student - École de Technologie
> > > Supérieure <http://www.etsmtl.ca/>
> > > Captain, Club Capra <http://capra.etsmtl.ca/>
> > > VP-Communication - CS Games <http://csgames.org> 2014
> > > Jeux de Génie <http://www.jdgets.com/> 2011 to 2014
> > > Treasurer, Fraternité du Piranha <http://fraternitedupiranha.com/>
> > > 2012-2014
> > > Organizing Committee, Olympiades ÉTS 2012
> > > Compétition Québécoise d'Ingénierie 2012 - Senior Competition
> > >
> > > On 19 May 2014 05:28, Hangjun Ye <yehang...@gmail.com> wrote:
> > >
> > > > Hi there,
> > > >
> > > > I recently started to use Kafka for our data analysis pipeline
> > > > and it works very well.
> > > >
> > > > One problem for us so far is expanding our cluster when we need
> > > > more storage space.
> > > > Kafka provides some scripts to help with this, but the process
> > > > wasn't smooth.
> > > >
> > > > To make it work perfectly, it seems Kafka needs to do some jobs
> > > > that a distributed file system has already done.
> > > > So I'm wondering if there are any thoughts on making Kafka work
> > > > on top of HDFS? Maybe make the Kafka storage engine pluggable,
> > > > with HDFS as one option?
> > > >
> > > > The pros might be that HDFS already handles storage management
> > > > (replication, corrupted disks/machines, migration, load
> > > > balancing, etc.) very well, and it frees Kafka and its users
> > > > from that burden. The cons might be performance degradation.
> > > > As Kafka does very well on performance, even with some degree
> > > > of degradation it would possibly still be competitive in most
> > > > situations.
> > > >
> > > > Best,
> > > > --
> > > > Hangjun Ye
> > > >
> > >
> >
> > --
> > Hangjun Ye
> >
>

--
Hangjun Ye