Hi Gwen,

Your recommendations in the field to partition off non-cluster nodes and reserve them for Kafka brokers totally make sense given current YARN limitations. I'm familiar with the Llama hacks: effectively reserving containers with dummy processes that just sit there, and then running the 'real' processes, is a hack fest, no doubt. YARN coupling the container lifecycle with the process lifecycle was an early, basic design decision that is hard to change at this stage.

On the other hand, I do think HDFS colocation is required if the app master provides an installation option. Per Jay's point, you may want to distribute config changes and/or version upgrades to brokers via HDFS. Regarding YARN IO, YARN-1711 at least is headed this way by virtue of quotas.

I do hope YARN can eventually manage long-running services effectively. I think it's no coincidence that as YARN evolves, the difference between YARN and a cluster manager like Ambari shrinks.

On Wednesday, July 23, 2014 6:41 PM, Gwen Shapira <gshap...@cloudera.com> wrote:

Hi,

Can we discuss for a moment the use case of Kafka-on-YARN?

I (as a Cloudera field engineer) typically advise my customers to install Kafka on its own nodes, to allow Kafka uninterrupted access to disks. Hadoop processes tend to be a bit IO heavy. Also, I can't see any benefit from co-locating Kafka and HDFS. Since YARN does not manage IO yet, running Kafka on a Hadoop cluster with YARN won't solve this problem in the near future.

The other problem is that we typically want brokers to be long running, and YARN is poorly designed for that (see our hacks for Llama as an example).

And yet another problem: for resource management to work, we need to be able to add and take away resources from a process. AFAIK, the way YARN re-allocates memory for Java processes is to kill them (since there's no good way to force Java to give memory back to the OS). I doubt we want to do that for Kafka.

I'd love to hear from those interested in Kafka+YARN what they expect to gain from the combination.

Gwen

On Wed, Jul 23, 2014 at 2:37 PM, hsy...@gmail.com <hsy...@gmail.com> wrote:
> Hi guys,
>
> Kafka is getting more and more popular and in most cases people run kafka
> as long-term service in the cluster. Is there a discussion of running kafka
> on yarn cluster which we can utilize the convenient configuration/resource
> management and HA. I think there is a big potential and requirement for
> that.
> I found a project https://github.com/kkasravi/kafka-yarn. But is there a
> official roadmap/plan for this?
>
> Thank you very much!
>
> Best,
> Siyuan