Hey Mingjie,

Here's how we have our mirror makers configured. For some context, let me describe this using the example datacenter layout from:

https://engineering.linkedin.com/samza/operating-apache-samza-scale

In that example, there are four datacenters (A, B, C, and D). However, we only need datacenters A and B to describe this.

Datacenter A mirrors data from local(A) to aggregate(A) as well as local(B) to aggregate(A).

Datacenter B mirrors data from local(B) to aggregate(B) as well as local(A) to aggregate(B).

The diagram in the article should make this easy to visualize. Note that the mirror makers run in the destination datacenter and pull the traffic in.

Let's say we have two physical machines in each datacenter dedicated to running mirror makers (let's call them servers 1 and 2 in datacenter A, and servers 3 and 4 in datacenter B). This is what the layout of mirror maker processes would look like:

* Datacenter A MirrorMaker Cluster
    * Server 1
        * local(A) to aggregate(A) MirrorMaker Instance
        * local(B) to aggregate(A) MirrorMaker Instance
    * Server 2
        * local(A) to aggregate(A) MirrorMaker Instance
        * local(B) to aggregate(A) MirrorMaker Instance

* Datacenter B MirrorMaker Cluster
    * Server 3
        * local(B) to aggregate(B) MirrorMaker Instance
        * local(A) to aggregate(B) MirrorMaker Instance
    * Server 4
        * local(B) to aggregate(B) MirrorMaker Instance
        * local(A) to aggregate(B) MirrorMaker Instance

The benefit of this layout is that if the load becomes too high, we can add another server to each cluster that looks exactly like the others in the cluster (easy to provision). If you get really huge, you can start creating multiple mirror maker clusters that each handle a specific flow (but still have homogeneous processes within each cluster).

Of course, YMMV, but this is what works well for us. :)

-Jon

On Jan 28, 2015, at 3:54 PM, Daniel Compton <daniel.compton.li...@gmail.com> wrote:

> Hi Mingjie
>
> I would recommend the first option of running one mirrormaker instance
> pulling from multiple DC's.
>
> A single MM instance will be able to make more efficient use of the machine
> resources in two ways:
> 1. You will only have to run one process which will be able to be allocated
> the full amount of resources
> 2. Within the process, if you run enough consumer threads, I think that
> they should be able to rebalance and pick up the load if they don't have
> anything to do. I'm not 100% sure on this, but 1 still holds.
>
> A single MM instance should handle connectivity issues with one DC without
> affecting the rest of the consumer threads for other DC's.
>
> You would gain process isolation running a MM per DC, but this would raise
> the operational burden and resource requirements. I'm not sure what benefit
> you'd actually get from process isolation, so I'd recommend against it.
> However I'd be interested to hear if others do things differently.
>
> Daniel.
>
> On Thu Jan 29 2015 at 11:14:29 AM Mingjie Lai <m...@apache.org> wrote:
>
>> Hi.
>>
>> We have a pretty typical data ingestion use case that we use mirrormaker at
>> one hadoop data center, to mirror kafka data from multiple remote
>> application data centers. I know mirrormaker can support to consume kafka
>> data from multiple kafka source, by one instance at one physical node. By
>> this, we can give one instance of mm multiple consumer config files, so it
>> can consume data from muti places.
>>
>> Another option is to have multiple mirrormaker instances at one node, each
>> mm instance is dedicated to grab data from one single source data center.
>> Certainly there will be multiple mm nodes to balance the load.
>>
>> The second option looks better since it kind of has an isolation for
>> different data centers.
>>
>> Any recommendation for this kind of data aggregation cases?
>>
>> Still new to kafka and mirrormaker. Welcome any information.
>>
>> Thanks,
>> Mingjie
>>
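
For anyone who wants to play with the layout from the reply at the top of this thread, here is a minimal Python sketch of it. It only models which MirrorMaker instances would run on which server and does not start or configure anything. The function name `mirror_maker_layout`, the dictionary shapes, and the flow label strings are all hypothetical and chosen for illustration; none of them come from Kafka or MirrorMaker.

```python
def mirror_maker_layout(datacenters, servers_per_dc):
    """Build a {datacenter: {server: [flows]}} mapping (hypothetical helper).

    Mirror makers run in the destination datacenter and pull traffic in,
    so each destination cluster carries one flow per source datacenter.
    """
    layout = {}
    for dest in datacenters:
        # Local flow first, then one flow per remote datacenter.
        sources = [dest] + [dc for dc in datacenters if dc != dest]
        flows = [f"local({src}) to aggregate({dest})" for src in sources]
        # Every server in a cluster runs the same set of instances,
        # which is what keeps the cluster homogeneous.
        layout[dest] = {server: list(flows) for server in servers_per_dc[dest]}
    return layout


if __name__ == "__main__":
    # Two datacenters with two dedicated servers each, as in the example.
    layout = mirror_maker_layout(
        ["A", "B"],
        {"A": ["Server 1", "Server 2"], "B": ["Server 3", "Server 4"]},
    )
    for dc, servers in layout.items():
        print(f"Datacenter {dc} MirrorMaker Cluster")
        for server, flows in servers.items():
            print(f"  {server}")
            for flow in flows:
                print(f"    {flow} MirrorMaker Instance")
```

In this model, scaling a cluster amounts to appending another server name to the list for that datacenter: the new server gets exactly the same flows as the existing ones, which is the easy-to-provision property described in the reply.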