Re: One or multiple instances of MM to aggregate kafka data to one hadoop

Mingjie Lai Fri, 30 Jan 2015 16:13:15 -0800

Really appreciate the recommendations, guys.

On Thu, Jan 29, 2015 at 9:22 AM, Jon Bringhurst <
jbringhu...@linkedin.com.invalid> wrote:


> Hey Mingjie,
>
> Here's how we have our mirror makers configured. For some context, let me
> try to describe this using the example datacenter layout as described in:
>
> https://engineering.linkedin.com/samza/operating-apache-samza-scale
>
> In that example, there are four data centers (A, B, C, and D). However, we
> only need Datacenter A and B to describe this.
>
> Datacenter A mirrors data from local(A) to aggregate(A) as well as
> local(B) to aggregate(A).
>
> Datacenter B mirrors data from local(B) to aggregate(B) as well as
> local(A) to aggregate(B).
>
> The diagram in the article should make this easy to visualize. Note that
> the mirror makers are running in the destination datacenter and pull the
> traffic in.
>
> Let's say we have two physical machines in each datacenter dedicated to
> running mirror makers (let's call them servers 1 and 2 in datacenter A,
> and servers 3 and 4 in datacenter B). This is what the layout of mirror
> maker processes would look like:
>
> * Datacenter A MirrorMaker Cluster
>     * Server 1
>         * local(A) to aggregate(A) MirrorMaker Instance
>         * local(B) to aggregate(A) MirrorMaker Instance
>     * Server 2
>         * local(A) to aggregate(A) MirrorMaker Instance
>         * local(B) to aggregate(A) MirrorMaker Instance
>
> * Datacenter B MirrorMaker Cluster
>     * Server 3
>         * local(B) to aggregate(B) MirrorMaker Instance
>         * local(A) to aggregate(B) MirrorMaker Instance
>     * Server 4
>         * local(B) to aggregate(B) MirrorMaker Instance
>         * local(A) to aggregate(B) MirrorMaker Instance
>
> The benefit of this layout is that if the load becomes too high, we would
> then add on another server to each cluster that looks exactly like the
> others in the cluster (easy to provision). If you get really huge, you can
> start creating multiple mirror maker clusters that each handle a specific
> flow (but still have homogeneous processes within each cluster).
>
> Of course, YMMV, but this is what works well for us. :)
>
> -Jon
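
The layout Jon describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `mirror_maker_layout`, the dictionary shape, and the server names are made up for the example and are not taken from any real tooling. It shows the key property of the layout, namely that every server in a destination cluster runs the same set of instances, one per source datacenter.

```python
def mirror_maker_layout(servers_by_dc):
    """Return {datacenter: {server: [flow, ...]}} for the layout in this thread.

    servers_by_dc maps a datacenter name to the servers in it that are
    dedicated to mirror makers (hypothetical names, for illustration only).
    """
    datacenters = list(servers_by_dc)
    layout = {}
    for dest in datacenters:
        # Mirror makers run in the destination datacenter and pull data in.
        # The local flow is listed first, then flows from the other datacenters.
        sources = [dest] + [dc for dc in datacenters if dc != dest]
        flows = [f"local({src}) to aggregate({dest})" for src in sources]
        # Every server in the cluster runs an identical set of instances.
        layout[dest] = {server: list(flows) for server in servers_by_dc[dest]}
    return layout


if __name__ == "__main__":
    layout = mirror_maker_layout({"A": ["Server 1", "Server 2"],
                                  "B": ["Server 3", "Server 4"]})
    for dc, servers in layout.items():
        print(f"* Datacenter {dc} MirrorMaker Cluster")
        for server, flows in servers.items():
            print(f"    * {server}")
            for flow in flows:
                print(f"        * {flow} MirrorMaker Instance")
```

Run as a script, this prints an outline with the same clusters, servers, and flows as the list above. Adding a "Server 5" to datacenter A in the input gives it the same two instances as servers 1 and 2, which is the "easy to provision" property.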
>
> On Jan 28, 2015, at 3:54 PM, Daniel Compton <
> daniel.compton.li...@gmail.com> wrote:
>
> > Hi Mingjie
> >
> > I would recommend the first option of running one mirrormaker instance
> > pulling from multiple DCs.
> >
> > A single MM instance will be able to make more efficient use of the
> > machine's resources in two ways:
> > 1. You will only have to run one process, which can be allocated the
> > full amount of resources.
> > 2. Within the process, if you run enough consumer threads, I think that
> > they should be able to rebalance and pick up the load if they don't have
> > anything to do. I'm not 100% sure on this, but 1 still holds.
> >
> > A single MM instance should handle connectivity issues with one DC
> > without affecting the rest of the consumer threads for other DCs.
> >
> > You would gain process isolation running a MM per DC, but this would
> > raise the operational burden and resource requirements. I'm not sure what
> > benefit you'd actually get from process isolation, so I'd recommend
> > against it. However, I'd be interested to hear if others do things
> > differently.
> >
> > Daniel.
> >
> > On Thu Jan 29 2015 at 11:14:29 AM Mingjie Lai <m...@apache.org> wrote:
> >
> >> Hi.
> >>
> >> We have a pretty typical data ingestion use case: we use mirrormaker at
> >> one hadoop data center to mirror kafka data from multiple remote
> >> application data centers. I know one mirrormaker instance on one
> >> physical node can consume kafka data from multiple kafka sources. With
> >> this option, we give one instance of mm multiple consumer config files,
> >> so it can consume data from multiple places.
> >>
> >> Another option is to have multiple mirrormaker instances on one node,
> >> with each mm instance dedicated to grabbing data from one single source
> >> data center. Certainly there will be multiple mm nodes to balance the
> >> load.
> >>
> >> The second option looks better since it provides some isolation between
> >> different data centers.
> >>
> >> Any recommendations for this kind of data aggregation case?
> >>
> >> Still new to kafka and mirrormaker. Any information is welcome.
> >>
> >> Thanks,
> >> Mingjie
> >>
>
>
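
To make the two options from the original question concrete, here is a minimal sketch that builds the command lines for each. It assumes the `kafka.tools.MirrorMaker` command-line tool from the 0.8 releases, which can be given `--consumer.config` more than once; the properties file names and the helper names (`mirror_maker_command`, `option_one`, `option_two`) are hypothetical and only for illustration.

```python
def mirror_maker_command(consumer_configs, producer_config, whitelist=".*"):
    """Build the argument list for one MirrorMaker process.

    Assumes the 0.8-era kafka.tools.MirrorMaker tool; all file paths passed
    in are hypothetical examples.
    """
    cmd = ["bin/kafka-run-class.sh", "kafka.tools.MirrorMaker"]
    for path in consumer_configs:
        # One --consumer.config per source cluster this process reads from.
        cmd += ["--consumer.config", path]
    cmd += ["--producer.config", producer_config, "--whitelist", whitelist]
    return cmd


def option_one(consumer_configs, producer_config):
    """Option 1: a single process that consumes from every source cluster."""
    return [mirror_maker_command(consumer_configs, producer_config)]


def option_two(consumer_configs, producer_config):
    """Option 2: one process per source cluster, all on the same node."""
    return [mirror_maker_command([path], producer_config)
            for path in consumer_configs]


if __name__ == "__main__":
    sources = ["dc1-consumer.properties", "dc2-consumer.properties"]
    target = "hadoop-dc-producer.properties"
    for name, commands in (("option 1", option_one(sources, target)),
                           ("option 2", option_two(sources, target))):
        print(name)
        for cmd in commands:
            print("  " + " ".join(cmd))
```

Either way the same command set is repeated on each mm node; the options differ only in how many processes each node runs and how many source clusters each process reads from.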
