Hi Mingjie,

I would recommend the first option: running one mirrormaker instance pulling from multiple DCs.
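As a rough sketch of that first option, the invocation could look something like the following. This assumes the 0.8-era kafka.tools.MirrorMaker tool, which accepts --consumer.config more than once; the property file names and the stream count are hypothetical placeholders, not values from your setup.

```shell
# Sketch only: one MirrorMaker process consuming from two source DCs.
# Assumes the 0.8-era kafka.tools.MirrorMaker tool, run from the Kafka install directory.
# dc1-consumer.properties, dc2-consumer.properties and target-producer.properties
# are hypothetical file names; each consumer file points at one source DC.
# --num.streams sets the number of consumer threads; 4 is an arbitrary placeholder.
bin/kafka-run-class.sh kafka.tools.MirrorMaker \
  --consumer.config dc1-consumer.properties \
  --consumer.config dc2-consumer.properties \
  --producer.config target-producer.properties \
  --num.streams 4 \
  --whitelist '.*'
```

Adding another source DC would then mean adding one more --consumer.config line rather than starting another process.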
A single MM instance will make more efficient use of the machine's resources in two ways:

1. You only have to run one process, which can be allocated the full amount of resources.
2. Within the process, if you run enough consumer threads, I think they should be able to rebalance and pick up load when they would otherwise be idle. I'm not 100% sure on this, but point 1 still holds.

A single MM instance should handle connectivity issues with one DC without affecting the consumer threads for the other DCs.

You would gain process isolation by running one MM per DC, but this would raise the operational burden and resource requirements. I'm not sure what benefit you'd actually get from process isolation, so I'd recommend against it. However, I'd be interested to hear if others do things differently.

Daniel.

On Thu Jan 29 2015 at 11:14:29 AM Mingjie Lai <m...@apache.org> wrote:

> Hi.
>
> We have a pretty typical data ingestion use case: we use mirrormaker at
> one hadoop data center to mirror kafka data from multiple remote
> application data centers. I know mirrormaker can consume kafka data from
> multiple kafka sources with one instance on one physical node. With this,
> we can give one mm instance multiple consumer config files, so it can
> consume data from multiple places.
>
> Another option is to have multiple mirrormaker instances on one node, with
> each mm instance dedicated to grabbing data from a single source data center.
> Certainly there will be multiple mm nodes to balance the load.
>
> The second option looks better since it gives some isolation between
> different data centers.
>
> Any recommendations for this kind of data aggregation case?
>
> I'm still new to kafka and mirrormaker, so any information is welcome.
>
> Thanks,
> Mingjie