Scary but thanks! :) We'll start digging into the network and see if we can
find a smoking gun. Appreciate the response, thanks again.

Craig J. Swift
Software Engineer - Data Pipeline
ReturnPath Inc.
Work: 303-999-3220 Cell: 720-560-7038

On Fri, Sep 11, 2015 at 11:29 AM, Steve Miller <st...@idrathernotsay.com>
wrote:

>    I have a vague feeling that I've seen stuff like this when the network
> on the broker that's disappearing is actually unreachable from time to time
> -- though I'd like to believe that's not such an issue when talking to AWS
> (though there could be a lot of screwed-up Internet between you and it,
> depending on exactly what you're doing).
>
>    One thing you could consider doing would be to look at some of the
> traffic.  wireshark/tshark knows how to decode Kafka transactions, though
> digging through the output is... exciting, because there can be so very
> much of it.  I'd written up some notes on how to do that, which you can see
> at:
>
>
> http://mail-archives.apache.org/mod_mbox/kafka-users/201408.mbox/%3c20140812180358.ga24...@idrathernotsay.com%3E
>
> (though I expect in your case, you'd need to be doing a capture all the
> time, and then figure out when MirrorMaker stops fetching from that broker,
> and then stare at a ton of data to find whatever protocol thing has
> happened -- all of which is ugly enough that I'd been reluctant to mention
> it until now).
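>
>    (If you do go that route, what I had in mind is roughly the following.
> Treat it as a sketch: the interface name, the 9092 port and the file sizes
> are just placeholders to adjust for your setup, and it assumes your tshark
> build has the Kafka dissector:
>
>     # keep a rolling ring buffer so the capture can run all the time;
>     # tcpdump numbers the rotated files, e.g. mmcap.pcap0, mmcap.pcap1, ...
>     tcpdump -i eth0 -s 0 -w mmcap.pcap -C 100 -W 10 'tcp port 9092'
>
>     # later, decode the Kafka protocol out of whichever file covers the
>     # window where MirrorMaker stopped fetching
>     tshark -r mmcap.pcap0 -d tcp.port==9092,kafka -O kafka
>
> That at least keeps the capture bounded while you wait for the problem to
> show up.)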
>
>         -Steve
>
> On Fri, Sep 11, 2015 at 08:52:21AM -0600, Craig Swift wrote:
> > Just wanted to bump this again and see if the community had any thoughts
> > or if we're just missing something stupid. For added context, the topic
> > we're reading from has 24 partitions and we see roughly 15k messages per
> > minute. As I mentioned before, the throughput seems fine, but I'm not
> > entirely sure how MirrorMaker cycles through its topics/partitions or why
> > it would read very slowly from, or bypass, certain partitions.
> >
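> > If it helps narrow things down, a per-partition lag dump along these
> > lines should show exactly which partitions are behind and which consumer
> > thread owns them. This is just a sketch using the stock offset checker --
> > the zookeeper option name differs across 0.8.x builds (--zookeeper on
> > newer ones, --zkconnect on older ones; the tool's usage output will say),
> > and the group/zk/topic values are the ones from the configs in my
> > original mail below:
> >
> >     bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker \
> >       --zookeeper zk.strange.dev.net:2181 \
> >       --group mirror-kafkablk-kafka-gold-east-to-kafkablk-den \
> >       --topic MessageHeadersBody
> >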
> > Craig J. Swift
> > Software Engineer - Data Pipeline
> > ReturnPath Inc.
> > Work: 303-999-3220 Cell: 720-560-7038
> >
> > On Wed, Sep 9, 2015 at 9:29 AM, Craig Swift <craig.sw...@returnpath.com>
> > wrote:
> >
> > > Hello,
> > >
> > > Hope everyone is doing well. I was hoping to get some assistance with a
> > > strange issue we're experiencing while using MirrorMaker to pull data
> > > down from an 8-node Kafka cluster in AWS into our data center. Both
> > > Kafka clusters and the mirror are running version 0.8.1.1, and each
> > > Kafka cluster has its own dedicated Zookeeper cluster (running 3.4.5).
> > >
> > > The problem we're seeing is that the mirror starts up and begins
> > > consuming from the cluster on a specific topic. It correctly attaches to
> > > all 24 partitions for that topic, but inevitably there is a set of
> > > partitions that either don't get read at all or are read at a very slow
> > > rate. Those partitions are always associated with the same brokers. For
> > > example, all partitions on broker 2 won't be read, or all partitions on
> > > brokers 2 and 4 won't be read. On restarting the mirror, these 'stuck'
> > > partitions may stay the same or move; if they move, the backlog is
> > > drained very quickly. If we add more mirrors for additional capacity the
> > > same situation happens, except that each mirror has its own set of stuck
> > > partitions. I've included the mirror's configuration below along with
> > > samples from the logs.
> > >
> > > 1) The partition issue seems to happen when the mirror first starts up.
> > > Once in a blue moon it reads from everything normally, but on restart it
> > > can easily get back into this state.
> > >
> > > 2) We're fairly sure it isn't a processing/throughput issue. We can turn
> > > the mirror off for a while, incur a large backlog of data, and when it
> > > is enabled it chews through the data very quickly, minus the handful of
> > > stuck partitions.
> > >
> > > 3) We've looked at both the zookeeper and broker logs and there doesn't
> > > seem to be anything out of the ordinary. We see the mirror connecting,
> > > there are a few info messages about zookeeper nodes already existing,
> > > etc. No specific errors.
> > >
> > > 4) We've enabled debugging on the mirror and we've noticed that during
> > > the zk heartbeat/updates we're missing these messages for the 'stuck'
> > > partitions:
> > >
> > > [2015-09-08 18:38:12,157] DEBUG Reading reply sessionid:0x14f956bd57d21ee,
> > >   packet:: clientPath:null serverPath:null finished:false header:: 357,5
> > >   replyHeader:: 357,8597251893,0
> > >   request:: '/consumers/mirror-kafkablk-kafka-gold-east-to-kafkablk-den/offsets/MessageHeadersBody/5,#34303537353838,-1
> > >   response:: s{4295371756,8597251893,1439969185754,1441759092134,19500,0,0,0,7,0,4295371756}
> > >   (org.apache.zookeeper.ClientCnxn)
> > >
> > > i.e. we see this message for all the processing partitions, but never
> > > for the stuck ones. There are no errors in the log prior to this,
> > > though, and once in a great while we might see a log entry for one of
> > > the stuck partitions. (A quick way to check the same thing straight from
> > > ZooKeeper is sketched after point 5 below.)
> > >
> > > 5) We've checked latency/response time to zookeeper from both the
> > > brokers and the mirror, and it appears fine.
> > >
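> > > As a sanity check on point 4, the same state can be read straight out
> > > of ZooKeeper with the stock CLI. The paths below follow the standard 0.8
> > > high-level-consumer layout, and partition 5 is just the example from the
> > > debug log above:
> > >
> > >     # which consumer thread currently owns the partition
> > >     bin/zkCli.sh -server zk.strange.dev.net:2181 \
> > >       get /consumers/mirror-kafkablk-kafka-gold-east-to-kafkablk-den/owners/MessageHeadersBody/5
> > >
> > >     # last committed offset for the same partition
> > >     bin/zkCli.sh -server zk.strange.dev.net:2181 \
> > >       get /consumers/mirror-kafkablk-kafka-gold-east-to-kafkablk-den/offsets/MessageHeadersBody/5
> > >
> > > For the stuck partitions we'd expect the owner node to exist but the
> > > offset to stop advancing.
> > >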
> > > Mirror consumer config:
> > > group.id=mirror-kafkablk-kafka-gold-east-to-kafkablk-den
> > > consumer.id=mirror-kafkablk-mirror00-den-kafka-gold-east-to-kafkablk-den
> > > zookeeper.connect=zk.strange.dev.net:2181
> > > fetch.message.max.bytes=15728640
> > > socket.receive.buffer.bytes=64000000
> > > socket.timeout.ms=60000
> > > zookeeper.connection.timeout.ms=60000
> > > zookeeper.session.timeout.ms=30000
> > > zookeeper.sync.time.ms=4000
> > > auto.offset.reset=smallest
> > > auto.commit.interval.ms=20000
> > >
> > > Mirror producer config:
> > > client.id=mirror-kafkablk-mirror00-den-kafka-gold-east-to-kafkablk-den
> > > metadata.broker.list=kafka00.lan.strange.dev.net:9092,kafka01.lan.strange.dev.net:9092,kafka02.lan.strange.dev.net:9092,kafka03.lan.strange.dev.net:9092,kafka04.lan.strange.dev.net:9092
> > > request.required.acks=1
> > > producer.type=async
> > > request.timeout.ms=20000
> > > retry.backoff.ms=1000
> > > message.send.max.retries=6
> > > serializer.class=kafka.serializer.DefaultEncoder
> > > send.buffer.bytes=134217728
> > > compression.codec=gzip
> > >
> > > Mirror startup settings:
> > > --num.streams 2 --num.producers 4
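> > >
> > > For completeness, the full launch is the stock MirrorMaker invocation,
> > > roughly as below -- the properties file names are placeholders, and the
> > > whitelist is the topic from the debug log above:
> > >
> > >     bin/kafka-run-class.sh kafka.tools.MirrorMaker \
> > >       --consumer.config mirror-consumer.properties \
> > >       --producer.config mirror-producer.properties \
> > >       --whitelist 'MessageHeadersBody' \
> > >       --num.streams 2 --num.producers 4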
> > >
> > > Any thoughts/suggestions would be very helpful. At this point we're
> > > running out of things to try.
> > >
> > >
> > > Craig J. Swift
> > > Software Engineer
> > >
> > >
> > >
>