Scary but thanks! :) We'll start digging into the network and see if we can find a smoking gun. Appreciate the response, thanks again.
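For anyone searching the archives later, the always-on capture Steve describes below might look something like this (a rough sketch, not tested against our setup; the interface name and file paths are placeholders, 9092 is the broker port from our configs, and it assumes a Wireshark/tshark build that includes the Kafka dissector):

    # capture broker traffic continuously, rotating the dump file every hour
    tcpdump -i eth0 -w '/captures/kafka-%Y%m%d%H%M.pcap' -G 3600 port 9092

    # later: decode a capture as Kafka and page through the decoded requests/responses
    tshark -r /captures/kafka-201509111100.pcap -d tcp.port==9092,kafka -O kafka | less

The -G rotation keeps individual files small enough that, once MirrorMaker wedges, we'd only have to dig through the window around the failure rather than the whole capture.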
Craig J. Swift
Software Engineer - Data Pipeline
ReturnPath Inc.
Work: 303-999-3220  Cell: 720-560-7038

On Fri, Sep 11, 2015 at 11:29 AM, Steve Miller <st...@idrathernotsay.com> wrote:
> I have a vague feeling that I've seen stuff like this when the network
> on the broker that's disappearing is actually unreachable from time to
> time -- though I'd like to believe that's not such an issue when talking
> to AWS (though there could be a lot of screwed-up Internet between you
> and it, depending on exactly what you're doing).
>
> One thing you could consider doing would be to look at some of the
> traffic. wireshark/tshark knows how to decode Kafka transactions, though
> digging through the output is... exciting, because there can be so very
> much of it. I'd written up some notes on how to do that, which you can
> see at:
>
> http://mail-archives.apache.org/mod_mbox/kafka-users/201408.mbox/%3c20140812180358.ga24...@idrathernotsay.com%3E
>
> (though I expect in your case, you'd need to be doing a capture all the
> time, then figure out when MirrorMaker stops fetching from that broker,
> and then stare at a ton of data to find whatever protocol thing has
> happened -- all of which is ugly enough that I'd been reluctant to
> mention it until now).
>
> -Steve
>
> On Fri, Sep 11, 2015 at 08:52:21AM -0600, Craig Swift wrote:
> > Just wanted to bump this again and see if the community had any
> > thoughts, or if we're just missing something stupid. For added
> > context, the topic we're reading from has 24 partitions and we see
> > roughly 15k messages per minute. As I mentioned before, the
> > throughput seems fine, but I'm not entirely sure how MirrorMaker
> > cycles through its topics/partitions and why it would read very
> > slowly from, or bypass reading from, certain partitions.
> >
> > Craig J. Swift
> > Software Engineer - Data Pipeline
> > ReturnPath Inc.
> > Work: 303-999-3220  Cell: 720-560-7038
> >
> > On Wed, Sep 9, 2015 at 9:29 AM, Craig Swift <craig.sw...@returnpath.com> wrote:
> > > Hello,
> > >
> > > Hope everyone is doing well. I was hoping to get some assistance
> > > with a strange issue we're experiencing while using MirrorMaker to
> > > pull data down from an 8-node Kafka cluster in AWS into our data
> > > center. Both Kafka clusters and the mirror are running 0.8.1.1, and
> > > each cluster has its own dedicated ZooKeeper ensemble (running
> > > 3.4.5).
> > >
> > > The problem we're seeing is that the mirror starts up and begins
> > > consuming from the cluster on a specific topic. It correctly
> > > attaches to all 24 partitions for that topic -- but inevitably
> > > there is a set of partitions that either doesn't get read at all or
> > > is read at a very slow rate. Those partitions are always associated
> > > with the same brokers. For example, all partitions on broker 2
> > > won't be read, or all partitions on brokers 2 and 4 won't be read.
> > > On restarting the mirror, these 'stuck' partitions may stay the
> > > same or move; if they move, the backlog is drained very quickly. If
> > > we add more mirrors for additional capacity, the same thing
> > > happens, except that each mirror has its own set of stuck
> > > partitions. I've included the mirror's configuration below, along
> > > with samples from the logs.
> > >
> > > 1) The partition issue seems to happen when the mirror first starts
> > > up. Once in a blue moon it reads from everything normally, but on
> > > restart it can easily get back into this state.
> > >
> > > 2) We're fairly sure it isn't a processing/throughput issue. We can
> > > turn the mirror off for a while, incur a large backlog of data, and
> > > when it is re-enabled it chews through the data very quickly, minus
> > > the handful of stuck partitions.
> > >
> > > 3) We've looked at both the ZooKeeper and broker logs and there
> > > doesn't seem to be anything out of the ordinary. We see the mirror
> > > connecting, there are a few info messages about ZooKeeper nodes
> > > already existing, etc. No specific errors.
> > >
> > > 4) We've enabled debug logging on the mirror, and we've noticed
> > > that during the ZooKeeper heartbeat/updates we're missing these
> > > messages for the 'stuck' partitions:
> > >
> > > [2015-09-08 18:38:12,157] DEBUG Reading reply sessionid:0x14f956bd57d21ee,
> > > packet:: clientPath:null serverPath:null finished:false header:: 357,5
> > > replyHeader:: 357,8597251893,0 request::
> > > '/consumers/mirror-kafkablk-kafka-gold-east-to-kafkablk-den/offsets/MessageHeadersBody/5,#34303537353838,-1
> > > response::
> > > s{4295371756,8597251893,1439969185754,1441759092134,19500,0,0,0,7,0,4295371756}
> > > (org.apache.zookeeper.ClientCnxn)
> > >
> > > That is, we see this message for all the partitions that are being
> > > processed, but never for the stuck ones. There are no errors in the
> > > log prior to this, though, and once in a great while we might see a
> > > log entry for one of the stuck partitions.
> > >
> > > 5) We've checked latency/response time to ZooKeeper from both the
> > > brokers and the mirror, and it appears fine.
> > >
> > > Mirror consumer config:
> > > group.id=mirror-kafkablk-kafka-gold-east-to-kafkablk-den
> > > consumer.id=mirror-kafkablk-mirror00-den-kafka-gold-east-to-kafkablk-den
> > > zookeeper.connect=zk.strange.dev.net:2181
> > > fetch.message.max.bytes=15728640
> > > socket.receive.buffer.bytes=64000000
> > > socket.timeout.ms=60000
> > > zookeeper.connection.timeout.ms=60000
> > > zookeeper.session.timeout.ms=30000
> > > zookeeper.sync.time.ms=4000
> > > auto.offset.reset=smallest
> > > auto.commit.interval.ms=20000
> > >
> > > Mirror producer config:
> > > client.id=mirror-kafkablk-mirror00-den-kafka-gold-east-to-kafkablk-den
> > > metadata.broker.list=kafka00.lan.strange.dev.net:9092,kafka01.lan.strange.dev.net:9092,kafka02.lan.strange.dev.net:9092,kafka03.lan.strange.dev.net:9092,kafka04.lan.strange.dev.net:9092
> > > request.required.acks=1
> > > producer.type=async
> > > request.timeout.ms=20000
> > > retry.backoff.ms=1000
> > > message.send.max.retries=6
> > > serializer.class=kafka.serializer.DefaultEncoder
> > > send.buffer.bytes=134217728
> > > compression.codec=gzip
> > >
> > > Mirror startup settings:
> > > --num.streams 2 --num.producers 4
> > >
> > > Any thoughts/suggestions would be very helpful. At this point we're
> > > running out of things to try.
> > >
> > > Craig J. Swift
> > > Software Engineer
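P.S. In case anyone else hits this: since the 0.8.x high-level consumer keeps its committed offsets in ZooKeeper, the per-partition lag for the mirror's group can be dumped with the stock offset checker. A sketch, using the group, topic, and ZooKeeper host from the configs and log line above (the flag names are from memory and may differ slightly between 0.8.x releases):

    # per-partition committed offset, log-end offset, and lag for the mirror's group
    bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker \
      --zkconnect zk.strange.dev.net:2181 \
      --group mirror-kafkablk-kafka-gold-east-to-kafkablk-den \
      --topic MessageHeadersBody

Partitions whose lag keeps growing while the rest drain should line up with the 'stuck' set.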
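Partition ownership is registered in ZooKeeper too, so it's also possible to see which consumer thread owns each stuck partition. Another sketch from memory against the 0.8.x znode layout; the offsets path is the same one that appears in the DEBUG line in point 4, and the owners path is an assumption about the standard high-level-consumer layout:

    # list partitions with a registered owner; 'get' on one shows the owning thread id
    bin/zkCli.sh -server zk.strange.dev.net:2181 \
      ls /consumers/mirror-kafkablk-kafka-gold-east-to-kafkablk-den/owners/MessageHeadersBody

    # the committed offset for a single partition, e.g. partition 5
    bin/zkCli.sh -server zk.strange.dev.net:2181 \
      get /consumers/mirror-kafkablk-kafka-gold-east-to-kafkablk-den/offsets/MessageHeadersBody/5

As far as I understand the 0.8 consumer internals, fetching is done by one fetcher thread per broker, so a single wedged fetcher connection would stall every partition led by that broker -- which would match the broker-aligned pattern described above.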