This kind of sounds to me like there's packet loss somewhere and TCP is closing the window to try to limit congestion. But from the snippets you posted, I didn't see any SACKs in the tcpdump output. If there *are* SACKs, that'd be a strong indicator of loss somewhere, whether it's in the network or in some host that's being overwhelmed.
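In case it helps, here's roughly what that check looks like. In tcpdump output, SACK blocks show up on the ACKs as "sack 1 {left:right}" options, and the window math is just the "win" value from a segment sent by the receiver, shifted left by the "wscale" negotiated in the SYN/SYN-ACK. The numbers below are placeholders, not values from the captures posted in this thread, so treat it as a sketch rather than a diagnosis:

# Placeholder values read off tcpdump lines; substitute the real ones.
win = 800        # "win 800" on a segment sent by the receiver (placeholder)
wscale = 7       # "wscale 7" from its SYN/SYN-ACK options (placeholder)
rtt = 0.1        # assumed 100 ms cross-DC round trip (placeholder)

effective_window = win << wscale              # receiver's real window in bytes
throughput_cap = effective_window / rtt       # window-limited throughput

print("effective window: %d bytes" % effective_window)          # 102400
print("throughput cap:   %.2f MB/s" % (throughput_cap / 1e6))   # ~1.02

If the cap you compute from the real capture lands near the ~1 MB/s the slow instances are doing, the receiver really is closing the window; if it comes out much higher, the window isn't the bottleneck and loss (SACKs, retransmissions) is the more likely suspect.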
I didn't have a chance to do the header math to see if TCP's advertising a small window in the lossy case you posted. But I figured I'd mention this just in case it's useful.

-Steve

> On Dec 6, 2017, at 5:27 PM, tao xiao <xiaotao...@gmail.com> wrote:
>
> MirrorMaker is placed close to the target, and the send/receive buffer size is set to 10MB, which is the result of the bandwidth-delay product. The OS-level TCP buffer max has also been increased to 16MB.
>
>> On Wed, 6 Dec 2017 at 15:19 Jan Filipiak <jan.filip...@trivago.com> wrote:
>>
>> Hi,
>>
>> Two questions. Is your MirrorMaker collocated with the source or the target?
>> What are the send and receive buffer sizes on the connections that span across the WAN?
>>
>> Hope we can get you some help.
>>
>> Best,
>> Jan
>>
>>> On 06.12.2017 14:36, Xu, Zhaohui wrote:
>>> Any update on this issue?
>>>
>>> We also ran into a similar situation recently. MirrorMaker is leveraged to replicate messages between clusters in different DCs, but sometimes a portion of the partitions show high consumer lag, and tcpdump shows a similar packet delivery pattern. The behavior is sort of weird and not self-explanatory. Wondering whether it has anything to do with the number of consumers being too large? In our case, we have around 100 consumer connections per broker.
>>>
>>> Regards,
>>> Jeff
>>>
>>> On 12/5/17, 10:14 AM, "tao xiao" <xiaotao...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> Any pointer will be highly appreciated.
>>>
>>>> On Thu, 30 Nov 2017 at 14:56 tao xiao <xiaotao...@gmail.com> wrote:
>>>>
>>>> Hi there,
>>>>
>>>> We are running into a weird situation when using MirrorMaker to replicate messages between Kafka clusters across datacenters, and I'm reaching out in case you have encountered this kind of problem before or have some insight into this kind of issue.
>>>>
>>>> Here is the scenario. We have set up a deployment where we run 30 MirrorMaker instances on 30 different nodes. Each MirrorMaker instance is configured with num.streams=1, so only one consumer runs. The topic to replicate is configured with 100 partitions, and data is almost evenly distributed across all partitions. After running for a period of time, a weird thing happened: some of the MirrorMaker instances seemed to slow down and consume at a relatively slow speed from the source Kafka cluster. The output of tcptrack shows the consume rate of problematic instances dropped to ~1MB/s, while the other healthy instances consume at a rate of ~3MB/s. As a result, the consumer lag for the corresponding partitions keeps going up.
>>>>
>>>> After triggering a tcpdump, we noticed the traffic pattern on the TCP connections of problematic MirrorMaker instances is very different from the others. Packets flowing on the problematic TCP connections are relatively small, and seq and ack packets basically come in one after another. On the other hand, packets on healthy TCP connections come in a different pattern: several seq packets arrive per ack packet. The screenshots below show the situation; both captures were taken on the same MirrorMaker node.
>>>> Problematic connection (10.kfk.kfk.kfk is the Kafka broker, 10.mm.mm.mm is the MirrorMaker node):
>>>> https://imgur.com/Z3odjjT
>>>>
>>>> Healthy connection:
>>>> https://imgur.com/w0A6qHT
>>>>
>>>> If we stop the problematic MirrorMaker instance and other instances take over the lagged partitions, they consume messages quickly and catch up on the lag soon. So the broker in the source Kafka cluster appears to be fine. But if MirrorMaker itself causes the issue, how can one TCP connection be good while others are problematic, given that the connections are all established in the same manner by the Kafka library?
>>>>
>>>> Consumer configuration for the MirrorMaker instances is as below.
>>>>
>>>> auto.offset.reset=earliest
>>>> partition.assignment.strategy=org.apache.kafka.clients.consumer.RoundRobinAssignor
>>>> heartbeat.interval.ms=10000
>>>> session.timeout.ms=120000
>>>> request.timeout.ms=150000
>>>> receive.buffer.bytes=1048576
>>>> max.partition.fetch.bytes=2097152
>>>> fetch.min.bytes=1048576
>>>>
>>>> The Kafka version is 0.10.0.0, and we run Kafka and MirrorMaker on Ubuntu 14.04.
>>>>
>>>> Any response is appreciated.
>>>>
>>>> Regards,
>>>> Tao
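One more thought on the buffer sizing quoted above: the 10MB figure is only right if the bandwidth and RTT that went into the bandwidth-delay product match the real path, and it's worth cross-checking it against the receive.buffer.bytes=1048576 (1 MiB) in the quoted consumer config, since whichever buffer actually ends up applied to the socket is what bounds how far the window can open. A rough sketch of that arithmetic, with made-up WAN numbers (the RTT and target rate below are assumptions, not measurements from this thread):

# Bandwidth-delay product with placeholder numbers; plug in measured values.
rtt = 0.1                          # assumed 100 ms cross-DC round trip
target_rate = 100 * 1024**2        # assumed 100 MB/s target per connection

bdp = target_rate * rtt
print("buffer needed: %.1f MB" % (bdp / 1024**2))        # 10.0

# Conversely, a fixed buffer caps throughput at roughly buffer / RTT.
for buf in (1048576, 10 * 1024**2):   # the 1 MiB consumer setting and the 10MB socket buffer
    print("%.0f MiB buffer -> at most %.1f MB/s" % (buf / 1024.0**2, buf / rtt / 1024**2))

At an assumed 100 ms RTT, even a 1 MiB window tops out around 10 MB/s, which is well above the ~1-3 MB/s rates in the captures, so if that assumption holds the buffer isn't the limiter and the window-shrinking/loss angle above is the one to chase. If the measured RTT is much larger, the math changes accordingly.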