This sounds to me like there's packet loss somewhere and TCP is shrinking its 
window to try to limit congestion.  But from the snippets you posted, I 
didn't see any SACKs in the tcpdump output.  If there *are* SACKs, that'd be a 
strong indicator of loss somewhere, whether it's in the network or in some 
host that's being overwhelmed.

I didn’t have a chance to do the header math to see if TCP’s advertising a 
small window in the lossy case you posted.  But I figured I’d mention this just 
in case it’s useful.
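
If you still have the capture around, something like the following would show 
both the SACK blocks and the advertised window on each segment (the pcap file 
name and the broker port 9092 are just placeholders for whatever you actually 
used):

    tcpdump -nn -r mirrormaker.pcap -vv 'tcp port 9092' | grep -E 'sack|win '

One caveat: the raw window field has to be multiplied by the window scale 
factor negotiated in the SYN/SYN-ACK, so the start of the connection needs to 
be in the capture to interpret it correctly.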

    -Steve

> On Dec 6, 2017, at 5:27 PM, tao xiao <xiaotao...@gmail.com> wrote:
> 
> MirrorMaker is placed close to the target, and the send/receive buffer size is
> set to 10MB, which is the result of the bandwidth-delay product. The OS-level
> TCP buffer maximum has also been increased to 16MB.
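> 
> For reference, the sizing came from the usual bandwidth-delay product
> calculation (the link speed and RTT below are illustrative placeholders, not
> our measured numbers):
> 
>     BDP = bandwidth x RTT = (800 Mbit/s / 8) x 0.1 s = 10 MB
> 
> and the OS maximums were raised roughly along these lines (again illustrative):
> 
>     sysctl -w net.core.rmem_max=16777216
>     sysctl -w net.core.wmem_max=16777216
>     sysctl -w net.ipv4.tcp_rmem='4096 87380 16777216'
>     sysctl -w net.ipv4.tcp_wmem='4096 65536 16777216'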
> 
>> On Wed, 6 Dec 2017 at 15:19 Jan Filipiak <jan.filip...@trivago.com> wrote:
>> 
>> Hi,
>> 
>> Two questions: is your MirrorMaker colocated with the source or the target?
>> What are the send and receive buffer sizes on the connections that span the
>> WAN?
>> 
>> Hope we can get you some help.
>> 
>> Best jan
>> 
>> 
>> 
>>> On 06.12.2017 14:36, Xu, Zhaohui wrote:
>>> Any update on this issue?
>>> 
>>> We also ran into a similar situation recently. MirrorMaker is used to
>>> replicate messages between clusters in different data centers, but sometimes
>>> a portion of the partitions show high consumer lag, and tcpdump shows a
>>> similar packet delivery pattern. The behavior is odd and not self-explanatory.
>>> Could it have anything to do with the number of consumers being too large?
>>> In our case, we have around 100 consumer connections per broker.
>>> 
>>> Regards,
>>> Jeff
>>> 
>>> On 12/5/17, 10:14 AM, "tao xiao" <xiaotao...@gmail.com> wrote:
>>> 
>>>     Hi,
>>> 
>>>     any pointers would be highly appreciated
>>> 
>>>>     On Thu, 30 Nov 2017 at 14:56 tao xiao <xiaotao...@gmail.com> wrote:
>>>> 
>>>> Hi There,
>>>> 
>>>> 
>>>> 
>>>> We are running into a weird situation when using MirrorMaker to replicate
>>>> messages between Kafka clusters across data centers, and we are reaching out
>>>> in case you have encountered this kind of problem before or have some insight
>>>> into this kind of issue.
>>>> 
>>>> 
>>>> 
>>>> Here is the scenario. We have set up a deployment where we run 30
>>>> MirrorMaker instances on 30 different nodes. Each MirrorMaker instance is
>>>> configured with num.streams=1, so only one consumer runs per instance. The
>>>> topics to replicate are configured with 100 partitions, and data is almost
>>>> evenly distributed across all partitions. After running for a period of time,
>>>> something weird happens: some of the MirrorMaker instances seem to slow down
>>>> and consume at a relatively slow speed from the source Kafka cluster. The
>>>> output of tcptrack shows that the consume rate of the problematic instances
>>>> dropped to ~1MB/s, while the other healthy instances consume at a rate of
>>>> ~3MB/s. As a result, the consumer lag for the corresponding partitions keeps
>>>> growing.
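>>>> 
>>>> For completeness, each instance is started roughly like this (the property
>>>> file names and the topic whitelist are placeholders):
>>>> 
>>>>     bin/kafka-mirror-maker.sh --consumer.config consumer.properties \
>>>>         --producer.config producer.properties --num.streams 1 \
>>>>         --whitelist "mirror-topics.*"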
>>>> 
>>>> 
>>>> 
>>>> 
>>>> After capturing a tcpdump, we noticed that the traffic pattern in the TCP
>>>> connections of the problematic MirrorMaker instances is very different from
>>>> the others. Packets flowing in the problematic TCP connections are relatively
>>>> small, and seq and ack packets basically come one after another. On the other
>>>> hand, the packets in the healthy TCP connections come in a different pattern:
>>>> basically several seq packets arrive per ack packet. The screenshots below
>>>> show the situation; both captures were taken on the same MirrorMaker node.
>>>> 
>>>> 
>>>> 
>>>> Problematic connection (10.kfk.kfk.kfk is the Kafka broker, 10.mm.mm.mm is
>>>> the MirrorMaker node):
>>>> 
>>>> 
>>>> https://imgur.com/Z3odjjT
>>>> 
>>>> 
>>>> Healthy connection:
>>>> 
>>>> 
>>>> https://imgur.com/w0A6qHT
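>>>> 
>>>> If anyone wants to look at the same thing on their side, a capture along
>>>> these lines shows the pattern (the interface name and the broker port 9092
>>>> are assumptions; adjust to your environment):
>>>> 
>>>>     tcpdump -i eth0 -nn -w mm-broker.pcap 'host 10.kfk.kfk.kfk and port 9092'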
>>>> 
>>>> 
>>>> If we stop a problematic MirrorMaker instance, the other instances take over
>>>> the lagged partitions, consume messages quickly, and catch up on the lag soon.
>>>> So the broker in the source Kafka cluster appears to be fine. But if
>>>> MirrorMaker itself causes the issue, how can one TCP connection be healthy
>>>> while others are problematic, given that all the connections are established
>>>> in the same manner by the Kafka client library?
>>>> 
>>>> 
>>>> 
>>>> The consumer configuration for the MirrorMaker instances is as follows:
>>>> 
>>>> auto.offset.reset=earliest
>>>> 
>>>> 
>>>> 
>>>> partition.assignment.strategy=org.apache.kafka.clients.consumer.RoundRobinAssignor
>>>> 
>>>> heartbeat.interval.ms=10000
>>>> 
>>>> session.timeout.ms=120000
>>>> 
>>>> request.timeout.ms=150000
>>>> 
>>>> receive.buffer.bytes=1048576
>>>> 
>>>> max.partition.fetch.bytes=2097152
>>>> 
>>>> fetch.min.bytes=1048576
>>>> 
>>>> 
>>>> 
>>>> The Kafka version is 0.10.0.0, and we run both Kafka and MirrorMaker on
>>>> Ubuntu 14.04.
>>>> 
>>>> 
>>>> 
>>>> Any response is appreciated.
>>>> 
>>>> Regards,
>>>> 
>>>> Tao
>>>> 
>>> 
>>> 
>> 
>> 
