This is probably caused by a general issue in purgatory: the check on whether a request is satisfied and the registration of the watcher are not atomic. So, in this case, when a replica fetch request comes in, the byte check may not be satisfied yet, but before the fetch request is put into purgatory, a produce request sneaks in. With only a single producer client there is no later produce request to re-trigger the check, so the replica fetch request has to wait for the full timeout. When handling produce requests, we address this by checking the satisfied condition again after the watcher registration. We haven't done that for fetch requests yet.
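To make the race concrete, here is a minimal sketch of the check-then-watch gap and the double-check that closes it. The names (DelayedRequest, PurgatorySketch, checkAndWatch*) are illustrative only, not the actual kafka.server.RequestPurgatory API:

```scala
import java.util.concurrent.ConcurrentLinkedQueue

// Illustrative types; not the real purgatory classes.
trait DelayedRequest {
  def isSatisfied: Boolean   // e.g. enough bytes have accumulated for a fetch
  def complete(): Unit       // send the response
}

class PurgatorySketch {
  private val watchers = new ConcurrentLinkedQueue[DelayedRequest]()

  // Buggy path: check, then register. A produce request that lands in the
  // gap is never seen, so the fetch sits until its max-wait timeout fires.
  def checkAndWatchBuggy(req: DelayedRequest): Unit = {
    if (req.isSatisfied) req.complete()
    else watchers.add(req)            // race window is between these two lines
  }

  // Fixed path (what the produce side already does): re-check after the
  // watcher is registered, so data that sneaked in during the gap completes
  // the request immediately instead of after the timeout.
  def checkAndWatchFixed(req: DelayedRequest): Unit = {
    if (req.isSatisfied) { req.complete(); return }
    watchers.add(req)
    if (req.isSatisfied && watchers.remove(req))  // second check closes the window
      req.complete()
  }

  // Called when new data arrives, e.g. a produce request is appended to the log.
  def onNewData(): Unit = {
    val it = watchers.iterator()
    while (it.hasNext) {
      val req = it.next()
      if (req.isSatisfied) {
        it.remove()
        req.complete()
      }
    }
  }
}
```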
Thanks,

Jun

On Sat, Feb 8, 2014 at 1:22 PM, Jay Kreps <jay.kr...@gmail.com> wrote:

> Hey guys,
>
> I was running the end-to-end latency test (kafka.TestEndToEndLatency) and
> saw something a little weird. This test runs a producer and a consumer,
> sends a single message at a time, and measures the round-trip time from
> the producer's send to the consumer getting the message.
>
> With replication-factor=1 I see very consistent performance, with
> end-to-end latency at 0.4-0.5 ms, which is extremely good.
>
> But with replication-factor=2 I see something like this:
>
> count   latency
> 1000    1.9 ms
> 2000    1.8 ms
> 3000    1.4 ms
> 4000    1.7 ms
> 5000    102.6 ms
> 6000    101.4 ms
> 7000    102.4 ms
> 8000    1.6 ms
> 9000    101.5 ms
>
> This pattern is very reproducible: essentially every 4-5k messages things
> slow down to an average round trip of 100ms and then pick back up again.
>
> Note that this test is not using the new producer.
>
> Have we seen this before? The issue could be in the producer
> acknowledgement or in the high watermark advancement or fetch request, but
> I notice that the default fetch max wait is 100ms, which makes me think
> there is a bug in the async request handling that causes it to wait until
> the timeout. Any ideas? If not I'll file a bug...
>
> -Jay
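For context, a rough sketch of the kind of round-trip measurement the quoted test performs; send() and receiveNext() are hypothetical stand-ins for the real producer and consumer calls, and this only shows the timing structure, not the actual kafka.TestEndToEndLatency code:

```scala
object LatencyLoopSketch {
  // Sends one message at a time, waits for the consumer to see it, and
  // reports an average latency per 1000 messages (as in the table above).
  def run(n: Int, send: Array[Byte] => Unit, receiveNext: () => Array[Byte]): Unit = {
    val latenciesMs = new Array[Double](n)
    for (i <- 0 until n) {
      val start = System.nanoTime()
      send(new Array[Byte](1))                // produce one small message
      receiveNext()                           // block until the consumer sees it
      latenciesMs(i) = (System.nanoTime() - start) / 1e6
      if ((i + 1) % 1000 == 0) {
        val avg = latenciesMs.slice(i - 999, i + 1).sum / 1000
        println(f"${i + 1}%d\t$avg%.1f ms")
      }
    }
  }
}
```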