Hi,

After upgrading to Kafka 1.0 we're seeing strange producer/broker behaviour
that we did not experience on versions before 1.0.

As a test we run a single-threaded producer that just sends "TEST" against
our cluster, on a topic with replication factor 3 and
min.insync.replicas=2, with the following producer settings:
linger.ms=10
acks=all
retries=1000
batch.size=16384
retry.backoff.ms=1000
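
For reference, the settings above could be expressed roughly as follows. This
is only a sketch using plain java.util.Properties: the class name, the
bootstrap address and the reading of "16k" as 16384 bytes are assumptions,
not taken from the actual test code.

```java
import java.util.Properties;

// Sketch of the producer settings from the list above.
class ProducerSettingsSketch {
    static Properties build(String bootstrapServers) {
        Properties props = new Properties();
        // Placeholder address; the real cluster address is not given in the post.
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty("acks", "all");
        props.setProperty("retries", "1000");
        props.setProperty("linger.ms", "10");
        // Assumes "16k" means 16384 bytes.
        props.setProperty("batch.size", "16384");
        props.setProperty("retry.backoff.ms", "1000");
        return props;
    }

    public static void main(String[] args) {
        // Print the resulting settings, one key=value pair per line.
        build("localhost:9092").forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

A real producer would also need key.serializer and value.serializer set
before these properties are handed to the client.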

Using the callback on send we immediately see a huge lag in the number of
outstanding acks (600k+), whereas on 0.11 this hovers around 4k-20k at most.
At the same time the producer's send rate drops, reaching 0 msg/s within
about 90 seconds. After 10 minutes of silence, all we see is a list of
network exceptions like this one on all partitions: "Got error produce
response with correlation id X on topic-partition test-topic, retrying (999
attempts left). Error: NETWORK_EXCEPTION". Sending then resumes briefly, but
quickly falls back into the same behaviour.
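
The ack lag described above can be measured with a small counter driven from
the send callback. The sketch below is one possible way to do it, not the
code used in the test; AckLagTracker and its method names are hypothetical,
and it uses only the JDK so it runs without a broker.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper: counts sends and completed callbacks so the number
// of outstanding (not yet acknowledged) records can be read at any time.
class AckLagTracker {
    private final AtomicLong sent = new AtomicLong();
    private final AtomicLong acked = new AtomicLong();
    private final AtomicLong failed = new AtomicLong();

    // Call once for every record handed to the producer.
    void recordSend() {
        sent.incrementAndGet();
    }

    // Call from the send callback; the exception is null on success.
    // With the Kafka client this could be wired as:
    //   producer.send(record, (metadata, e) -> tracker.recordCompletion(e));
    void recordCompletion(Exception exception) {
        if (exception == null) {
            acked.incrementAndGet();
        } else {
            failed.incrementAndGet();
        }
    }

    // Records sent whose callback has not fired yet.
    long outstanding() {
        return sent.get() - acked.get() - failed.get();
    }

    long failedCount() {
        return failed.get();
    }

    public static void main(String[] args) {
        AckLagTracker tracker = new AckLagTracker();
        for (int i = 0; i < 5; i++) {
            tracker.recordSend();
        }
        tracker.recordCompletion(null);                              // one success
        tracker.recordCompletion(new RuntimeException("simulated")); // one failure
        System.out.println("outstanding=" + tracker.outstanding());
    }
}
```

Logging outstanding() periodically would show whether the backlog keeps
growing, as in the 600k+ case, or stays bounded.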

Now for the kicker: starting another thread after the first one experiences
this, producing on the same topic with the same group id, will 'release' the
first thread; all acks are then returned and behaviour goes back to normal.
No issues are experienced when acks=1. The Kafka logs show no issues at
default log levels; we haven't had the opportunity to test further with more
fine-grained log levels. The brokers run default settings; the one thing
that may be special is that the inter-broker protocol is 1.0 while the
client protocol is still set to 0.9.0. The testing above was done with
clients ranging from 0.9 up to 1.0, all showing the same behaviour.

After downgrading the entire cluster back to 0.11.0.2, with the same
settings, the same clients and the same tests, all is well. Could this be a
bug?

Thanks,
  Rob
