For future reference: this bug no longer appears in 1.1.0.

On Fri, Dec 15, 2017 at 3:25 PM, Rob Verkuylen <r...@verkuylen.net> wrote:
> Hi,
>
> After upgrading to 1.0 we're getting strange producer/broker behaviour
> not experienced on <1.0.
>
> As a test we run a single-threaded producer just sending "TEST" against
> our cluster with the following producer settings, on a topic with
> replicas=3 and min.isr=2:
> linger.ms=10
> acks=all
> retries=1000
> batch=16k
> retry.backoff.ms=1000
>
> Using the callback on send we immediately see a huge lag in the number
> of acks coming back (600k+, while on 0.11 this hovers around 4k-20k
> max). At the same time we see a drop in the producer's send rate
> (msg/s); in about 90 seconds it drops to 0. After 10 minutes of silence
> all we see is a list of network exceptions like these on all partitions:
> "Got error produce response with correlation id X on topic-partition
> test-topic, retrying (999 attempts left). Error: NETWORK_EXCEPTION"
> Then sending briefly continues, but the same behaviour quickly returns.
>
> Now for the kicker: starting another thread after the first experiences
> this, producing on the same topic, same groupid, will 'release' the
> first thread and all acks are returned as normal and behaviour returns
> to normal. No issues are experienced when acks=1. Kafka logs show no
> issues at default log levels; we haven't had the opportunity to test
> further with more fine-grained log levels. The brokers run default
> settings, with perhaps the one special case that the inter-broker
> protocol is 1.0 while the client protocol is still set to 0.9.0. The
> testing above was done with clients ranging from 0.9 up to 1.0, all
> showing the same behaviour.
>
> Downgrading the entire cluster back to 0.11.0.2 with the same settings,
> same clients and same tests, all is well. Could this be a bug?
>
> Thanks,
> Rob
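
For anyone trying to reproduce this on an affected version, the producer settings quoted above could be collected into a plain java.util.Properties object roughly as follows. This is only a sketch: the class name ProducerSettingsSketch is made up for illustration, and it assumes that "batch=16k" in the report refers to the producer's batch.size setting in bytes (16384). The bootstrap servers and serializers are left out because the report does not mention them.

```java
import java.util.Properties;

// Hypothetical helper for illustration; not part of Kafka or of the original report.
public class ProducerSettingsSketch {

    // Builds the producer settings listed in the quoted message.
    public static Properties build() {
        Properties props = new Properties();
        props.setProperty("linger.ms", "10");
        props.setProperty("acks", "all");
        props.setProperty("retries", "1000");
        // Assumption: "batch=16k" means batch.size in bytes (16 * 1024).
        props.setProperty("batch.size", "16384");
        props.setProperty("retry.backoff.ms", "1000");
        return props;
    }

    public static void main(String[] args) {
        // Print the settings so they can be compared with the report.
        build().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

Before these properties can be handed to a real producer, the cluster address and key/value serializers still need to be added. The "min.isr=2" value from the report is not set here, since the minimum in-sync replica count is configured on the topic or broker side rather than on the producer.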