Re: Fallout from upgrading to kafka 0.9 from 0.8.2.3

Rajiv Kurian Thu, 17 Dec 2015 22:36:19 -0800

I was mistaken about the version. We were actually using 0.8.2.1 before
upgrading to 0.9.


On Thu, Dec 17, 2015 at 6:13 PM, Dana Powers <dana.pow...@gmail.com> wrote:

> I don't have much to add on this, but q: what is version 0.8.2.3? I thought
> the latest in 0.8 series was 0.8.2.2?
>
> -Dana
> On Dec 17, 2015 5:56 PM, "Rajiv Kurian" <ra...@signalfx.com> wrote:
>
> > Yes we are in the process of upgrading to the new producers. But the
> > problem seems deeper than a compatibility issue. We have one environment
> > where the old producers work with the new 0.9 broker. Further when we
> > reverted our messed up 0.9 environment to 0.8.2.3 the problem with those
> > topics didn't go away.
> >
> > Didn't see any ZK issues on the brokers. There were other topics on the
> > very same brokers that didn't seem to be affected.
> >
> > On Thu, Dec 17, 2015 at 5:46 PM, Jun Rao <j...@confluent.io> wrote:
> >
> > > Yes, the new java producer is available in 0.8.2.x and we recommend
> > > people use that.
> > >
> > > Also, when those producers had the issue, were there any other things
> > > weird in the broker (e.g., broker's ZK session expires)?
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Thu, Dec 17, 2015 at 2:37 PM, Rajiv Kurian <ra...@signalfx.com> wrote:
> > >
> > > > I can't think of anything special about the topics besides the clients
> > > > being very old (Java wrappers over Scala).
> > > >
> > > > I do think it was using ack=0. But my guess is that the logging was
> > > > done by the Kafka producer thread. My application itself was not
> > > > getting exceptions from Kafka.
> > > >
> > > > On Thu, Dec 17, 2015 at 2:31 PM, Jun Rao <j...@confluent.io> wrote:
> > > >
> > > > > Hmm, anything special with those 3 topics? Also, the broker log shows
> > > > > that the producer uses ack=0, which means the producer shouldn't get
> > > > > errors like leader not found. Could you clarify on the ack used by the
> > > > > producer?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > > On Thu, Dec 17, 2015 at 12:41 PM, Rajiv Kurian <ra...@signalfx.com> wrote:
> > > > >
> > > > > > The topic which stopped working had clients that were only using
> > > > > > the old Java producer that is a wrapper over the Scala producer.
> > > > > > Again it seemed to work perfectly in another of our realms where
> > > > > > we have the same topics, same producers/consumers etc but with
> > > > > > less traffic.
> > > > > >
> > > > > > On Thu, Dec 17, 2015 at 12:23 PM, Jun Rao <j...@confluent.io> wrote:
> > > > > >
> > > > > > > Are you using the new java producer?
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Jun
> > > > > > >
> > > > > > > On Thu, Dec 17, 2015 at 9:58 AM, Rajiv Kurian <ra...@signalfx.com> wrote:
> > > > > > >
> > > > > > > > Hi Jun,
> > > > > > > > Answers inline:
> > > > > > > >
> > > > > > > > On Thu, Dec 17, 2015 at 9:41 AM, Jun Rao <j...@confluent.io> wrote:
> > > > > > > >
> > > > > > > > > Rajiv,
> > > > > > > > >
> > > > > > > > > Thanks for reporting this.
> > > > > > > > >
> > > > > > > > > 1. How did you verify that 3 of the topics are corrupted? Did
> > > > > > > > > you use DumpLogSegments tool? Also, is there a simple way to
> > > > > > > > > reproduce the corruption?
> > > > > > > > >
> > > > > > > > No I did not. The only reason I had to believe that was no
> > > > > > > > writers could write to the topic. I have actually no idea what
> > > > > > > > the problem was. I saw very frequent (much more than usual)
> > > > > > > > messages of the form:
> > > > > > > > INFO  [kafka-request-handler-2] [kafka.server.KafkaApis]:
> > > > > > > >   [KafkaApi-6] Close connection due to error handling produce
> > > > > > > >   request with correlation id 294218 from client id  with ack=0
> > > > > > > > and also message of the form:
> > > > > > > > INFO  [kafka-network-thread-9092-0] [kafka.network.Processor]:
> > > > > > > >   Closing socket connection to /some ip
> > > > > > > > The cluster was actually a critical one so I had no recourse but
> > > > > > > > to revert the change (which, as noted, didn't fix things). I
> > > > > > > > didn't have enough time to debug further. The only way I could
> > > > > > > > fix it with my limited Kafka knowledge was (after reverting)
> > > > > > > > deleting the topic and recreating it.
> > > > > > > > I had updated a low priority cluster before, and that worked
> > > > > > > > just fine. That gave me the confidence to upgrade this higher
> > > > > > > > priority cluster, which did NOT work out. So the only way for me
> > > > > > > > to try to reproduce it is to try this on our larger clusters
> > > > > > > > again. But it is critical that we don't mess up this high
> > > > > > > > priority cluster so I am afraid to try again.
> > > > > > > >
> > > > > > > > > 2. As Lance mentioned, if you are using snappy, make sure that
> > > > > > > > > you include the right snappy jar (1.1.1.7).
> > > > > > > > >
> > > > > > > > Wonder why I don't see Lance's email in this thread. Either way
> > > > > > > > we are not using compression of any kind on this topic.
> > > > > > > >
> > > > > > > > > 3. For the CPU issue, could you do a bit of profiling to see
> > > > > > > > > which thread is busy and where it's spending time?
> > > > > > > > >
> > > > > > > > Since I had to revert I didn't have the time to profile.
> > > > > > > > Intuitively it would seem like the high number of client
> > > > > > > > disconnects/errors and the increased network usage probably
> > > > > > > > have something to do with the high CPU (total guess). Again our
> > > > > > > > other (lower traffic) cluster that was upgraded was totally fine
> > > > > > > > so it doesn't seem like it happens all the time.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Jun
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Tue, Dec 15, 2015 at 12:52 PM, Rajiv Kurian <ra...@signalfx.com> wrote:
> > > > > > > > >
> > > > > > > > > > We had to revert to 0.8.3 because three of our topics seem to
> > > > > > > > > > have gotten corrupted during the upgrade. As soon as we did
> > > > > > > > > > the upgrade, producers to the three topics I mentioned stopped
> > > > > > > > > > being able to do writes. The clients complained (occasionally)
> > > > > > > > > > about leader not found exceptions. We restarted our clients
> > > > > > > > > > and brokers but that didn't seem to help. Actually even after
> > > > > > > > > > reverting to 0.8.3 these three topics were broken. To fix it
> > > > > > > > > > we had to stop all clients, delete the topics, create them
> > > > > > > > > > again and then restart the clients.
> > > > > > > > > >
> > > > > > > > > > I realize this is not a lot of info. I couldn't wait to get
> > > > > > > > > > more debug info because the cluster was actually being used.
> > > > > > > > > > Has anyone run into something like this? Are there any known
> > > > > > > > > > issues with old consumers/producers? The topics that got
> > > > > > > > > > busted had clients writing to them using the old Java wrapper
> > > > > > > > > > over the Scala producer.
> > > > > > > > > >
> > > > > > > > > > Here are the steps I took to upgrade.
> > > > > > > > > >
> > > > > > > > > > For each broker:
> > > > > > > > > >
> > > > > > > > > > 1. Stop the broker.
> > > > > > > > > > 2. Restart with the 0.9 broker running with
> > > > > > > > > > inter.broker.protocol.version=0.8.2.X
> > > > > > > > > > 3. Wait for under replicated partitions to go down to 0.
> > > > > > > > > > 4. Go to step 1.
> > > > > > > > > >
> > > > > > > > > > Once all the brokers were running the 0.9 code with
> > > > > > > > > > inter.broker.protocol.version=0.8.2.X we restarted them one by
> > > > > > > > > > one with inter.broker.protocol.version=0.9.0.0
> > > > > > > > > >
> > > > > > > > > > When reverting I did the following.
> > > > > > > > > >
> > > > > > > > > > For each broker:
> > > > > > > > > >
> > > > > > > > > > 1. Stop the broker.
> > > > > > > > > > 2. Restart with the 0.9 broker running with
> > > > > > > > > > inter.broker.protocol.version=0.8.2.X
> > > > > > > > > > 3. Wait for under replicated partitions to go down to 0.
> > > > > > > > > > 4. Go to step 1.
> > > > > > > > > >
> > > > > > > > > > Once all the brokers were running 0.9 code with
> > > > > > > > > > inter.broker.protocol.version=0.8.2.X I restarted them one by
> > > > > > > > > > one with the 0.8.2.3 broker code. This however, like I
> > > > > > > > > > mentioned, did not fix the three broken topics.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Mon, Dec 14, 2015 at 3:13 PM, Rajiv Kurian <ra...@signalfx.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Now that it has been a bit longer, the spikes I was seeing
> > > > > > > > > > > are gone but the CPU and network in/out on the three brokers
> > > > > > > > > > > that were showing the spikes are still much higher than
> > > > > > > > > > > before the upgrade. Their CPUs have increased from around
> > > > > > > > > > > 1-2% to 12-20%. The network in on the same brokers has gone
> > > > > > > > > > > up from under 2 Mb/sec to 19-33 Mb/sec. The network out has
> > > > > > > > > > > gone up from under 2 Mb/sec to 29-42 Mb/sec. I don't see a
> > > > > > > > > > > corresponding increase in kafka messages in per second or
> > > > > > > > > > > kafka bytes in per second JMX metrics.
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Rajiv
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
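
For anyone following the rolling upgrade steps quoted above (restart each broker on the 0.9 code with inter.broker.protocol.version=0.8.2.X, wait for under replicated partitions to reach 0, then do a second pass with 0.9.0.0), the loop could be sketched roughly as below. This is only an illustration under assumptions, not what was actually run on the clusters in this thread. restart_broker and under_replicated_count are hypothetical callables the operator would have to supply (for example wrapping an init script and a JMX query); neither is a Kafka API, and the timeout values are made up.

```python
import time
from typing import Callable, Iterable


def rolling_restart(
    broker_ids: Iterable[int],
    restart_broker: Callable[[int, str], None],   # hypothetical: stop + start one broker
    under_replicated_count: Callable[[], int],    # hypothetical: e.g. read from JMX
    protocol_version: str,
    poll_seconds: float = 5.0,                    # assumed value, tune as needed
    timeout_seconds: float = 600.0,               # assumed value, tune as needed
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Restart brokers one at a time, waiting for replication to catch up."""
    for broker_id in broker_ids:
        # Steps 1-2: stop the broker and bring it back with the given
        # inter.broker.protocol.version setting.
        restart_broker(broker_id, protocol_version)

        # Step 3: wait until no partition is under-replicated before
        # touching the next broker (step 4).
        waited = 0.0
        while under_replicated_count() > 0:
            if waited >= timeout_seconds:
                raise TimeoutError(
                    f"broker {broker_id}: partitions still under-replicated "
                    f"after {timeout_seconds}s"
                )
            sleep(poll_seconds)
            waited += poll_seconds


def rolling_upgrade(
    broker_ids: Iterable[int],
    restart_broker: Callable[[int, str], None],
    under_replicated_count: Callable[[], int],
    **wait_options,
) -> None:
    """Two passes: new code with the old protocol, then bump the protocol."""
    ids = list(broker_ids)
    # Pass 1: every broker runs the 0.9 code but still speaks the old protocol.
    rolling_restart(ids, restart_broker, under_replicated_count, "0.8.2.X", **wait_options)
    # Pass 2: only once pass 1 is complete, switch the protocol version.
    rolling_restart(ids, restart_broker, under_replicated_count, "0.9.0.0", **wait_options)
```

Raising on timeout rather than moving on is a deliberate choice in this sketch: if replication never catches up after a restart, continuing to the next broker would only take more replicas offline.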
