The topic that stopped working had clients using only the old Java producer,
which is a wrapper over the Scala producer. Again, it seemed to work perfectly
in another of our realms where we have the same topics, same
producers/consumers, etc., but with less traffic.
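
For reference, here is a minimal sketch of roughly what those clients do. The
topic, broker list and settings are made up; the point is just that this is the
old kafka.javaapi.producer.Producer wrapper (with ack=0, matching the broker
log quoted below), not the new org.apache.kafka.clients.producer.KafkaProducer:

    import java.util.Properties;

    import kafka.javaapi.producer.Producer;
    import kafka.producer.KeyedMessage;
    import kafka.producer.ProducerConfig;

    public class OldProducerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            // Made-up broker list; ours is different.
            props.put("metadata.broker.list", "broker1:9092,broker2:9092");
            props.put("serializer.class", "kafka.serializer.DefaultEncoder");
            // Fire and forget, matching the "ack=0" in the broker log below.
            props.put("request.required.acks", "0");

            Producer<byte[], byte[]> producer =
                    new Producer<byte[], byte[]>(new ProducerConfig(props));
            producer.send(new KeyedMessage<byte[], byte[]>(
                    "some-topic", "key".getBytes(), "value".getBytes()));
            producer.close();
        }
    }

As far as I understand, with request.required.acks=0 the producer never waits
for a response, so broker-side errors only show up as those closed connections
in the broker log.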

On Thu, Dec 17, 2015 at 12:23 PM, Jun Rao <j...@confluent.io> wrote:

> Are you using the new java producer?
>
> Thanks,
>
> Jun
>
> On Thu, Dec 17, 2015 at 9:58 AM, Rajiv Kurian <ra...@signalfx.com> wrote:
>
> > Hi Jun,
> > Answers inline:
> >
> > On Thu, Dec 17, 2015 at 9:41 AM, Jun Rao <j...@confluent.io> wrote:
> >
> > > Rajiv,
> > >
> > > Thanks for reporting this.
> > >
> > > 1. How did you verify that 3 of the topics are corrupted? Did you use
> > > the DumpLogSegments tool? Also, is there a simple way to reproduce the
> > > corruption?
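> > > (For example, something like: bin/kafka-run-class.sh
> > > kafka.tools.DumpLogSegments --files <segment>.log --print-data-log.
> > > Flags are from memory, so double-check them against your version.)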
> > >
> > No, I did not. The only reason I had to believe that was that no
> > producers could write to the topic. I actually have no idea what the
> > problem was. I saw very frequent (much more than usual) messages of the
> > form:
> > INFO  [kafka-request-handler-2] [kafka.server.KafkaApis]: [KafkaApi-6]
> > Close connection due to error handling produce request with correlation
> > id 294218 from client id  with ack=0
> > and also messages of the form:
> > INFO  [kafka-network-thread-9092-0] [kafka.network.Processor]: Closing
> > socket connection to /some ip
> > The cluster was actually a critical one, so I had no recourse but to
> > revert the change (which, as noted, didn't fix things). I didn't have
> > enough time to debug further. The only way I could fix it, with my
> > limited Kafka knowledge, was (after reverting) to delete the topic and
> > recreate it.
> > I had updated a lower-priority cluster before this, and that worked just
> > fine. That gave me the confidence to upgrade this higher-priority
> > cluster, which did NOT work out. So the only way for me to try to
> > reproduce it is to try this on our larger clusters again. But it is
> > critical that we don't mess up this high-priority cluster, so I am
> > afraid to try again.
> >
> > > 2. As Lance mentioned, if you are using snappy, make sure that you
> > > include the right snappy jar (1.1.1.7).
> > >
> > I wonder why I don't see Lance's email in this thread. Either way, we
> > are not using compression of any kind on this topic.
> >
> > > 3. For the CPU issue, could you do a bit of profiling to see which
> > > thread is busy and where it's spending time?
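> > >
> > > For example, a rough sketch along these lines, pointed at the broker's
> > > JMX port (the host/port below are made up), would show which threads
> > > are burning CPU; a few rounds of jstack against the broker PID would
> > > work as well:
> > >
> > > import java.lang.management.ManagementFactory;
> > > import java.lang.management.ThreadInfo;
> > > import java.lang.management.ThreadMXBean;
> > > import javax.management.MBeanServerConnection;
> > > import javax.management.remote.JMXConnector;
> > > import javax.management.remote.JMXConnectorFactory;
> > > import javax.management.remote.JMXServiceURL;
> > >
> > > public class BusyThreads {
> > >     public static void main(String[] args) throws Exception {
> > >         // Made-up host/port; point this at the broker's JMX port.
> > >         JMXServiceURL url = new JMXServiceURL(
> > >                 "service:jmx:rmi:///jndi/rmi://broker1:9999/jmxrmi");
> > >         JMXConnector jmxc = JMXConnectorFactory.connect(url);
> > >         MBeanServerConnection conn = jmxc.getMBeanServerConnection();
> > >         ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
> > >                 conn, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);
> > >
> > >         // Sample per-thread CPU time twice, 10 seconds apart, and print
> > >         // the threads that used the most CPU in between.
> > >         long[] ids = threads.getAllThreadIds();
> > >         long[] before = new long[ids.length];
> > >         for (int i = 0; i < ids.length; i++) {
> > >             before[i] = threads.getThreadCpuTime(ids[i]);
> > >         }
> > >         Thread.sleep(10000);
> > >         for (int i = 0; i < ids.length; i++) {
> > >             long after = threads.getThreadCpuTime(ids[i]);
> > >             ThreadInfo info = threads.getThreadInfo(ids[i]);
> > >             if (before[i] < 0 || after < 0 || info == null) {
> > >                 continue; // thread died or CPU time not supported
> > >             }
> > >             long deltaMs = (after - before[i]) / 1000000;
> > >             if (deltaMs > 1000) { // over 1s of CPU in a 10s window
> > >                 System.out.println(info.getThreadName() + ": " + deltaMs + " ms CPU");
> > >             }
> > >         }
> > >         jmxc.close();
> > >     }
> > > }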
> > >
> > Since I had to revert, I didn't have the time to profile. Intuitively
> > it would seem like the high number of client disconnects/errors and the
> > increased network usage probably have something to do with the high CPU
> > (a total guess). Again, our other (lower-traffic) cluster that was
> > upgraded was totally fine, so it doesn't seem like this happens every
> > time.
> >
> > >
> > > Jun
> > >
> > >
> > > On Tue, Dec 15, 2015 at 12:52 PM, Rajiv Kurian <ra...@signalfx.com> wrote:
> > >
> > > > We had to revert to 0.8.3 because three of our topics seemed to have
> > > > gotten corrupted during the upgrade. As soon as we did the upgrade,
> > > > producers to the three topics I mentioned stopped being able to write.
> > > > The clients complained (occasionally) about leader-not-found
> > > > exceptions. We restarted our clients and brokers but that didn't seem
> > > > to help. Actually, even after reverting to 0.8.3 these three topics
> > > > were still broken. To fix it we had to stop all clients, delete the
> > > > topics, create them again, and then restart the clients.
> > > >
> > > > I realize this is not a lot of info. I couldn't wait to get more
> > > > debug info because the cluster was actually being used. Has anyone run
> > > > into something like this? Are there any known issues with the old
> > > > consumers/producers? The topics that got busted had clients writing to
> > > > them using the old Java wrapper over the Scala producer.
> > > >
> > > > Here are the steps I took to upgrade.
> > > >
> > > > For each broker:
> > > >
> > > > 1. Stop the broker.
> > > > 2. Restart with the 0.9 broker running with
> > > > inter.broker.protocol.version=0.8.2.X
> > > > 3. Wait for under-replicated partitions to go down to 0.
> > > > 4. Go to step 1.
> > > > Once all the brokers were running the 0.9 code with
> > > > inter.broker.protocol.version=0.8.2.X, we restarted them one by one
> > > > with inter.broker.protocol.version=0.9.0.0.
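> > > >
> > > > For reference, the setting lives in each broker's server.properties;
> > > > "0.8.2.X" stands for the concrete 0.8.2 version the cluster was on
> > > > before:
> > > >
> > > >     # phase 1: 0.9 code, still speaking the old protocol
> > > >     inter.broker.protocol.version=0.8.2.X
> > > >     # phase 2: once every broker is running the 0.9 code
> > > >     inter.broker.protocol.version=0.9.0.0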
> > > >
> > > > When reverting I did the following.
> > > >
> > > > For each broker:
> > > >
> > > > 1. Stop the broker.
> > > > 2. Restart with the 0.9 broker running with
> > > > inter.broker.protocol.version=0.8.2.X
> > > > 3. Wait for under-replicated partitions to go down to 0.
> > > > 4. Go to step 1.
> > > >
> > > > Once all the brokers were running the 0.9 code with
> > > > inter.broker.protocol.version=0.8.2.X, I restarted them one by one
> > > > with the 0.8.2.3 broker code. This, however, as I mentioned, did not
> > > > fix the three broken topics.
> > > >
> > > >
> > > > On Mon, Dec 14, 2015 at 3:13 PM, Rajiv Kurian <ra...@signalfx.com> wrote:
> > > >
> > > > > Now that it has been a bit longer, the spikes I was seeing are
> > > > > gone, but the CPU and network in/out on the three brokers that were
> > > > > showing the spikes are still much higher than before the upgrade.
> > > > > Their CPU usage has increased from around 1-2% to 12-20%. The network
> > > > > in on the same brokers has gone up from under 2 Mb/sec to 19-33
> > > > > Mb/sec. The network out has gone up from under 2 Mb/sec to 29-42
> > > > > Mb/sec. I don't see a corresponding increase in the Kafka
> > > > > messages-in-per-second or bytes-in-per-second JMX metrics.
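> > > > >
> > > > > For reference, this is roughly how I am reading those broker-level
> > > > > rates over JMX; the host/port are made up and the MBean names are
> > > > > from memory:
> > > > >
> > > > > import javax.management.MBeanServerConnection;
> > > > > import javax.management.ObjectName;
> > > > > import javax.management.remote.JMXConnector;
> > > > > import javax.management.remote.JMXConnectorFactory;
> > > > > import javax.management.remote.JMXServiceURL;
> > > > >
> > > > > public class BrokerRates {
> > > > >     public static void main(String[] args) throws Exception {
> > > > >         // Made-up broker host and JMX port.
> > > > >         JMXServiceURL url = new JMXServiceURL(
> > > > >                 "service:jmx:rmi:///jndi/rmi://broker1:9999/jmxrmi");
> > > > >         JMXConnector jmxc = JMXConnectorFactory.connect(url);
> > > > >         MBeanServerConnection conn = jmxc.getMBeanServerConnection();
> > > > >         for (String name : new String[]{"MessagesInPerSec", "BytesInPerSec"}) {
> > > > >             ObjectName bean = new ObjectName(
> > > > >                     "kafka.server:type=BrokerTopicMetrics,name=" + name);
> > > > >             // One-minute rate as exposed by the broker's JMX reporter.
> > > > >             Object rate = conn.getAttribute(bean, "OneMinuteRate");
> > > > >             System.out.println(name + " (1-min rate): " + rate);
> > > > >         }
> > > > >         jmxc.close();
> > > > >     }
> > > > > }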
> > > > >
> > > > > Thanks,
> > > > > Rajiv
> > > > >
> > > >
> > >
> >
>
