Hi Jun,

Answers inline:

On Thu, Dec 17, 2015 at 9:41 AM, Jun Rao <j...@confluent.io> wrote:
> Rajiv,
>
> Thanks for reporting this.
>
> 1. How did you verify that 3 of the topics are corrupted? Did you use
> the DumpLogSegments tool? Also, is there a simple way to reproduce the
> corruption?

No, I did not. The only reason I had to believe that was that no writers
could write to the topic. I actually have no idea what the problem was. I
saw very frequent (much more frequent than usual) messages of the form:

INFO [kafka-request-handler-2 ] [kafka.server.KafkaApis ]: [KafkaApi-6]
Close connection due to error handling produce request with correlation
id 294218 from client id with ack=0

and also messages of the form:

INFO [kafka-network-thread-9092-0 ] [kafka.network.Processor ]: Closing
socket connection to /some ip

The cluster was a critical one, so I had no recourse but to revert the
change (which, as noted, didn't fix things). I didn't have enough time to
debug further. The only way I could fix it, with my limited Kafka
knowledge, was (after reverting) to delete the topic and recreate it.

I had upgraded a lower-priority cluster before this one, and that worked
just fine. That gave me the confidence to upgrade this higher-priority
cluster, which did NOT work out. So the only way for me to try to
reproduce the problem is to attempt the upgrade on our larger clusters
again, but it is critical that we don't mess up this high-priority
cluster, so I am afraid to try again.

> 2. As Lance mentioned, if you are using snappy, make sure that you
> include the right snappy jar (1.1.1.7).

I wonder why I don't see Lance's email in this thread. Either way, we are
not using compression of any kind on this topic.

> 3. For the CPU issue, could you do a bit of profiling to see which
> thread is busy and where it's spending time?

Since I had to revert, I didn't have time to profile. Intuitively, it
seems like the high number of client disconnects/errors and the increased
network usage probably have something to do with the high CPU (a total
guess). Again, our other (lower-traffic) cluster that was upgraded was
totally fine, so this doesn't seem to happen every time.
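(For the archives: if I do get another window to try the upgrade, this is
roughly what I'd run to find the hot thread. A sketch from memory, not
something I captured during the incident; the pgrep pattern assumes a
single broker JVM per host.)

    # Per-thread CPU for the broker JVM (TIDs are shown in decimal):
    BROKER_PID=$(pgrep -f kafka.Kafka)
    top -H -b -n 1 -p "$BROKER_PID" | head -20

    # jstack labels threads by hex nid, so convert the busy TID first:
    printf '%x\n' 12345                  # 12345 stands in for the busy TID
    jstack "$BROKER_PID" | grep -A 20 'nid=0x3039'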
> Jun
>
> On Tue, Dec 15, 2015 at 12:52 PM, Rajiv Kurian <ra...@signalfx.com> wrote:
>
> > We had to revert to 0.8.3 because three of our topics seem to have
> > gotten corrupted during the upgrade. As soon as we did the upgrade,
> > producers to the three topics I mentioned stopped being able to do
> > writes. The clients complained (occasionally) about leader-not-found
> > exceptions. We restarted our clients and brokers, but that didn't seem
> > to help. Even after reverting to 0.8.3, these three topics were still
> > broken. To fix them, we had to stop all clients, delete the topics,
> > create them again, and then restart the clients.
> >
> > I realize this is not a lot of info. I couldn't wait to gather more
> > debug info because the cluster was actively being used. Has anyone run
> > into something like this? Are there any known issues with old
> > consumers/producers? The topics that got busted had clients writing to
> > them using the old Java wrapper over the Scala producer.
> >
> > Here are the steps I took to upgrade. For each broker:
> >
> > 1. Stop the broker.
> > 2. Restart with the 0.9 broker running with
> >    inter.broker.protocol.version=0.8.2.X.
> > 3. Wait for under-replicated partitions to go down to 0.
> > 4. Go to step 1 for the next broker.
> >
> > Once all the brokers were running the 0.9 code with
> > inter.broker.protocol.version=0.8.2.X, we restarted them one by one
> > with inter.broker.protocol.version=0.9.0.0.
> >
> > When reverting, I did the following. For each broker:
> >
> > 1. Stop the broker.
> > 2. Restart with the 0.9 broker running with
> >    inter.broker.protocol.version=0.8.2.X.
> > 3. Wait for under-replicated partitions to go down to 0.
> > 4. Go to step 1 for the next broker.
> >
> > Once all the brokers were running the 0.9 code with
> > inter.broker.protocol.version=0.8.2.X, I restarted them one by one with
> > the 0.8.2.3 broker code. This, however, as I mentioned, did not fix the
> > three broken topics.
> >
> > On Mon, Dec 14, 2015 at 3:13 PM, Rajiv Kurian <ra...@signalfx.com>
> > wrote:
> >
> > > Now that it has been a bit longer, the spikes I was seeing are gone,
> > > but the CPU and network in/out on the three brokers that were showing
> > > the spikes are still much higher than before the upgrade. Their CPUs
> > > have gone from around 1-2% to 12-20%. Network in on the same brokers
> > > has gone up from under 2 Mb/sec to 19-33 Mb/sec, and network out from
> > > under 2 Mb/sec to 29-42 Mb/sec. I don't see a corresponding increase
> > > in the Kafka messages-in-per-second or bytes-in-per-second JMX
> > > metrics.
> > >
> > > Thanks,
> > > Rajiv
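P.S. For anyone who hits the same thing: the delete/recreate that
unwedged each topic was roughly the below. This is a reconstruction
rather than a paste from my shell history; the topic name, ZooKeeper
address, and partition/replication counts are placeholders, and
delete.topic.enable=true must be set on the brokers for the delete to
take effect.

    bin/kafka-topics.sh --zookeeper zk1:2181 --delete --topic <broken-topic>
    # ...once the delete completes, recreate with the original layout:
    bin/kafka-topics.sh --zookeeper zk1:2181 --create --topic <broken-topic> \
        --partitions <N> --replication-factor <R>

We stopped all clients before the delete and restarted them only after
the create, as described above.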