Rajiv,

Upgrading from 0.8.2.1 to 0.9.0.0 should also be fine. If you can reproduce this issue in a test environment, that would be great. The following may be helpful in figuring out the issue.

1. Using ack=1 instead of ack=0 will allow the producer to see the error code when the send fails (see the sketch below).
2. If the producer sends oversized messages, the network-in rate will go up, but the message-in rate won't. You may want to check the JMX metric BytesRejectedPerSec.
3. If the CPU is high on the broker, do a bit of profiling and figure out where the CPU time is spent.
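For illustration, a minimal sketch of point 1 using the new Java producer: acks=1 plus a send callback so the client actually sees the error. The broker address and topic name are placeholders; on the old Scala-wrapper producer the equivalent setting is request.required.acks=1. The BytesRejectedPerSec counter from point 2 should show up under the broker's kafka.server:type=BrokerTopicMetrics MBeans (e.g. via jconsole).

import java.util.Properties;

import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class AckOneExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // placeholder broker address
        // acks=1: the partition leader must acknowledge the write, so failures
        // (e.g. leader not available) are reported back to the client. With
        // acks=0 the producer fires and forgets and never sees the error code.
        props.put("acks", "1");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<String, String>(props)) {
            producer.send(new ProducerRecord<String, String>("my-topic", "key", "value"), // placeholder topic
                    new Callback() {
                        @Override
                        public void onCompletion(RecordMetadata metadata, Exception exception) {
                            if (exception != null) {
                                // This is where errors such as leader-not-found surface.
                                exception.printStackTrace();
                            }
                        }
                    });
        }
    }
}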
Thanks,

Jun

On Thu, Dec 17, 2015 at 10:35 PM, Rajiv Kurian <ra...@signalfx.com> wrote:

> I was mistaken about the version. We were actually using 0.8.2.1 before upgrading to 0.9.
>
> On Thu, Dec 17, 2015 at 6:13 PM, Dana Powers <dana.pow...@gmail.com> wrote:
>
> > I don't have much to add on this, but q: what is version 0.8.2.3? I thought the latest in 0.8 series was 0.8.2.2?
> >
> > -Dana
> >
> > On Dec 17, 2015 5:56 PM, "Rajiv Kurian" <ra...@signalfx.com> wrote:
> >
> > > Yes we are in the process of upgrading to the new producers. But the problem seems deeper than a compatibility issue. We have one environment where the old producers work with the new 0.9 broker. Further when we reverted our messed up 0.9 environment to 0.8.2.3 the problem with those topics didn't go away.
> > >
> > > Didn't see any ZK issues on the brokers. There were other topics on the very same brokers that didn't seem to be affected.
> > >
> > > On Thu, Dec 17, 2015 at 5:46 PM, Jun Rao <j...@confluent.io> wrote:
> > >
> > > > Yes, the new java producer is available in 0.8.2.x and we recommend people use that.
> > > >
> > > > Also, when those producers had the issue, were there any other things weird in the broker (e.g., broker's ZK session expires)?
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Thu, Dec 17, 2015 at 2:37 PM, Rajiv Kurian <ra...@signalfx.com> wrote:
> > > >
> > > > > I can't think of anything special about the topics besides the clients being very old (Java wrappers over Scala).
> > > > >
> > > > > I do think it was using ack=0. But my guess is that the logging was done by the Kafka producer thread. My application itself was not getting exceptions from Kafka.
> > > > >
> > > > > On Thu, Dec 17, 2015 at 2:31 PM, Jun Rao <j...@confluent.io> wrote:
> > > > >
> > > > > > Hmm, anything special with those 3 topics? Also, the broker log shows that the producer uses ack=0, which means the producer shouldn't get errors like leader not found. Could you clarify on the ack used by the producer?
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > > On Thu, Dec 17, 2015 at 12:41 PM, Rajiv Kurian <ra...@signalfx.com> wrote:
> > > > > >
> > > > > > > The topic which stopped working had clients that were only using the old Java producer that is a wrapper over the Scala producer. Again it seemed to work perfectly in another of our realms where we have the same topics, same producers/consumers etc but with less traffic.
> > > > > > >
> > > > > > > On Thu, Dec 17, 2015 at 12:23 PM, Jun Rao <j...@confluent.io> wrote:
> > > > > > >
> > > > > > > > Are you using the new java producer?
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > > Jun
> > > > > > > >
> > > > > > > > On Thu, Dec 17, 2015 at 9:58 AM, Rajiv Kurian <ra...@signalfx.com> wrote:
> > > > > > > >
> > > > > > > > > Hi Jun,
> > > > > > > > > Answers inline:
> > > > > > > > >
> > > > > > > > > On Thu, Dec 17, 2015 at 9:41 AM, Jun Rao <j...@confluent.io> wrote:
> > > > > > > > >
> > > > > > > > > > Rajiv,
> > > > > > > > > >
> > > > > > > > > > Thanks for reporting this.
> > > > > > > > > >
> > > > > > > > > > 1. How did you verify that 3 of the topics are corrupted? Did you use DumpLogSegments tool? Also, is there a simple way to reproduce the corruption?
> > > > > > > > > >
> > > > > > > > > No I did not. The only reason I had to believe that was no writers could write to the topic. I have actually no idea what the problem was. I saw very frequent (much more than usual) messages of the form:
> > > > > > > > > INFO [kafka-request-handler-2 ] [kafka.server.KafkaApis ]: [KafkaApi-6] Close connection due to error handling produce request with correlation id 294218 from client id with ack=0
> > > > > > > > > and also message of the form:
> > > > > > > > > INFO [kafka-network-thread-9092-0 ] [kafka.network.Processor ]: Closing socket connection to /some ip
> > > > > > > > > The cluster was actually a critical one so I had no recourse but to revert the change (which like noted didn't fix things). I didn't have enough time to debug further. The only way I could fix it with my limited Kafka knowledge was (after reverting) deleting the topic and recreating it.
> > > > > > > > > I had updated a low priority cluster before that worked just fine. That gave me the confidence to upgrade this higher priority cluster which did NOT work out. So the only way for me to try to reproduce it is to try this on our larger clusters again. But it is critical that we don't mess up this high priority cluster so I am afraid to try again.
> > > > > > > > >
> > > > > > > > > > 2. As Lance mentioned, if you are using snappy, make sure that you include the right snappy jar (1.1.1.7).
> > > > > > > > > >
> > > > > > > > > Wonder why I don't see Lance's email in this thread. Either way we are not using compression of any kind on this topic.
> > > > > > > > >
> > > > > > > > > > 3. For the CPU issue, could you do a bit profiling to see which thread is busy and where it's spending time?
> > > > > > > > > >
> > > > > > > > > Since I had to revert I didn't have the time to profile.
> > > > > > > > > Intuitively it would seem like the high number of client disconnects/errors and the increased network usage probably has something to do with the high CPU (total guess). Again our other (lower traffic) cluster that was upgraded was totally fine so it doesn't seem like it happens all the time.
> > > > > > > > >
> > > > > > > > > > Jun
> > > > > > > > > >
> > > > > > > > > > On Tue, Dec 15, 2015 at 12:52 PM, Rajiv Kurian <ra...@signalfx.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > We had to revert to 0.8.3 because three of our topics seem to have gotten corrupted during the upgrade. As soon as we did the upgrade producers to the three topics I mentioned stopped being able to do writes. The clients complained (occasionally) about leader not found exceptions. We restarted our clients and brokers but that didn't seem to help. Actually even after reverting to 0.8.3 these three topics were broken. To fix it we had to stop all clients, delete the topics, create them again and then restart the clients.
> > > > > > > > > > >
> > > > > > > > > > > I realize this is not a lot of info. I couldn't wait to get more debug info because the cluster was actually being used. Has any one run into something like this? Are there any known issues with old consumers/producers. The topics that got busted had clients writing to them using the old Java wrapper over the Scala producer.
> > > > > > > > > > >
> > > > > > > > > > > Here are the steps I took to upgrade.
> > > > > > > > > > >
> > > > > > > > > > > For each broker:
> > > > > > > > > > >
> > > > > > > > > > > 1. Stop the broker.
> > > > > > > > > > > 2. Restart with the 0.9 broker running with inter.broker.protocol.version=0.8.2.X
> > > > > > > > > > > 3. Wait for under replicated partitions to go down to 0.
> > > > > > > > > > > 4. Go to step 1.
> > > > > > > > > > >
> > > > > > > > > > > Once all the brokers were running the 0.9 code with inter.broker.protocol.version=0.8.2.X we restarted them one by one with inter.broker.protocol.version=0.9.0.0
> > > > > > > > > > >
> > > > > > > > > > > When reverting I did the following.
> > > > > > > > > > >
> > > > > > > > > > > For each broker.
> > > > > > > > > > >
> > > > > > > > > > > 1. Stop the broker.
> > > > > > > > > > > 2. Restart with the 0.9 broker running with inter.broker.protocol.version=0.8.2.X
> > > > > > > > > > > 3. Wait for under replicated partitions to go down to 0.
> > > > > > > > > > > 4. Go to step 1.
> > > > > > > > > > >
> > > > > > > > > > > Once all the brokers were running 0.9 code with inter.broker.protocol.version=0.8.2.X I restarted them one by one with the 0.8.2.3 broker code. This however like I mentioned did not fix the three broken topics.
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Dec 14, 2015 at 3:13 PM, Rajiv Kurian <ra...@signalfx.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Now that it has been a bit longer, the spikes I was seeing are gone but the CPU and network in/out on the three brokers that were showing the spikes are still much higher than before the upgrade. Their CPUs have increased from around 1-2% to 12-20%. The network in on the same brokers has gone up from under 2 Mb/sec to 19-33 Mb/sec. The network out has gone up from under 2 Mb/sec to 29-42 Mb/sec. I don't see a corresponding increase in kafka messages in per second or kafka bytes in per second JMX metrics.
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Rajiv
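For reference, the rolling upgrade and revert described in the quoted steps come down to one broker setting per phase. A minimal server.properties sketch, where 0.8.2.X stands for whichever 0.8.2 patch release the cluster was on (left unresolved here, as in the steps above):

# Phase 1: run the 0.9.0.0 broker binaries but keep speaking the old wire
# protocol; rolling-restart one broker at a time, waiting for under-replicated
# partitions to drop back to 0 before moving to the next broker.
inter.broker.protocol.version=0.8.2.X

# Phase 2: once every broker is on 0.9 code, switch the protocol version and
# perform a second rolling restart.
inter.broker.protocol.version=0.9.0.0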